[Zlib-devel] infnew-5 available for testing
Chris Anderson
christop at fellspt.charm.net
Thu Jan 9 19:39:00 EST 2003
On Thu, 2 Jan 2003, Chris Anderson wrote:
>
> Another observation is that the infnew code is faster when the profiler
> feedback optimization (-prof_use) is thrown at it. The original
> zlib-1.1.4 doesn't get as much speedup with this optimization, but is
> pretty fast without it anyway.
>
In a fit of sleeplessness, I generated some asm files with different
compiler optimizations and found that icc generates a faster inflate()
when loop unrolling is disabled (-unroll0). One of the differences
between icc -O3 and icc -O3 -prof_use is that icc -O3 unrolls this loop
and icc -O3 -prof_use does not (infnew-5/inffast.c):
    229     from = out - dist;
    230     do {
    231         PUP(out) = PUP(from);
    232         PUP(out) = PUP(from);
    233         PUP(out) = PUP(from);
    234         len -= 3;
    235     } while (len > 2);
The asm for this loop without unrolling looks like this (the tail of the
unrolled version looks the same):
.B1.41:                         # Preds .B1.40 .B1.41 # Infreq
        lea     1(%eax), %edx           #231.36
        movb    1(%eax), %bl            #231.36
        lea     1(%ecx), %ebp           #231.25
        movb    %bl, 1(%ecx)            #231.25
        movb    2(%eax), %bl            #232.36
        addl    $3, %eax                #233.36
        movb    %bl, 2(%ecx)            #232.25
        movb    2(%edx), %bl            #233.36
        addl    $3, %ecx                #233.25
        movb    %bl, 2(%ebp)            #233.25
        movl    48(%esp), %ebx          #234.25
        addl    $-3, %ebx               #234.25
        movl    %ebx, 48(%esp)          #234.25
        cmpl    $2, %ebx                #235.30
        ja      .B1.41          # Prob 66%      #235.30
icc -O3 -c inffast.c
zbuflen 16384, clock 12.210, time 12.434
zbuflen 16384, clock 12.210, time 12.411
icc -unroll0 -O3 -c inffast.c
zbuflen 16384, clock 12.110, time 12.306
zbuflen 16384, clock 12.110, time 12.328
Defining POSTINC does save two instructions in the above loop, but two
more instructions are added to the loop prefix, and the unrolled build
didn't show a consistent advantage the way the -unroll0 build did.
icc -unroll0 -O3 -DPOSTINC -c inffast.c
zbuflen 16384, clock 11.980, time 12.181
zbuflen 16384, clock 11.980, time 12.183
icc -O3 -DPOSTINC -c inffast.c
zbuflen 16384, clock 12.260, time 12.467
zbuflen 16384, clock 12.050, time 12.270
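For anyone not familiar with what POSTINC changes, here is a minimal sketch of
the two copy styles. It assumes PUP expands to *++(a) by default and to *(a)++
with -DPOSTINC, as in the inffast.c that ships with zlib; the names PUP_PRE,
PUP_POST, copy_pre and copy_post are made up for illustration, and the handling
of the last one or two bytes after the loop is an assumption, since the quoted
lines stop at 235.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed default flavour: pointer sits one byte BEFORE the next byte. */
#define PUP_PRE(a)  (*++(a))
/* Assumed -DPOSTINC flavour: pointer sits ON the next byte. */
#define PUP_POST(a) (*(a)++)

/* Hypothetical helper: copy len bytes from dist bytes back, pre-increment
   style. `out` points at the last byte already written. Returns the
   pointer to the new last byte written. Requires len >= 3, dist >= 1. */
static unsigned char *copy_pre(unsigned char *out, unsigned dist, unsigned len)
{
    unsigned char *from = out - dist;

    assert(len >= 3 && dist >= 1);
    do {                                /* same shape as lines 230-235 */
        PUP_PRE(out) = PUP_PRE(from);
        PUP_PRE(out) = PUP_PRE(from);
        PUP_PRE(out) = PUP_PRE(from);
        len -= 3;
    } while (len > 2);
    if (len) {                          /* assumed tail: 1 or 2 bytes left */
        PUP_PRE(out) = PUP_PRE(from);
        if (len > 1)
            PUP_PRE(out) = PUP_PRE(from);
    }
    return out;
}

/* Hypothetical helper: same copy, post-increment style. `out` points at
   the next byte to write. Returns the next free position. */
static unsigned char *copy_post(unsigned char *out, unsigned dist, unsigned len)
{
    unsigned char *from = out - dist;

    assert(len >= 3 && dist >= 1);
    do {
        PUP_POST(out) = PUP_POST(from);
        PUP_POST(out) = PUP_POST(from);
        PUP_POST(out) = PUP_POST(from);
        len -= 3;
    } while (len > 2);
    if (len) {
        PUP_POST(out) = PUP_POST(from);
        if (len > 1)
            PUP_POST(out) = PUP_POST(from);
    }
    return out;
}
```

Both helpers copy byte by byte on purpose, so a match with dist smaller than
len repeats the bytes it has just written. The only difference is where the
pointers rest between copies, which is what changes the addressing in the
generated loop.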
In any case, all of the above differences are still very small and system
noise easily overwhelms them. The profile feedback optimization still
shows the best improvement:
icc -O3 -prof_use -c inffast.c
zbuflen 16384, clock 10.850, time 11.042
zbuflen 16384, clock 10.850, time 11.032
The other differences spotted by eyeballing the asm files are not as easy
for me to grok, although there seems to be a different ordering of basic
blocks and different instruction scheduling.