[Zlib-devel] infnew-5 available for testing

Thu Jan 9 19:39:00 EST 2003

On Thu, 2 Jan 2003, Chris Anderson wrote:

>
> Another observation is that the infnew code is faster when the profiler
> feedback optimization (-prof_use) is thrown at it.  The original
> zlib-1.1.4 doesn't get as much speedup with this optimization, but is
> pretty fast without it anyway.
>

In a fit of sleeplessness, I generated some asm files with different
compiler optimizations and found that icc generates a faster inflate()
when loop unrolling is disabled (-unroll0).  One of the differences
between icc -O3 and icc -O3 -prof_use is that icc -O3 unrolls this loop
and icc -O3 -prof_use does not (infnew-5/inffast.c):

    229                     from = out - dist;
    230                     do {
    231                         PUP(out) = PUP(from);
    232                         PUP(out) = PUP(from);
    233                         PUP(out) = PUP(from);
    234                         len -= 3;
    235                     } while (len > 2);

The asm for this loop without unrolling looks like this (the tail of the
unrolled version looks the same):

.B1.41:                         # Preds .B1.40 .B1.41           # Infreq
        lea       1(%eax), %edx                                 #231.36
        movb      1(%eax), %bl                                  #231.36
        lea       1(%ecx), %ebp                                 #231.25
        movb      %bl, 1(%ecx)                                  #231.25
        movb      2(%eax), %bl                                  #232.36
        addl      $3, %eax                                      #233.36
        movb      %bl, 2(%ecx)                                  #232.25
        movb      2(%edx), %bl                                  #233.36
        addl      $3, %ecx                                      #233.25
        movb      %bl, 2(%ebp)                                  #233.25
        movl      48(%esp), %ebx                                #234.25
        addl      $-3, %ebx                                     #234.25
        movl      %ebx, 48(%esp)                                #234.25
        cmpl      $2, %ebx                                      #235.30
        ja        .B1.41        # Prob 66%                      #235.30

icc -O3 -c inffast.c
zbuflen  16384, clock 12.210, time 12.434
zbuflen  16384, clock 12.210, time 12.411

icc -unroll0 -O3 -c inffast.c
zbuflen  16384, clock 12.110, time 12.306
zbuflen  16384, clock 12.110, time 12.328

Defining POSTINC does save 2 instructions in the above loop, but it adds
2 instructions to the loop prefix, and the unrolled build didn't show a
consistent advantage the way the -unroll0 build did.

icc -unroll0 -O3 -DPOSTINC -c inffast.c
zbuflen  16384, clock 11.980, time 12.181
zbuflen  16384, clock 11.980, time 12.183

icc -O3 -DPOSTINC -c inffast.c
zbuflen  16384, clock 12.260, time 12.467
zbuflen  16384, clock 12.050, time 12.270
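
For readers without the source at hand, here is a minimal sketch of what
POSTINC changes. It assumes PUP and OFF are defined the way they are in
zlib's inffast.c (pre-increment by default, post-increment with POSTINC);
infnew-5 may differ in detail. copy_match() is a hypothetical wrapper made
up for illustration, not a function in the library.

```c
#include <stddef.h>

/* Assumed to mirror inffast.c: with POSTINC the pointers are used as-is
   and post-incremented; otherwise they are biased back by one byte and
   pre-incremented. */
#ifdef POSTINC
#  define OFF 0
#  define PUP(a) *(a)++
#else
#  define OFF 1
#  define PUP(a) *++(a)
#endif

/* Hypothetical helper: append len (>= 3) bytes to buf at offset pos,
   copying from dist bytes back in the same buffer. The regions may
   overlap, so the copy must go byte by byte in increasing order.
   Returns the new output offset.
   Note: without POSTINC this forms a pointer one byte before buf when
   pos == dist, so buf should not be the very start of its array. */
static size_t copy_match(unsigned char *buf, size_t pos,
                         size_t dist, size_t len)
{
    unsigned char *out = buf + pos - OFF;
    unsigned char *from = out - dist;

    do {                        /* minimum length is three */
        PUP(out) = PUP(from);
        PUP(out) = PUP(from);
        PUP(out) = PUP(from);
        len -= 3;
    } while (len > 2);
    if (len) {                  /* one or two bytes left over */
        PUP(out) = PUP(from);
        if (len > 1)
            PUP(out) = PUP(from);
    }
    return (size_t)(out + OFF - buf);
}
```

Building this once with and once without -DPOSTINC should give the same
output bytes; only the addressing in the generated loop differs.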

In any case, all of the above differences are still very small (a couple
of percent of clock time at most) and system noise easily overwhelms them.
The profile feedback optimization still shows the best improvement, about
11% below plain icc -O3:

icc -O3 -prof_use -c inffast.c
zbuflen  16384, clock 10.850, time 11.042
zbuflen  16384, clock 10.850, time 11.032
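
Since differences this small are easily lost in noise, one common way to
steady such numbers is to repeat each run and keep the minimum. The sketch
below is not the test program that produced the figures above; it assumes
"clock" means CPU time from clock() and "time" means wall-clock time, and
bench_min(), bench_result and spin() are hypothetical names for
illustration.

```c
#include <time.h>

typedef void (*bench_fn)(void);

typedef struct {
    double cpu;     /* seconds of CPU time, from clock() */
    double wall;    /* seconds of wall-clock time */
} bench_result;

/* Wall-clock seconds via C11 timespec_get(). */
static double wall_now(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Run fn() the given number of times and keep the smallest CPU and
   wall-clock times seen; the minimum is less sensitive to other
   activity on the system than a single run or an average. */
static bench_result bench_min(bench_fn fn, int runs)
{
    bench_result best = { 0.0, 0.0 };

    for (int i = 0; i < runs; i++) {
        clock_t c0 = clock();
        double w0 = wall_now();
        fn();
        double cpu = (double)(clock() - c0) / CLOCKS_PER_SEC;
        double wall = wall_now() - w0;
        if (i == 0 || cpu < best.cpu)
            best.cpu = cpu;
        if (i == 0 || wall < best.wall)
            best.wall = wall;
    }
    return best;
}

/* Hypothetical stand-in workload; a real test would call inflate()
   over the test data here. */
static volatile unsigned long sink;

static void spin(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        acc += i;
    sink = acc;
}
```

Calling bench_min(spin, 5) and printing the cpu and wall fields gives one
pair of numbers per build, which makes it easier to compare builds that
differ by only a percent or two.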

The other differences spotted by eyeballing the asm files are not as easy
for me to grok, although there seems to be a different ordering of basic
blocks and different instruction scheduling.