[Zlib-devel] infnew-5 available for testing

Thu Jan 9 23:58:01 EST 2003

On Thu, 9 Jan 2003, Mark Adler wrote:

> On Thursday, January 9, 2003, at 05:43  PM, Chris Anderson wrote:
>
> > One of the differences
> > between icc -O3 and icc -O3 -prof_use is that icc -O3 unrolls this loop
> > and icc -O3 -prof_use does not (infnew-5/inffast.c):
>
> This does not surprise me.  I tried several variations on the amount of
> loop unrolling and used the optimal amount for the distribution of
> lengths in typical deflate streams (I should say optimal for my
> processor).
>

That explains a lot.

I just tried icc version 7.0 (I was using 6.0). It no longer unrolls
that loop with -O3, and speed improved slightly.  Moreover, POSTINC
improved times with both -O3 and -O3 -prof_use:

icc -O3
zbuflen  16384, clock 11.530, time 12.173
zbuflen  16384, clock 11.550, time 12.066

icc -O3 -DPOSTINC
zbuflen  16384, clock 11.440, time 12.004
zbuflen  16384, clock 11.520, time 12.045

icc -O3 -prof_use
zbuflen  16384, clock 10.850, time 11.385
zbuflen  16384, clock 10.930, time 11.449

icc -O3 -prof_use -DPOSTINC
zbuflen  16384, clock 10.670, time 11.226
zbuflen  16384, clock 10.720, time 11.242

The Intel gods must have been listening.
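
For anyone without the source open: POSTINC only changes how the
pointers step in the copy loop under discussion.  A minimal sketch,
modeled on the PUP/OFF macros found in released zlib's inffast.c
(infnew-5 may differ in detail; copy3 is a hypothetical wrapper for
illustration, not a zlib function):

```c
#include <stddef.h>

/* POSTINC selects post-increment stepping.  The default is pre-increment,
   which needs both pointers biased one byte low (OFF) before the loop. */
#ifdef POSTINC
#  define OFF 0
#  define PUP(a) *(a)++
#else
#  define OFF 1
#  define PUP(a) *++(a)
#endif

/* Hypothetical helper: copy len bytes from `from` to `out`, one byte at a
   time, three bytes per loop pass.  Byte-wise copying is deliberate: when
   the regions overlap (from < out), earlier output is replicated forward.
   Assumption: in the pre-increment build, callers leave one valid byte
   before each pointer so the biased pointers stay inside the arrays. */
static void copy3(unsigned char *out, const unsigned char *from, unsigned len)
{
    out -= OFF;
    from -= OFF;
    while (len > 2) {           /* the loop shown in the asm listings */
        PUP(out) = PUP(from);
        PUP(out) = PUP(from);
        PUP(out) = PUP(from);
        len -= 3;
    }
    if (len) {                  /* one or two leftover bytes */
        PUP(out) = PUP(from);
        if (len > 1)
            PUP(out) = PUP(from);
    }
}
```

Build with -DPOSTINC to get the post-increment form; without it the
same source compiles to the pre-increment form.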

> > The asm for this loop without unrolling looks like this
> ...
> >         movl      48(%esp), %ebx                                #234.25
> >         addl      $-3, %ebx                                     #234.25
> >         movl      %ebx, 48(%esp)                                #234.25
>
> That's interesting.  I would think that any self-respecting compiler
> would keep the loop counter in a register instead of on the stack.
> Then again, perhaps I'm more used to a processor with 32 registers than
> one with eight.
>
> mark
>

Looks like gcc does something similar, but with POSTINC it doesn't
drop the two instructions that icc does:

gcc3.2 -O3 -DPOSTINC (15 instructions)

.L46:
        movb    (%edx), %cl
        movb    %cl, (%edi)
        incl    %edx
        movb    (%edx), %cl
        incl    %edi
        movb    %cl, (%edi)
        incl    %edx
        incl    %edi
        movb    (%edx), %cl
        movb    %cl, (%edi)
        subl    $3, -64(%ebp)
        incl    %edx
        incl    %edi
        cmpl    $2, -64(%ebp)
        ja      .L46

icc7 -O3 -DPOSTINC (13 instructions, w/o POSTINC it was 15)

..B1.37:                        # Preds ..B1.37 ..B1.36         # Infreq
        movb      (%edx), %bl                                   #231.36
        movb      %bl, (%ebp)                                   #231.25
        movb      1(%edx), %cl                                  #232.36
        lea       1(%edx), %edi                                 #231.36
        lea       1(%ebp), %ebx                                 #231.25
        movb      %cl, 1(%ebp)                                  #232.25
        movb      2(%edx), %cl                                  #233.36
        movb      %cl, 2(%ebp)                                  #233.25
        addl      $-3, %eax                                     #234.25
        addl      $3, %edx                                      #233.36
        addl      $3, %ebp                                      #233.25
        cmpl      $2, %eax                                      #235.30
        ja        ..B1.37       # Prob 90%                      #235.30

But the times are better with POSTINC!

gcc3.2 -O3 -DPOSTINC
zbuflen  16384, clock 11.940, time 12.481
zbuflen  16384, clock 12.020, time 12.592

gcc3.2 -O3
zbuflen  16384, clock 12.680, time 13.257
zbuflen  16384, clock 12.680, time 13.226
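
To put the POSTINC numbers side by side, here is a small sketch that
averages the two "clock" figures quoted above for each configuration
and reports the relative reduction.  The struct and function names are
made up for illustration; the figures are copied from this message.
By this arithmetic the gain is roughly 5.5% for gcc3.2 -O3, against
roughly 0.5% for icc -O3 and 1.8% for icc -O3 -prof_use.

```c
#include <stdio.h>

/* Two "clock" readings without POSTINC and two with, as quoted above. */
struct run {
    const char *name;
    double base[2];
    double postinc[2];
};

static const struct run runs[] = {
    { "icc -O3",           { 11.530, 11.550 }, { 11.440, 11.520 } },
    { "icc -O3 -prof_use", { 10.850, 10.930 }, { 10.670, 10.720 } },
    { "gcc3.2 -O3",        { 12.680, 12.680 }, { 11.940, 12.020 } },
};

/* Mean of a pair of readings. */
static double mean2(const double v[2])
{
    return (v[0] + v[1]) / 2.0;
}

/* Percent reduction in mean clock from adding -DPOSTINC
   (positive means POSTINC was faster). */
static double postinc_gain(const struct run *r)
{
    double base = mean2(r->base);
    return 100.0 * (base - mean2(r->postinc)) / base;
}

/* Print one line per configuration. */
static void print_gains(void)
{
    size_t i;
    for (i = 0; i < sizeof runs / sizeof runs[0]; i++)
        printf("%-20s %.2f%%\n", runs[i].name, postinc_gain(&runs[i]));
}
```

With only two runs per configuration, the smaller icc differences are
close to the run-to-run spread, so treat them as indicative only.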