[Zlib-devel] [PING] Improve longest_match performance

Andreas Krebbel krebbel at linux.vnet.ibm.com
Tue Aug 21 05:30:58 EDT 2012


On 24/07/12 16:56, John Reiser wrote:
> On 07/24/2012, Andreas Krebbel wrote:
> 
>> any comments regarding this one?
>>
>> http://mail.madler.net/pipermail/zlib-devel_madler.net/2012-June/002907.html
> 
> These changes in the width of variables which contribute to indexing
> of arrays:
> -----
> +    uLong cur_match = pcur_match; /* extend to pointer width */
> 
> +    uLong wmask = s->w_mask;
> -----
> should use 'ptrdiff_t' instead.  It ought to take only a minor increase
> in intelligence for the compiler to do such a widening automatically,
> because all uses are associated with array indexing, and a machine with
> 64-bit addresses often charges a high price for indexing by a 32-bit quantity.
> 
> 
> These changes in the width of scan_start, scan_end, scan_end1:
> -----
> -    register ush scan_start = *(ushf*)scan;
> -    register ush scan_end   = *(ushf*)(scan+best_len-1);
> +    register uInt scan_start = *(ushf*)scan;
> +    register uInt scan_end   = *(ushf*)(scan+best_len-1);
>  #else
>      register Bytef *strend = s->window + s->strstart + MAX_MATCH;
> -    register Byte scan_end1  = scan[best_len-1];
> -    register Byte scan_end   = scan[best_len];
> +    register uInt scan_end1  = scan[best_len-1];
> +    register uInt scan_end   = scan[best_len];
> -----
> are an indictment of the compiler, so you should demand better
> from its maintainers.  All uses depend on only the width in memory,
> such as:
> -----
>         if (match[best_len]   != scan_end  ||
>             match[best_len-1] != scan_end1 ||
> -----
> thus any widening and narrowing are the compiler's fault alone.
> Also, widening these in the source code likely will hurt performance
> for lower-end compilers (in particular, many instances of gcc on
> [or for] i386, i486, other low-end processors) which tend to believe
> the declarations, then skimp on range analysis.

I fully agree that it would be best if the compiler could figure this all out. I'm
maintaining the GCC back-end for S/390 and have already tried a couple of things in the
back-end, but I wasn't able to find anything that fully solves the issue without
creating regressions in other cases.
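
For reference, the widening under discussion is just a change of declared type at the
top of longest_match; with John's ptrdiff_t suggestion (from <stddef.h>) applied, the
hunk would read roughly like this (my transcription, substituting ptrdiff_t for uLong):

    ptrdiff_t cur_match = pcur_match; /* extend to pointer width */
    ptrdiff_t wmask = s->w_mask;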

Perhaps an existing optimization pass could be enhanced to do this kind of type promotion
automatically - or a completely new pass could be written. But that would probably be a
major change which will take some time. And, as usual, cases will appear where it actually
impacts performance negatively :(

GCC isn't extra clever here for x86 and Power either; no type promotions are done
automatically for these targets. Currently it just happens to be the case that the C
code matches their instruction sets much better than S/390's.

E.g. looking at the loop exit test:

(cur_match = prev[cur_match & wmask]) > limit

In order to access the short int array prev, the code needs to do quite some arithmetic on
cur_match and sooner or later has to zero extend it to address width in order to add it to
&prev. On x86 that zero extend comes for free since the AND instruction already does this
implicitly when using 32-bit operands. On S/390 an extra instruction is needed, which can
be avoided by keeping cur_match at 64 bits directly.
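
As a self-contained sketch of that loop (my simplification; in zlib it lives inside
longest_match and also counts down chain_length):

    /* Pos is zlib's 16-bit position type.  Keeping cur_match at pointer
       width means the 16-bit value loaded from prev[] only has to be zero
       extended once, by the load itself. */
    typedef unsigned short Pos;

    static unsigned long walk_chain(const Pos *prev, unsigned long wmask,
                                    unsigned long limit, unsigned pcur_match)
    {
        unsigned long cur_match = pcur_match; /* extend to pointer width */
        do {
            /* ... match comparison against the window elided ... */
        } while ((cur_match = prev[cur_match & wmask]) > limit);
        return cur_match;
    }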

x86:

  403eb0:	21 de                	and    %ebx,%esi
  403eb2:	0f b7 74 75 00       	movzwl 0x0(%rbp,%rsi,2),%esi
  403eb7:	41 39 f2             	cmp    %esi,%r10d

S/390 old

    80004616:	14 37             	nr	%r3,%r7
    80004618:	b9 16 00 33       	llgfr	%r3,%r3		32>64 zero extend
    8000461c:	eb 33 00 01 00 0d 	sllg	%r3,%r3,1
    80004622:	e3 43 60 00 00 95 	llh	%r4,0(%r3,%r6)  16>32 zero extend
    80004628:	b9 16 00 34       	llgfr	%r3,%r4 	32>64 zero extend
    8000462c:	15 e4             	clr	%r14,%r4

S/390 new

    80004690:   b9 80 00 3a             ngr     %r3,%r10
    80004694:   eb 33 00 01 00 0d       sllg    %r3,%r3,1
    8000469a:   e3 33 90 00 00 91       llgh    %r3,0(%r3,%r9)
    800046a0:   b9 21 00 38             clgr    %r3,%r8

The first llgfr disappears because the calculation there is already done in 64 bits. The
llh / llgfr pair is not necessary anymore either: since no 32-bit copy of cur_match has
to be kept around, a single llgh doing the 16-to-64-bit zero extend will do.

To my understanding, 64-bit operations on 64-bit targets are at least as efficient as
32-bit operations. And compiling zlib for a 32-bit target makes uLong (unsigned long) a
32-bit data type again. So this should not hurt.


Bye,

-Andreas-




