[Zlib-devel] Performance patch set

Joakim Tjernlund joakim.tjernlund at transmode.se
Wed May 12 04:37:48 EDT 2010


>
> Joakim Tjernlund wrote:
> >> Hi devs,
> >>
> >> as I wrote some time ago, I have been working on
> >> Subversion performance issues and came across
> >> some optimization potential in zlib as well.
> >>
> >> Inflate is now about 50% faster and a few minor
> >> optimizations for deflate were done along the way,
> >> too.
> >>
> >> Although the changes are mostly independent
> >> of each other, it would have been difficult to
> >> create truly independent patches. Therefore, it's
> >> all in one package. The patch was made against
> >> the 1.2.5 release version.
> >>
> >> -- Stefan^2.
> >>
> >> [[[
> >> Major performance enhancement in deflate_fast
> >> plus a few minor ones in other places (see below).
> >> All of them take advantage of various platform-
> >> specific capabilities.
> >>
> >> * deflate.c: make sure the UNALIGNED_OK
> >>   optimization is only used when allowed, i.e.
> >>   enforce the condition mentioned in line 1125.
> >>
> >> * deflate.h: increase output buffer capacity such
> >>   that we can add new values in a single step
> >>   and flush on overflow afterwards (see trees.c).
> >>   Also move push_short from trees.c to here
> >>   because push_byte is there, too.
> >>
> >> * inffast.c: major rework:
> >>   - hold / pre-fetch up to 8 bytes of data
> >>     to minimize the number of 'top up' operations
> >>   - prefetch data using a single memory access,
> >>     if allowed by CPU architecture
> >>   - copy data in large chunks w/o the need to
> >>     check for buffer ends
> >>   - unroll the literal handling loop
> >>   - general latency tuning on the critical path
> >>
> >> * inftree.c: explicitly eliminate redundant memory
> >>   accesses as at least MS VC is not able to do it
> >>   (fearing pointer aliases?).
> >>
> >> * trees.c: optimize send_bits based on the larger
> >>   output buffer.
> >>
> >> * zconf.h: enable unaligned access for x86 and
> >>   x64 using GCC or MS VC. Same for little
> >>   endianness optimizations. Enable SSE2 code
> >>   when defined by compiler settings (GCC,
> >>   MS VC).
> >>
> >> patch by Stefan Fuhrmann (stefanfuhrmann< at > alice-dsl.de)
> >> ]]]
> >>
> >
> >
> >
> >>  #define put_byte(s, c) {s->pending_buf[s->pending++] = (c);}
> >>
> >> +/* Output a short LSB first on the stream.
> >> + * IN assertion: there is enough room in pendingBuf.
> >> + */
> >> +#if defined(LITTLE_ENDIAN) && defined(UNALIGNED_OK)
> >>
> >
> > defined(LITTLE_ENDIAN): not sure how LITTLE_ENDIAN is defined, but this
> > typically breaks as both __LITTLE_ENDIAN and __BIG_ENDIAN are defined if you include
> > some system header such as stdlib.h. One must use
> > #if __BYTE_ORDER == __LITTLE_ENDIAN to be sure.
> >
> Thanks for the hint, I will change that in zconf.h
> (this is where LITTLE_ENDIAN gets defined).

oh, zconf.h defines LITTLE_ENDIAN? Some systems (glibc on Linux, for instance) define both
__LITTLE_ENDIAN and LITTLE_ENDIAN, so it is not a good idea to define it in zconf.h too.

try:
echo "#include <stdlib.h>" > le_tst.c
cpp -dM le_tst.c  | grep ENDIAN
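
Something along these lines would avoid the clash (just a sketch, and
ZLIB_LITTLE_ENDIAN is a made-up name precisely so it cannot collide with
the system macro):

/* Sketch only: on glibc/Linux <endian.h> provides __BYTE_ORDER, so the
 * test above can be used directly; elsewhere fall back to targets that
 * are known to be little endian. */
#if defined(__linux__)
#  include <endian.h>
#  if __BYTE_ORDER == __LITTLE_ENDIAN
#    define ZLIB_LITTLE_ENDIAN 1
#  endif
#elif defined(_M_IX86) || defined(_M_X64) || defined(__i386__) || defined(__x86_64__)
#  define ZLIB_LITTLE_ENDIAN 1   /* x86/x64 is always little endian */
#endif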

> > Any reason you can't use this on a BE CPU? Isn't it enough that UNALIGNED_OK is defined?
> > PowerPC can do unaligned accesses too and is BE.
> >
> >
> The output stream is LE. So, the LSB must be
> stored first.

Yes, but you could do cpu_tole(w) to fix that. Then PowerPC and other
BE CPUs could benefit too.
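
Just a sketch of what I mean (cpu_to_le16() is a stand-in for whatever
byte-swap helper the platform provides, it is not part of zlib or of the
patch; on LE hosts it would be a no-op):

#if defined(UNALIGNED_OK)
/* Swap the value to LE before the unaligned store, so the on-stream
 * byte order stays LSB-first on both LE and BE hosts. */
#  define put_short(s, w) { \
    *(ush*)(s->pending_buf + s->pending) = cpu_to_le16((ush)(w)); \
    s->pending += 2; \
}
#endif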

> >> +#  define put_short(s, w) { \
> >> +    *(ush*)(s->pending_buf + s->pending) = (ush)(w);\
> >> +    s->pending += 2; \
> >> +}
> >> +#else
> >> +#  define put_short(s, w) { \
> >> +    put_byte(s, (uch)((w) & 0xff)); \
> >> +    put_byte(s, (uch)((ush)(w) >> 8)); \
> >> +}
> >> +#endif
> >>
> >>

> >> +#ifdef USE_SSE2
> >> +#  include <emmintrin.h>
> >> +#  define TRY_CHUNKY_COPY\
> >> +    if ((dist >= sizeof (__m128i)) || (last <= out)) { \
> >> +        do {\
> >> +            _mm_storeu_si128 ((__m128i*)(out+OFF), \
> >> +                              _mm_loadu_si128((const __m128i*)(from+OFF)));\
> >> +            out += sizeof (__m128i);\
> >> +            from += sizeof (__m128i);\
> >> +        } while (out < last); \
> >> +    }
> >> +#else
> >> +#  define TRY_CHUNKY_COPY\
> >>
> >
> > Have you tested that this is faster too? I recall testing this on PowerPC and it wasn't
> > faster. Hence I did my optimization with shorts instead.
> >
> Yes it is. However, using 4/8-byte copies instead of
> 2-byte ones has the following drawbacks:
>
> * 2-byte copies are aligned in 75% of all cases,
>   4-byte copies are aligned in 25% of all cases
>   and can only be brought up to 62.5%
>   -> on systems with a large unalignment penalty,
>   this makes 4-byte copies less than 50% faster
>   than 2-byte ones
>
> * longer "tail" code to match 'len' exactly (unnecessary,
>   but most people do it anyway). Since many copies
>   are < 10 bytes, another hard-to-predict jump can make
>   all the difference.

I see. If I remember correctly, I never tried aligning the 4-byte copy
or removing the tail code.

Furthermore, (out < last) costs more on ppc since
(--len) is basically free, so it would be good if
you could change the do {} while loop to use that instead.
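
Roughly like this, reusing the names from your macro (untested sketch,
'words' is just a local I made up):

/* Compute the iteration count once and let the decrement drive the
 * loop; decrement-and-branch is nearly free on ppc, while the pointer
 * compare (out < last) is not. */
unsigned words = (unsigned)((last - out + sizeof(long) - 1) / sizeof(long));
do {
    *(long*)(out+OFF) = *(long*)(from+OFF);
    out += sizeof (long);
    from += sizeof (long);
} while (--words);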

>
> QUICK_COPY accounts for about 15% of the
> inflate_fast runtime in my scenario (compression
> factor 2.2), the actual copy for about 10%. YMMV,
> especially on different architectures.
> >
> >> +    if (dist >= sizeof(long) || (last <= out)) { \
> >> +        do {\
> >> +            *(long*)(out+OFF) = *(long*)(from+OFF);\
> >> +            out += sizeof (long);\
> >> +            from += sizeof (long);\
> >> +        } while (out < last); \
> >> +    }
> >> +#endif
> >>
> >
> >
> -- Stefan^2.
>
