[Zlib-devel] Performance patch set

Joakim Tjernlund joakim.tjernlund at transmode.se
Mon May 10 05:02:27 EDT 2010


>
> Hi devs,
>
> as I wrote some time ago, I have been working on
> Subversion performance issues and came across
> some optimization potential in zlib as well.
>
> Inflate is now about 50% faster and a few minor
> optimizations for deflate were done along the way,
> too.
>
> Although the changes are mostly independent
> from each other, it would have been difficult to
> create truly independent patches. Therefore, it's
> all in one package. The patch was made against
> 1.2.5 release version.
>
> -- Stefan^2.
>
> [[[
> Major performance enhancement in deflate_fast
> plus a few minor ones in other places (see below).
> All of them take advantage of various platform-
> specific capabilities.
>
> * deflate.c: make sure the UNALIGNED_OK
>   optimization is only used when allowed, i.e.
>   enforce the condition mentioned in line 1125.
>
> * deflate.h: increase output buffer capacity such
>   that we can add new values in a single step
>   and flush on overflow afterwards (see trees.c).
>   Also move push_short from trees.c to here
>   because push_byte is there, too.
>
> * inffast.c: major rework:
>   - hold / pre-fetch up to 8 bytes of data
>     to minimize the number of 'top up' operations
>   - prefetch data using a single memory access,
>     if allowed by CPU architecture
>   - copy data in large chunks w/o the need to
>     check for buffer ends
>   - unroll the literal handling loop
>   - general latency tuning on the critical path
>
> * inftrees.c: explicitly eliminate redundant memory
>   accesses as at least MS VC is not able to do it
>   (fearing pointer aliases?).
>
> * trees.c: optimize send_bits based on the larger
>   output buffer.
>
> * zconf.h: enable unaligned access for x86 and
>   x64 using GCC or MS VC. Same for little
>   endianness optimizations. Enable SSE2 code
>   when defined by compiler settings (GCC,
>   MS VC).
>
> patch by Stefan Fuhrmann (stefanfuhrmann< at > alice-dsl.de)
> ]]]
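
(For reference, the "hold / pre-fetch up to 8 bytes" item boils down to
refilling a wide bit accumulator with a single load instead of byte by byte.
A minimal sketch, not the actual patch code, assuming a little-endian host
and at least 8 readable bytes at 'in':

#include <stdint.h>
#include <string.h>

/* 'hold' keeps the low 'bits' bits of pending input.  One wide (possibly
 * unaligned) load tops it up to at least 56 valid bits, so the decode
 * loop rarely needs another 'top up'.
 */
static const unsigned char *refill(const unsigned char *in,
                                   uint64_t *hold, unsigned *bits)
{
    uint64_t chunk;
    memcpy(&chunk, in, sizeof chunk);  /* one 8-byte read */
    *hold |= chunk << *bits;           /* new input lands above the old bits */
    in    += (63 - *bits) >> 3;        /* advance by the whole bytes consumed */
    *bits |= 56;                       /* now 56..63 valid bits */
    return in;
}
)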


>  #define put_byte(s, c) {s->pending_buf[s->pending++] = (c);}
>
> +/* Output a short LSB first on the stream.
> + * IN assertion: there is enough room in pendingBuf.
> + */
> +#if defined(LITTLE_ENDIAN) && defined(UNALIGNED_OK)

defined(LITTLE_ENDIAN): I am not sure how LITTLE_ENDIAN gets defined here, but this
typically breaks because both __LITTLE_ENDIAN and __BIG_ENDIAN are defined once you
include some system header such as stdlib.h. One must use
#if __BYTE_ORDER == __LITTLE_ENDIAN to be sure.

Any reason you can't use this on a BE CPU? Isn't it enough that UNALIGNED_OK is defined?
PowerPC can do unaligned accesses too and is BE.
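
With glibc, the test I have in mind would look roughly like this (a sketch only;
on glibc it is <endian.h> that defines __BYTE_ORDER, other platforms spell these
macros differently):

/* Sketch: compare the byte order instead of testing defined(), since a
 * glibc system header ends up defining both __LITTLE_ENDIAN and
 * __BIG_ENDIAN regardless of the actual byte order.
 */
#if defined(UNALIGNED_OK) && defined(__BYTE_ORDER) && \
    (__BYTE_ORDER == __LITTLE_ENDIAN)
   /* ... unaligned 16-bit store version of put_short ... */
#else
   /* ... portable two-byte version of put_short ... */
#endif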

> +#  define put_short(s, w) { \
> +    *(ush*)(s->pending_buf + s->pending) = (ush)(w);\
> +    s->pending += 2; \
> +}
> +#else
> +#  define put_short(s, w) { \
> +    put_byte(s, (uch)((w) & 0xff)); \
> +    put_byte(s, (uch)((ush)(w) >> 8)); \
> +}
> +#endif
>

I think you should wrap these new macros in a do { ... } while (0).
That way they won't break if you write something like
if (something)
  MACRO;
else
  ...;
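
Something like this, shown for the portable variant (a sketch):

/* Sketch: wrapped in do { } while (0) the macro expands to a single
 * statement, so it also survives 'if (x) put_short(s, w); else ...'.
 */
#define put_short(s, w) do { \
    put_byte(s, (uch)((w) & 0xff)); \
    put_byte(s, (uch)((ush)(w) >> 8)); \
} while (0)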

> +/* A reusable code snippet.  It copies 'len' bytes from 'from'
> + * to 'out'.  'len' must be 3 or larger.  This code is used
> + * when no optimization is available.
> + */
> +#define STANDARD_MIN3_COPY\
> +    while (len > 2) {\
> +        PUP(out) = PUP(from);\
> +        PUP(out) = PUP(from);\
> +        PUP(out) = PUP(from);\
> +        len -= 3;\
> +    }\
> +    if (len) { \
> +        PUP(out) = PUP(from);\
> +        if (len > 1)\
> +            PUP(out) = PUP(from);\
> +    }
> +
> +/* A reusable code snippet.  It copies data from 'from' to 'out',
> + * up to 'last', with the last chunk possibly exceeding 'last'
> + * by up to 15 bytes.
> + */
> +#ifdef USE_SSE2
> +#  include <emmintrin.h>
> +#  define TRY_CHUNKY_COPY\
> +    if ((dist >= sizeof (__m128i)) || (last <= out)) { \
> +        do {\
> +            _mm_storeu_si128 ((__m128i*)(out+OFF), \
> +                              _mm_loadu_si128((const __m128i*)(from+OFF)));\
> +            out += sizeof (__m128i);\
> +            from += sizeof (__m128i);\
> +        } while (out < last); \
> +    }
> +#else
> +#  define TRY_CHUNKY_COPY\

Have you tested that this is faster too? I recall testing this on PowerPC and it
wasn't faster, hence I did my optimization with shorts instead (roughly as
sketched below, after the quoted #endif).

> +    if (dist >= sizeof(long) || (last <= out)) { \
> +        do {\
> +            *(long*)(out+OFF) = *(long*)(from+OFF);\
> +            out += sizeof (long);\
> +            from += sizeof (long);\
> +        } while (out < last); \
> +    }
> +#endif
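
The short-based variant I have in mind looks roughly like this (a sketch in the
style of the patch above, reusing its dist/out/from/last/OFF conventions; not
the exact code I used):

/* Sketch of the 16-bit alternative: same structure as the long/SSE2
 * copies above, but with 2-byte chunks, which was the widest access
 * that paid off in my PowerPC tests.  The last chunk can overshoot
 * 'last' by at most 1 byte.
 */
#  define TRY_CHUNKY_COPY\
    if (dist >= sizeof(unsigned short) || (last <= out)) { \
        do {\
            *(unsigned short*)(out+OFF) = *(unsigned short*)(from+OFF);\
            out += sizeof (unsigned short);\
            from += sizeof (unsigned short);\
        } while (out < last); \
    }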




