[Zlib-devel] Performance patch set

Wed May 12 03:53:37 EDT 2010

Joakim Tjernlund wrote:
>> Hi devs,
>>
>> as I wrote some time ago, I have been working on
>> Subversion performance issues and came across
>> some optimization potential in zlib as well.
>>
>> Inflate is now about 50% faster and a few minor
>> optimizations for deflate were done along the way,
>> too.
>>
>> Although the changes are mostly independent
>> from each other, it would have been difficult to
>> create truly independent patches. Therefore, it's
>> all in one package. The patch was made against
>> 1.2.5 release version.
>>
>> -- Stefan^2.
>>
>> [[[
>> Major performance enhancement in deflate_fast
>> plus a few minor ones in other places (see below).
>> All of them take advantage of various platform-
>> specific capabilities.
>>
>> * deflate.c: make sure the UNALIGNED_OK
>>   optimization is only used when allowed, i.e.
>>   enforce the condition mentioned in line 1125.
>>
>> * deflate.h: increase output buffer capacity such
>>   that we can add new values in a single step
>>   and flush on overflow afterwards (see trees.c).
>>   Also move push_short from trees.c to here
>>   because push_byte is there, too.
>>
>> * inffast.c: major rework:
>>   - hold / pre-fetch up to 8 bytes of data
>>     to minimize the number of 'top up' operations
>>   - prefetch data using a single memory access,
>>     if allowed by CPU architecture
>>   - copy data in large chunks w/o the need to
>>     check for buffer ends
>>   - unroll the literal handling loop
>>   - general latency tuning on the critical path
>>
>> * inftree.c: explicitly eliminate redundant memory
>>   accesses as at least MS VC is not able to do it
>>   (fearing pointer aliases?).
>>
>> * trees.c: optimize send_bits based on the larger
>>   output buffer.
>>
>> * zconf.h: enable unaligned access for x86 and
>>   x64 using GCC or MS VC. Same for little
>>   endianess optimizations. Enable SSE2 code
>>   when defined by compiler settings (GCC,
>>   MS VC).
>>
>> patch by Stefan Fuhrmann (stefanfuhrmann< at > alice-dsl.de)
>> ]]]
>>     
>
>
>   
>>  #define put_byte(s, c) {s->pending_buf[s->pending++] = (c);}
>>
>> +/* Output a short LSB first on the stream.
>> + * IN assertion: there is enough room in pendingBuf.
>> + */
>> +#if defined(LITTLE_ENDIAN) && defined(UNALIGNED_OK)
>>     
>
> defined(LITTLE_ENDIAN), not sure how LITTLE_ENDIAN is defined but this
> typically break as both __LITTE_ENDIAN and __BIG_ENDIAN are defined if you include
> some system header such as stdlib.h. One must use
> #if __BYTE_ORDER == __LITTLE_ENDIAN to be sure
>   
Thanks for the hint, I will change that in zconf.h
(this is where LITTLE_ENDIAN gets defined).
> Any reason you can't use this a BE CPU? Isn't it enough that UNALIGED_OK is defined?
> PowerPC can do unaligned too and is BE.
>
>   
The output stream is LE. So, the LSB must be
stored first.
>> +#  define put_short(s, w) { \
>> +    *(ush*)(s->pending_buf + s->pending) = (ush)(w);\
>> +    s->pending += 2; \
>> +}
>> +#else
>> +#  define put_short(s, w) { \
>> +    put_byte(s, (uch)((w) & 0xff)); \
>> +    put_byte(s, (uch)((ush)(w) >> 8)); \
>> +}
>> +#endif
>>
>>     
>
> I think you should wrap these new macros with a do { ... } while(0)
> That way it won't break if you do something like
> if (something)
>   MACRO;
>   
I will change that. No problem.
>   
>> +/* A reusable code-snippet.  It copies 'len' bytes from 'from'
>> + * to 'out'.  'len' must be 3 or larger.  This code will be used
>> + * when no optimization will is available.
>> + */
>> +#define STANDARD_MIN3_COPY\
>> +    while (len > 2) {\
>> +        PUP(out) = PUP(from);\
>> +        PUP(out) = PUP(from);\
>> +        PUP(out) = PUP(from);\
>> +        len -= 3;\
>> +    }\
>> +    if (len) { \
>> +        PUP(out) = PUP(from);\
>> +        if (len > 1)\
>> +            PUP(out) = PUP(from);\
>> +    }
>> +
>> +/* A reusable code-snippet.  It copies data from 'from'to 'out'.
>> + * up to 'last' with the last chunk possibly exceeding 'last'
>> + * by up to 15 bytes.
>> + */
>> +#ifdef USE_SSE2
>> +#  include <emmintrin.h>
>> +#  define TRY_CHUNKY_COPY\
>> +    if ((dist >= sizeof (__m128i)) || (last <= out)) { \
>> +        do {\
>> +            _mm_storeu_si128 ((__m128i*)(out+OFF), \
>> +                              _mm_loadu_si128((const __m128i*)(from+OFF)));\
>> +            out += sizeof (__m128i);\
>> +            from += sizeof (__m128i);\
>> +        } while (out < last); \
>> +    }
>> +#else
>> +#  define TRY_CHUNKY_COPY\
>>     
>
> Have tested this is faster too? I recall testing this on PowerPC and it wasn't
> faster. Hence I did my optimization with shorts instead.
>   
Yes it is. However, using 4/8-byte copies instead of
2-byte ones has the following drawbacks:

* 2 byte copies be aligned in 75% of all cases,
  4 byte copies are aligned in 25% of all cases
  and can only be brought up to 62.5%
  -> on systems with large unalignment penalty,
  this makes 4-byte copies less than 50% faster
  than 2-byte ones

* longer "tail" code to match 'len' exactly (unnecessary
  but most people do it anyways). Since many copies
  are < 10 bytes, another hard-to-predict jump can make
  all the difference.

QUICK_COPY accounts for about 15% of the
inflate_fast runtime in my scenario (compression
factor 2.2), the actual copy about 10%. YMMV,
especially on different architectures.
>   
>> +    if (dist >= sizeof(long) || (last <= out)) { \
>> +        do {\
>> +            *(long*)(out+OFF) = *(long*)(from+OFF);\
>> +            out += sizeof (long);\
>> +            from += sizeof (long);\
>> +        } while (out < last); \
>> +    }
>> +#endif
>>     
>
>   
-- Stefan^2.