[Zlib-devel] patch-in-progress: vectorized adler32 calculation

Mon Apr 12 05:54:28 EDT 2010

On 04/12/2010 12:45 PM, Vincent Torri wrote:
>
>
> On Mon, 12 Apr 2010, Török Edwin wrote:
>
>> On 04/12/2010 11:17 AM, Stefan Fuhrmann wrote:
>>> Mark Adler wrote:
>>>> On Apr 11, 2010, at 12:56 PM, Stefan Fuhrmann wrote:
>>>>> So, I looked into it. ~15% of the zlib runtime is spent in adler32
>>>>> and the C implementation is as fast as it gets (close to 1 byte
>>>>> per cycle). The attached masm32 code provides a vectorized
>>>>> version of the hotspot of that function.
>>>> ...
>>>>> ; * adler32_fast_ssse3 ... fastest code, requires SSSE3 CPU feature
>>>>> ; * adler32_fast_sse2 ... almost as fast, requires SSE2 CPU feature
>>>>> ; * adler32_fast_mmx ... slowest code, for old CPUs
>>>>
>>>> Stefan,
>>>>
>>>> So how much faster are those than the C code for adler32?
>>> Throughput:
>>> SSSE3: ~3.5 bytes / clock tick
>>> SSE2: ~3 bytes / clock tick
>>> MMX: ~1.5 bytes / clock tick
>>> C-Code: ~1 byte / clock tick
>>>
>>> IOW, for x86 processors released since 2003 (SSE2), the checksum portion
>>> is reduced from ~15% to ~5% or better of the inflate runtime.
>>
>> Have you considered writing the SSE code using compiler intrinsics in C?
>> They are supported on all the major compilers: GCC, ICC, and MSVC, and
>> they appear to work quite well on GCC:
>> http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/
>>
>>
>> The advantage would be that:
>> - the compiler could inline the SSE-optimized adler32 (assuming you
>> put it into a .h, or .inc file so the compiler sees the
>> implementation), and maybe even do some constant propagation
>> - you would get SSE optimized x86-64 code too (your code is 32-bit
>> only from what I can tell). ALL x86-64 CPUs have at least SSE2.
>> - you could choose which variant to use based on preprocessor defines,
>> so that if zlib is compiled with -march=foo, it'll only build the SSE
>> variant that works on foo
>>
>> The disadvantage is that MSVC generates horrible code with SSE
>> intrinsics, at least according to that blogpost, so that'd still need
>> a .asm file, but the .asm could be generated from C by using mingw for
>> example.
>
> imho, the best thing is writing asm code in external files for some
> operations and not let the compiler try to optimize it (for high speed).

Well, the compiler won't be able to inline a function that lives in
adler32.asm. On x86-32 every function call has overhead, since
parameters are passed on the stack.

Is adler32() slow because it works on a lot of data, or because it is
called often and works on a small amount of data each time?

In the former case there probably isn't any advantage in inlining.
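
One way to answer that question is to count calls and bytes while
running a typical workload. A minimal sketch, assuming hypothetical
counter names (nothing here exists in zlib); adler_note_call() would be
called at the top of adler32() in a profiling build:

```c
#include <stddef.h>

/* Hypothetical profiling counters, not part of zlib. */
static unsigned long adler_calls;       /* number of adler32() calls seen */
static unsigned long long adler_bytes;  /* total bytes passed to adler32() */

/* Record one call that was handed `len` bytes. */
static void adler_note_call(size_t len)
{
    adler_calls++;
    adler_bytes += len;
}

/* Mean bytes per call; returns 0 if nothing was recorded. */
static double adler_mean_len(void)
{
    return adler_calls ? (double)adler_bytes / (double)adler_calls : 0.0;
}
```

A mean of a few dozen bytes would suggest call overhead matters; a mean
in the kilobytes would suggest only the inner loop does.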

Another advantage of writing it in C is that it would be used
automatically and get more testing, instead of living somewhere in
contrib, which most people don't use, and I don't think distribution
packagers use it either.
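
To illustrate the intrinsics idea, here is a rough sketch of an SSE2
path selected by preprocessor defines, with plain C as the fallback.
This is not Stefan's code and it is not tuned: it does a horizontal sum
per 16-byte block, where a fast version would keep partial sums in
vector registers. The function names and the HAVE_SSE2_SKETCH macro are
made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Compile-time check: GCC/Clang define __SSE2__, MSVC uses _M_X64 / _M_IX86_FP. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAVE_SSE2_SKETCH 1
#include <emmintrin.h>
#endif

#define ADLER_BASE 65521u  /* largest prime below 65536 */
#define ADLER_NMAX 5552u   /* max bytes before the 32-bit sums must be reduced */

/* Portable version: defer the modulo for up to ADLER_NMAX bytes. */
static uint32_t adler32_scalar(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff, s2 = (adler >> 16) & 0xffff;
    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) { s1 += *buf++; s2 += s1; }
        s1 %= ADLER_BASE;
        s2 %= ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

#ifdef HAVE_SSE2_SKETCH
/* SSE2 version: 16 bytes per step; byte i of a block adds (16 - i) times to s2. */
static uint32_t adler32_sse2(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff, s2 = (adler >> 16) & 0xffff;
    const __m128i zero = _mm_setzero_si128();
    /* _mm_set_epi16 lists the highest element first. */
    const __m128i w_lo = _mm_set_epi16(9, 10, 11, 12, 13, 14, 15, 16);
    const __m128i w_hi = _mm_set_epi16(1, 2, 3, 4, 5, 6, 7, 8);

    while (len >= 16) {
        size_t blocks = len / 16;
        if (blocks > ADLER_NMAX / 16) blocks = ADLER_NMAX / 16;
        len -= blocks * 16;
        while (blocks--) {
            __m128i v = _mm_loadu_si128((const __m128i *)buf);
            /* Plain byte sum: one partial sum per 64-bit half. */
            __m128i sad = _mm_sad_epu8(v, zero);
            uint32_t sum = (uint32_t)_mm_cvtsi128_si32(sad)
                         + (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
            /* Weighted sum: widen bytes to 16 bits, multiply-add with weights. */
            __m128i m = _mm_add_epi32(
                _mm_madd_epi16(_mm_unpacklo_epi8(v, zero), w_lo),
                _mm_madd_epi16(_mm_unpackhi_epi8(v, zero), w_hi));
            m = _mm_add_epi32(m, _mm_srli_si128(m, 8));
            m = _mm_add_epi32(m, _mm_srli_si128(m, 4));
            s2 += 16u * s1 + (uint32_t)_mm_cvtsi128_si32(m);
            s1 += sum;
            buf += 16;
        }
        s1 %= ADLER_BASE;
        s2 %= ADLER_BASE;
    }
    /* Fewer than 16 bytes remain: finish with the scalar loop. */
    return adler32_scalar((s2 << 16) | s1, buf, len);
}
#endif

/* Picks the variant at compile time, so -march=foo decides what gets built. */
static uint32_t adler32_auto(uint32_t adler, const unsigned char *buf, size_t len)
{
#ifdef HAVE_SSE2_SKETCH
    return adler32_sse2(adler, buf, len);
#else
    return adler32_scalar(adler, buf, len);
#endif
}
```

Since everything is static and in one translation unit, the compiler is
free to inline adler32_auto() into its callers, and the same source
builds for both x86-32 and x86-64.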

Best regards,
--Edwin