[Zlib-devel] patch-in-progress: vectorized adler32 calculation

Mon Apr 12 05:45:23 EDT 2010

On Mon, 12 Apr 2010, Török Edwin wrote:

> On 04/12/2010 11:17 AM, Stefan Fuhrmann wrote:
>> Mark Adler wrote:
>>> On Apr 11, 2010, at 12:56 PM, Stefan Fuhrmann wrote:
>>>> So, I looked into it. ~15% of the zlib runtime is spent in adler32
>>>> and the C implementation is as fast as it gets (close to 1 byte
>>>> per cycle). The attached masm32 code provides a vectorized
>>>> version of the hotspot of that function.
>>> ...
>>>> ; * adler32_fast_ssse3 ... fastest code, requires SSSE3 CPU feature
>>>> ; * adler32_fast_sse2 ... almost as fast, requires SSE2 CPU feature
>>>> ; * adler32_fast_mmx ... slowest code, for old CPUs
>>> 
>>> Stefan,
>>> 
>>> So how much faster are those than the C code for adler32?
>> Throughput:
>> SSSE3: ~3.5 bytes / clock tick
>> SSE2: ~3 bytes / clock tick
>> MMX: ~1.5 bytes / clock tick
>> C-Code: ~1 byte / clock tick
>> 
>> IOW, for x86 processors released since 2003 (SSE2), the checksum portion
>> is reduced from ~15% to ~5% or better of the inflate runtime.
>
> Have you considered writing the SSE code using compiler intrinsics in C?
> They are supported on all the major compilers: GCC, ICC, and MSVC, and they 
> appear to work quite well on GCC:
> http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/
>
> The advantage would be that:
> - the compiler could inline the SSE-optimized adler32 (assuming you put it 
> into a .h, or .inc file so the compiler sees the implementation), and maybe 
> even do some constant propagation
> - you would get SSE optimized x86-64 code too (your code is 32-bit only from 
> what I can tell). ALL x86-64 CPUs have at least SSE2.
> - you could choose which variant to use based on preprocessor defines, so 
> that if zlib is compiled with -march=foo, it'll only build the SSE variant 
> that works on foo
>
> The disadvantage is that MSVC generates horrible code with SSE intrinsics, at 
> least according to that blog post, so that would still need a .asm file, but 
> the .asm could be generated from C using mingw, for example.
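
For reference, the intrinsics approach quoted above might look roughly like
the sketch below. It is not Stefan's masm32 code and makes no claim to match
its speed; it only illustrates picking a variant at compile time from
preprocessor defines. The names adler32_scalar, adler32_sse2, adler32_best
and ADLER32_HAVE_SSE2 are made up for this example. __SSE2__ is the macro
GCC and Clang predefine when SSE2 is enabled; _M_X64 and _M_IX86_FP are the
MSVC equivalents.

```c
#include <stddef.h>
#include <stdint.h>

/* Compile-time feature check (hypothetical macro name ADLER32_HAVE_SSE2). */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define ADLER32_HAVE_SSE2 1
#else
#define ADLER32_HAVE_SSE2 0
#endif

#define ADLER_MOD 65521u /* largest prime below 2^16 */

/* Portable byte-at-a-time version; deliberately simple, not tuned. */
static uint32_t adler32_scalar(uint32_t adler, const unsigned char *buf,
                               size_t len)
{
    uint32_t a = adler & 0xffff, b = adler >> 16;
    while (len--) {
        a = (a + *buf++) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}

#if ADLER32_HAVE_SSE2
/* SSE2 variant: consumes 16 bytes per iteration, scalar code for the tail. */
static uint32_t adler32_sse2(uint32_t adler, const unsigned char *buf,
                             size_t len)
{
    uint32_t a = adler & 0xffff, b = adler >> 16;
    const __m128i zero = _mm_setzero_si128();
    /* Weights 16..9 for bytes 0..7 and 8..1 for bytes 8..15
       (_mm_set_epi16 lists the highest element first). */
    const __m128i w_lo = _mm_set_epi16(9, 10, 11, 12, 13, 14, 15, 16);
    const __m128i w_hi = _mm_set_epi16(1, 2, 3, 4, 5, 6, 7, 8);

    while (len >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)buf);

        /* Plain byte sum: two 8-byte partial sums from psadbw. */
        __m128i sad = _mm_sad_epu8(v, zero);
        uint32_t s = (uint32_t)_mm_cvtsi128_si32(sad) +
                     (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8));

        /* Weighted byte sum: widen to 16 bit, multiply-add with weights. */
        __m128i lo = _mm_unpacklo_epi8(v, zero);
        __m128i hi = _mm_unpackhi_epi8(v, zero);
        __m128i m = _mm_add_epi32(_mm_madd_epi16(lo, w_lo),
                                  _mm_madd_epi16(hi, w_hi));
        m = _mm_add_epi32(m, _mm_srli_si128(m, 8));
        m = _mm_add_epi32(m, _mm_srli_si128(m, 4));
        uint32_t ws = (uint32_t)_mm_cvtsi128_si32(m);

        /* b grows by 16 copies of the old a plus the weighted sum. */
        b = (b + 16 * a + ws) % ADLER_MOD;
        a = (a + s) % ADLER_MOD;
        buf += 16;
        len -= 16;
    }
    return adler32_scalar((b << 16) | a, buf, len);
}
#endif

/* The variant is fixed at compile time, e.g. by -march=... or -msse2. */
uint32_t adler32_best(uint32_t adler, const unsigned char *buf, size_t len)
{
#if ADLER32_HAVE_SSE2
    return adler32_sse2(adler, buf, len);
#else
    return adler32_scalar(adler, buf, len);
#endif
}
```

As with zlib's adler32(), the running checksum starts at 1, so a one-shot
call would be adler32_best(1, data, length). The modulo is taken once per
16-byte block here to keep the sketch short; a tuned version would defer it
over much larger blocks.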

imho, the best approach is to write the asm code for such operations in 
external files rather than letting the compiler try to optimize it (for 
high speed).

and again, for cross-platform purposes, use nasm or yasm.

Vincent Torri