[Zlib-devel] patch-in-progress: vectorized adler32 calculation

Mon Apr 12 04:17:57 EDT 2010

Mark Adler wrote:
> On Apr 11, 2010, at 12:56 PM, Stefan Fuhrmann wrote:
>   
>> So, I looked into it. ~15% of the zlib runtime is spent in adler32
>> and the C implementation is as fast as it gets (close to 1 byte
>> per cycle). The attached masm32 code provides a vectorized
>> version of the hotspot of that function.
>>     
> ...
>   
>> ; *    adler32_fast_ssse3 ... fastest code, requires SSSE3 CPU feature
>> ; *    adler32_fast_sse2  ... almost as fast, requires SSE2 CPU feature
>> ; *    adler32_fast_mmx   ... slowest code, for old CPUs
>>     
>
> Stefan,
>
> So how much faster are those than the C code for adler32?
>   
Throughput:
SSSE3: ~3.5 bytes / clock tick
SSE2:  ~3 bytes / clock tick
MMX:  ~1.5 bytes / clock tick
C-Code: ~1 byte / clock tick
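
For readers without the zlib source at hand, the scalar baseline behind the "C-Code" figure can be sketched roughly as follows. This is a minimal illustration of the Adler-32 definition with the modulo deferred per chunk. It is not the zlib source and not the attached assembly, and the name adler32_ref is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u  /* largest prime below 2^16 */
#define ADLER_NMAX 5552u   /* bytes that can be summed before 32-bit overflow */

/* Hypothetical reference routine (name is illustrative only).
 * 'adler' is the running checksum; pass 1 to start a new stream. */
uint32_t adler32_ref(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t a = adler & 0xffffu;          /* low half: sum of bytes, plus 1 */
    uint32_t b = (adler >> 16) & 0xffffu;  /* high half: sum of the 'a' values */

    while (len > 0) {
        /* Process at most ADLER_NMAX bytes, then reduce both sums. */
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {
            a += *buf++;   /* one add per byte ... */
            b += a;        /* ... and a second add that depends on it */
        }
        a %= ADLER_BASE;
        b %= ADLER_BASE;
    }
    return (b << 16) | a;
}
```

Calling adler32_ref(1, data, len) on a whole buffer gives the same value as feeding the buffer in pieces and passing each result back in as 'adler'. The per-byte dependency of b on a is what keeps a scalar loop near one byte per cycle; a vectorized version would presumably replace it with weighted sums over several bytes at once, but that is an assumption about the general technique, not a description of the attached code.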

In other words, on x86 processors released since 2003 (i.e. with SSE2), the
checksum's share of the inflate runtime drops from ~15% to ~5% or better.
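
The arithmetic behind that estimate can be sketched as below. It assumes the rest of inflate is unchanged and that the throughput ratios above translate directly into checksum time; both function names are made up for this example.

```c
/* share:   fraction of the original runtime spent in the checksum
 *          (0.15 is the figure quoted in this thread)
 * speedup: checksum throughput relative to the C code (e.g. 3.0 for SSE2) */

/* Checksum time after the speedup, as a fraction of the ORIGINAL runtime. */
double share_of_old_runtime(double share, double speedup)
{
    return share / speedup;
}

/* Checksum time after the speedup, as a fraction of the NEW, shorter runtime. */
double share_of_new_runtime(double share, double speedup)
{
    double checksum = share / speedup;
    return checksum / ((1.0 - share) + checksum);
}
```

With share = 0.15, a speedup of 3.0 gives 0.05 of the original runtime (about 0.056 of the new runtime), and a speedup of 3.5 gives roughly 0.043 of the original runtime.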

-- Stefan^2.