[Zlib-devel] patch-in-progress: vectorized adler32 calculation
Stefan Fuhrmann
stefanfuhrmann at alice-dsl.de
Mon Apr 12 04:17:57 EDT 2010
Mark Adler wrote:
> On Apr 11, 2010, at 12:56 PM, Stefan Fuhrmann wrote:
>
>> So, I looked into it. ~15% of the zlib runtime is spent in adler32
>> and the C implementation is as fast as it gets (close to 1 byte
>> per cycle). The attached masm32 code provides a vectorized
>> version of the hotspot of that function.
>>
> ...
>
>> ; * adler32_fast_ssse3 ... fastest code, requires SSSE3 CPU feature
>> ; * adler32_fast_sse2 ... almost as fast, requires SSE2 CPU feature
>> ; * adler32_fast_mmx ... slowest code, for old CPUs
>>
>
> Stefan,
>
> So how much faster are those than the C code for adler32?
>
Throughput:
SSSE3: ~3.5 bytes / clock tick
SSE2: ~3 bytes / clock tick
MMX: ~1.5 bytes / clock tick
C-Code: ~1 byte / clock tick
IOW, for x86 processors released since 2003 (SSE2), the checksum portion
is reduced from ~15% to ~5% or better of the inflate runtime.
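
For reference, the scalar baseline those numbers are measured against looks roughly like the sketch below. It is not the zlib source and not the attached masm32 code. The names adler32_sketch, ADLER_BASE and ADLER_NMAX are made up for illustration. It only shows the two running sums and the deferred modulo, which is why the plain C loop already runs close to 1 byte per tick.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u /* largest prime below 2^16 */
#define ADLER_NMAX 5552u  /* bytes that can be summed before 32-bit overflow */

/* Illustrative scalar Adler-32 (hypothetical name, not the zlib function).
 * The modulo is applied once per ADLER_NMAX bytes instead of once per byte. */
uint32_t adler32_sketch(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t a = adler & 0xffffu;         /* running sum of the bytes */
    uint32_t b = (adler >> 16) & 0xffffu; /* running sum of the a values */

    while (len > 0) {
        size_t chunk = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= chunk;
        while (chunk-- > 0) {
            a += *buf++; /* hot loop: one add for a ... */
            b += a;      /* ... and one dependent add for b, per byte */
        }
        a %= ADLER_BASE;
        b %= ADLER_BASE;
    }
    return (b << 16) | a;
}

/* Sanity check against well-known values; the initial checksum is 1. */
void adler32_sketch_selfcheck(void)
{
    assert(adler32_sketch(1u, (const unsigned char *)"", 0) == 1u);
    assert(adler32_sketch(1u, (const unsigned char *)"Wikipedia", 9)
           == 0x11E60398u);
}
```

The inner two-add loop is the hotspot that the vectorized versions replace. As a rough check of the runtime figure: if that loop costs ~15 units out of 100 at ~1 byte per tick, then at ~3 bytes per tick it costs ~5 units.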
-- Stefan^2.