[Zlib-devel] [4/4][RFC Patch] Add an x86 version of Adler32
Jan Seiffert
kaffeemonster at googlemail.com
Mon Mar 14 21:20:23 EDT 2011
And finally, an x86 version of Adler32.
It covers:
* Plain
* MMX
* SSE
* SSE2
* SSSE3
* 32 & 64 Bit
* PIC and non PIC
It also features runtime CPU detection and dispatch.
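To illustrate the detect-and-dispatch idea, here is a minimal sketch. It is not taken from the patch: the names adler32_pick and adler32_dispatch and the per-variant stand-ins are hypothetical. The GCC/clang builtin __builtin_cpu_supports is swapped in for whatever detection code the patch actually contains, which this message does not show. Every variant is aliased to the plain C loop so the sketch stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u /* largest prime smaller than 65536 */

typedef uint32_t (*adler32_fn)(uint32_t adler, const unsigned char *buf, size_t len);

/* Portable byte-at-a-time fallback (the "Plain" case). */
static uint32_t adler32_plain(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = adler >> 16;

    while (len--) {
        s1 = (s1 + *buf++) % ADLER_BASE;
        s2 = (s2 + s1) % ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/* Hypothetical stand-ins: a real build would point these at SIMD code. */
#define adler32_mmx   adler32_plain
#define adler32_sse   adler32_plain
#define adler32_sse2  adler32_plain
#define adler32_ssse3 adler32_plain

/* Assumption: the builtin needs GCC >= 4.8 or clang, on an x86 target. */
#if (defined(__i386__) || defined(__x86_64__)) && \
    (defined(__clang__) || (defined(__GNUC__) && \
     (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))))
#define HAVE_CPU_SUPPORTS 1
#endif

/* Return the best variant the running CPU reports, best first. */
static adler32_fn adler32_pick(void)
{
#ifdef HAVE_CPU_SUPPORTS
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3")) return adler32_ssse3;
    if (__builtin_cpu_supports("sse2"))  return adler32_sse2;
    if (__builtin_cpu_supports("sse"))   return adler32_sse;
    if (__builtin_cpu_supports("mmx"))   return adler32_mmx;
#endif
    return adler32_plain;
}

static adler32_fn adler32_impl; /* NULL until the first call */

/* Detect once, then forward every call through the cached pointer. */
static uint32_t adler32_dispatch(uint32_t adler, const unsigned char *buf, size_t len)
{
    if (!adler32_impl)
        adler32_impl = adler32_pick();
    return adler32_impl(adler, buf, len);
}
```

Caching the function pointer keeps the detection cost out of the hot path. The unsynchronized first-call initialization is a simplification in this sketch.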
It relies heavily on inline asm, so it is restricted to GCC or a
compatible compiler such as clang.
In return, the compiler helps with the boilerplate code: all the
fiddling with calling conventions, PIC and the bit-ness of the code.
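As a small illustration of that division of labour, here is a sketch of GCC extended inline asm. It is not code from the patch, and the helper name sum16 is hypothetical. The asm body names only the SSE2 instructions. The "m" and "r" constraints leave the addressing mode and register choice to the compiler, so the same source should build for 32 and 64 bit, with or without PIC. Builds without GCC-style asm or SSE2 fall back to a plain loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the 16 bytes starting at p (any alignment). */
static uint32_t sum16(const unsigned char *p)
{
#if defined(__GNUC__) && defined(__SSE2__)
    uint32_t lo, hi;

    __asm__ (
        "movdqu  %2, %%xmm0\n\t"      /* load 16 unaligned bytes        */
        "pxor    %%xmm1, %%xmm1\n\t"  /* xmm1 = 0                       */
        "psadbw  %%xmm1, %%xmm0\n\t"  /* one byte sum per 64-bit half   */
        "movd    %%xmm0, %0\n\t"      /* sum of bytes 0..7              */
        "psrldq  $8, %%xmm0\n\t"      /* move the upper half down       */
        "movd    %%xmm0, %1"          /* sum of bytes 8..15             */
        : "=&r" (lo), "=&r" (hi)
        : "m" (*(const unsigned char (*)[16])p)
        : "xmm0", "xmm1");
    return lo + hi;
#else
    /* Portable fallback for other compilers and targets. */
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < 16; i++)
        sum += p[i];
    return sum;
#endif
}
```

A byte sum like this corresponds to the low half of the Adler32 state. The weighted high half needs more work and is left out of this sketch.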
Here are some numbers (each speedup figure matches the first orig time
divided by the first time of that variant):
An old AMD Athlon64 X2, whose SSE unit is only 64 bits wide
-------- orig ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 12100 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 12100 ms
a: 0x733CB174, 10000 * 159998 bytes t: 12400 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 12700 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 12600 ms
a: 0x1902A382, 10000 * 159984 bytes t: 12600 ms
-------- MMX ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 6700 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 6800 ms
a: 0x733CB174, 10000 * 159998 bytes t: 6900 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 6800 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 6900 ms
a: 0x1902A382, 10000 * 159984 bytes t: 6900 ms
speedup: 1.805970
-------- SSE ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 6800 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 6800 ms
a: 0x733CB174, 10000 * 159998 bytes t: 6800 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 6900 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 6800 ms
a: 0x1902A382, 10000 * 159984 bytes t: 6900 ms
speedup: 1.779412
-------- SSE2 ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 5600 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 5700 ms
a: 0x733CB174, 10000 * 159998 bytes t: 5600 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 5700 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 5600 ms
a: 0x1902A382, 10000 * 159984 bytes t: 5700 ms
speedup: 2.160714
An Intel Core2
-------- orig ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 15200 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 14900 ms
a: 0x733CB174, 10000 * 159998 bytes t: 14900 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 15100 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 14800 ms
a: 0x1902A382, 10000 * 159984 bytes t: 14900 ms
-------- MMX ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 5500 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 5400 ms
a: 0x733CB174, 10000 * 159998 bytes t: 5500 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 5400 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 5400 ms
a: 0x1902A382, 10000 * 159984 bytes t: 5500 ms
speedup: 2.763636
-------- SSE ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 5400 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 5300 ms
a: 0x733CB174, 10000 * 159998 bytes t: 5400 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 5400 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 5400 ms
a: 0x1902A382, 10000 * 159984 bytes t: 5300 ms
speedup: 2.814815
-------- SSE2 ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 3400 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 3300 ms
a: 0x733CB174, 10000 * 159998 bytes t: 3400 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 3300 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 3400 ms
a: 0x1902A382, 10000 * 159984 bytes t: 3300 ms
speedup: 4.470588
-------- SSSE3 ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 2800 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 2900 ms
a: 0x733CB174, 10000 * 159998 bytes t: 2800 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 2800 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 2800 ms
a: 0x1902A382, 10000 * 159984 bytes t: 2900 ms
speedup: 5.428571
An AMD Sempron 140 (K10 architecture)
-------- orig ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 7500 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 7500 ms
a: 0x733CB174, 10000 * 159998 bytes t: 7400 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 7400 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 7800 ms
a: 0x1902A382, 10000 * 159984 bytes t: 7800 ms
-------- MMX ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 4600 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 4600 ms
a: 0x733CB174, 10000 * 159998 bytes t: 4600 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 4600 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 4600 ms
a: 0x1902A382, 10000 * 159984 bytes t: 4600 ms
speedup: 1.630435
-------- SSE ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 4100 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 4100 ms
a: 0x733CB174, 10000 * 159998 bytes t: 4100 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 4200 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 4100 ms
a: 0x1902A382, 10000 * 159984 bytes t: 4100 ms
speedup: 1.829268
-------- SSE2 ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 1800 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 1800 ms
a: 0x733CB174, 10000 * 159998 bytes t: 1800 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 1700 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 1800 ms
a: 0x1902A382, 10000 * 159984 bytes t: 1800 ms
speedup: 4.166667
An Intel P4 based Xeon (Nocona, in 64 bit mode)
-------- orig ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 21800 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 20900 ms
a: 0x733CB174, 10000 * 159998 bytes t: 21000 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 21000 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 21900 ms
a: 0x1902A382, 10000 * 159984 bytes t: 20900 ms
-------- SSE2 ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 5900 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 5300 ms
a: 0x733CB174, 10000 * 159998 bytes t: 4900 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 5300 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 5600 ms
a: 0x1902A382, 10000 * 159984 bytes t: 5400 ms
speedup: 3.694915
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 04-x86.patch
Type: text/x-patch
Size: 37887 bytes
Desc: not available
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20110315/68019160/attachment.bin>