[Zlib-devel] [4/6][RFC V2 Patch] Add an x86 version of Adler32
Vincent Torri
vtorri at univ-evry.fr
Thu Mar 31 01:26:41 EDT 2011
On Wed, 30 Mar 2011, Jan Seiffert wrote:
> And finally an x86 version of Adler32.
>
> It covers:
> * Plain
> * MMX
> * SSE
> * SSE2
> * SSSE3
> * 32 & 64 Bit
> * PIC and non PIC
>
> It also features runtime CPU detection and dispatch.
>
> It relies heavily on inline ASM, so it's restricted to GCC (or a
> compatible compiler, such as clang).
Why not use yasm or nasm?
Vincent Torri
> This gives us help from the compiler for boilerplate code, all the
> fiddling with calling conventions, PIC and bit-ness of the code.
>
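For readers unfamiliar with the "runtime CPU detection + dispatch" idea mentioned above, here is a minimal sketch in C. It is not the code from the patch: the function names are hypothetical, the SIMD variants are stand-ins that simply call the scalar routine, and detection uses the GCC/clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` (which need a newer GCC than was current when this thread was written) rather than whatever mechanism the patch uses.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u /* largest prime below 2^16 */

typedef uint32_t (*adler32_fn)(uint32_t adler, const unsigned char *buf,
                               size_t len);

/* Portable fallback: byte-at-a-time Adler-32, reducing after every byte. */
static uint32_t adler32_plain(uint32_t adler, const unsigned char *buf,
                              size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;

    while (len--) {
        s1 = (s1 + *buf++) % ADLER_BASE;
        s2 = (s2 + s1) % ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/* Hypothetical stand-ins: a real build would put SIMD code here. */
static uint32_t adler32_sse2_stub(uint32_t adler, const unsigned char *buf,
                                  size_t len)
{
    return adler32_plain(adler, buf, len);
}

static uint32_t adler32_ssse3_stub(uint32_t adler, const unsigned char *buf,
                                   size_t len)
{
    return adler32_plain(adler, buf, len);
}

/* Pick the best variant the running CPU supports, best first. */
static adler32_fn adler32_select(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3"))
        return adler32_ssse3_stub;
    if (__builtin_cpu_supports("sse2"))
        return adler32_sse2_stub;
#endif
    return adler32_plain;
}

static adler32_fn adler32_impl; /* NULL until the first call */

/* Public entry point: resolve the variant once, then call through it. */
uint32_t adler32_dispatch(uint32_t adler, const unsigned char *buf,
                          size_t len)
{
    if (!adler32_impl)
        adler32_impl = adler32_select();
    return adler32_impl(adler, buf, len);
}
```

Calling `adler32_dispatch(1, buf, len)` starts a fresh checksum, as with zlib's `adler32()`. The lazy initialisation here is not synchronised; that is harmless in this sketch because every thread would store the same pointer, but a real implementation may want something stricter.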
> Here are some numbers:
> An old AMD Athlon64 X2, whose SSE unit is only 64 bits wide
> -------- orig ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 12100 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 12100 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 12400 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 12700 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 12600 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 12600 ms
> -------- MMX ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 6700 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 6800 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 6900 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 6800 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 6900 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 6900 ms
> speedup: 1.805970
> -------- SSE ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 6800 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 6800 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 6800 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 6900 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 6800 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 6900 ms
> speedup: 1.779412
> -------- SSE2 ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 5600 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 5700 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 5600 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 5700 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 5600 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 5700 ms
> speedup: 2.160714
>
> An Intel Core2
> -------- orig ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 15200 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 14900 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 14900 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 15100 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 14800 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 14900 ms
> -------- MMX ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 5500 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 5400 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 5500 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 5400 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 5400 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 5500 ms
> speedup: 2.763636
> -------- SSE ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 5400 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 5300 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 5400 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 5400 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 5400 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 5300 ms
> speedup: 2.814815
> -------- SSE2 ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 3400 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 3300 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 3400 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 3300 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 3400 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 3300 ms
> speedup: 4.470588
> -------- SSSE3 ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 2800 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 2900 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 2800 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 2800 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 2800 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 2900 ms
> speedup: 5.428571
>
> An AMD Sempron 140 (K10 architecture)
> -------- orig ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 7500 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 7500 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 7400 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 7400 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 7800 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 7800 ms
> -------- MMX ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 4600 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 4600 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 4600 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 4600 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 4600 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 4600 ms
> speedup: 1.630435
> -------- SSE ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 4100 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 4100 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 4100 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 4200 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 4100 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 4100 ms
> speedup: 1.829268
> -------- SSE2 ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 1800 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 1800 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 1800 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 1700 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 1800 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 1800 ms
> speedup: 4.166667
>
> An Intel P4 based Xeon (Nocona, in 64 bit mode)
> -------- orig ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 21800 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 20900 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 21000 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 21000 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 21900 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 20900 ms
> -------- SSE2 ------
> a: 0x0CB4B676, 10000 * 160000 bytes t: 5900 ms
> a: 0x25BEB273, 10000 * 159999 bytes t: 5300 ms
> a: 0x733CB174, 10000 * 159998 bytes t: 4900 ms
> a: 0x1144AF76, 10000 * 159996 bytes t: 5300 ms
> a: 0x3F4ECB8A, 10000 * 159992 bytes t: 5600 ms
> a: 0x1902A382, 10000 * 159984 bytes t: 5400 ms
> speedup: 3.694915
>
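The "speedup" lines in the quoted numbers appear to be the time of the first "orig" run (160000 bytes) divided by the time of the first run of the variant in question; every figure above is consistent with that reading. A small sketch in C, with a hypothetical helper name, makes the arithmetic explicit:

```c
#include <stdio.h>

/* Hypothetical helper: ratio of baseline time to variant time, both in ms. */
static double speedup(unsigned baseline_ms, unsigned variant_ms)
{
    return (double)baseline_ms / (double)variant_ms;
}

/* Print one "speedup:" line in the same format as the quoted output. */
static void print_speedup(unsigned baseline_ms, unsigned variant_ms)
{
    printf("speedup: %f\n", speedup(baseline_ms, variant_ms));
}
```

For example, `print_speedup(12100, 6700)` reproduces the Athlon64 MMX line (`speedup: 1.805970`), and `print_speedup(15200, 2800)` the Core2 SSSE3 line (`speedup: 5.428571`).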