[Zlib-devel] [4/6][RFC V2 Patch] Add an x86 version of Adler32

Vincent Torri vtorri at univ-evry.fr
Thu Mar 31 01:26:41 EDT 2011



On Wed, 30 Mar 2011, Jan Seiffert wrote:

> And finally an x86 version of Adler32.
>
> It covers:
> * Plain
> * MMX
> * SSE
> * SSE2
> * SSSE3
> * 32 & 64 Bit
> * PIC and non PIC
>
> And features a runtime cpu detection + dispatch.
>
> It relies heavily on inline ASM, so it's restricted to GCC (or a
> compatible compiler, like clang).

Why not use yasm or nasm?

Vincent Torri
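
For readers following along, the "runtime cpu detection + dispatch"
mentioned in the quoted description can be sketched roughly as below.
This is not the code from the patch: the names adler32_plain,
adler32_simd_stub and pick_adler32 are made up for illustration, the
SIMD variant is only a stub that forwards to the plain loop, and the
detection uses the __builtin_cpu_supports() helper found in later GCC
and clang releases rather than whatever the patch itself does.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u /* largest prime below 2^16 */

/* Common signature shared by every variant (hypothetical name). */
typedef uint32_t (*adler32_fn)(uint32_t adler, const unsigned char *buf,
                               size_t len);

/* Portable fallback: byte-at-a-time Adler-32, reduced on every step. */
static uint32_t adler32_plain(uint32_t adler, const unsigned char *buf,
                              size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;
    size_t i;

    for (i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER_BASE;
        s2 = (s2 + s1) % ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/* Stand-in for a SIMD variant. A real one would use SSE2/SSSE3 code;
 * this stub only forwards to the plain loop so the sketch stays small. */
static uint32_t adler32_simd_stub(uint32_t adler, const unsigned char *buf,
                                  size_t len)
{
    return adler32_plain(adler, buf, len);
}

/* Pick the best available variant once, at run time. */
static adler32_fn pick_adler32(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3") || __builtin_cpu_supports("sse2"))
        return adler32_simd_stub;
#endif
    return adler32_plain;
}
```

A caller would store the result of pick_adler32() in a function pointer
once and then call through it, so the feature check is not repeated for
every buffer.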

> This gives us help from the compiler for boilerplate code, all the
> fiddling with calling conventions, PIC and bit-ness of the code.
>
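
To illustrate the point about compiler help, here is a tiny example of
GCC extended inline asm. It is only a sketch and has nothing to do with
the actual Adler32 code: add_u32 is a made-up helper. The idea is that
the "r" constraints let the compiler choose the registers, so the same
source builds for 32 and 64 bit, PIC or not, without hand-written
prologue or calling-convention code.

```c
#include <stdint.h>

/* Adds b to a with a single x86 instruction when built by GCC or clang
 * for x86; falls back to plain C everywhere else. */
static inline uint32_t add_u32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* AT&T syntax: addl src, dst.
     * "+r"(a): a is read and written, in a register of the compiler's
     *          choosing.
     * "r"(b):  b is read only, also in any register.
     * "cc":    the instruction changes the flags register. */
    __asm__("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    return a + b;
#endif
}
```

With a standalone assembler such as yasm or nasm, the register choice
and the differences between the 32 and 64 bit calling conventions would
have to be handled by hand or with macros.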
> Here are some numbers:
> An old AMD Athlon64 X2, whose SSE unit is only 64 bits wide
>        -------- orig ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 12100 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 12100 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 12400 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 12700 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 12600 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 12600 ms
>        -------- MMX ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 6700 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 6800 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 6900 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 6800 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 6900 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 6900 ms
>        speedup: 1.805970
>        -------- SSE ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 6800 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 6800 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 6800 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 6900 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 6800 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 6900 ms
>        speedup: 1.779412
>        -------- SSE2 ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 5600 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 5700 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 5600 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 5700 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5600 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 5700 ms
>        speedup: 2.160714
>
> An Intel Core2
>        -------- orig ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 15200 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 14900 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 14900 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 15100 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 14800 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 14900 ms
>        -------- MMX ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 5500 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 5400 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 5500 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 5400 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5400 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 5500 ms
>        speedup: 2.763636
>        -------- SSE ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 5400 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 5300 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 5400 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 5400 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5400 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 5300 ms
>        speedup: 2.814815
>        -------- SSE2 ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 3400 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 3300 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 3400 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 3300 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 3400 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 3300 ms
>        speedup: 4.470588
>        -------- SSSE3 ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 2800 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 2900 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 2800 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 2800 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 2800 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 2900 ms
>        speedup: 5.428571
>
> An AMD Sempron 140 (K10 Architecture)
>        -------- orig ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 7500 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 7500 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 7400 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 7400 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 7800 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 7800 ms
>        -------- MMX ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 4600 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 4600 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 4600 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 4600 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4600 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 4600 ms
>        speedup: 1.630435
>        -------- SSE ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 4100 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 4100 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 4100 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 4200 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4100 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 4100 ms
>        speedup: 1.829268
>        -------- SSE2 ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 1800 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 1800 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 1800 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 1700 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 1800 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 1800 ms
>        speedup: 4.166667
>
> An Intel P4 based Xeon (Nocona, in 64 bit mode)
>        -------- orig ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 21800 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 20900 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 21000 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 21000 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 21900 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 20900 ms
>        -------- SSE2 ------
>               a: 0x0CB4B676, 10000 * 160000 bytes     t: 5900 ms
>               a: 0x25BEB273, 10000 * 159999 bytes     t: 5300 ms
>               a: 0x733CB174, 10000 * 159998 bytes     t: 4900 ms
>               a: 0x1144AF76, 10000 * 159996 bytes     t: 5300 ms
>               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5600 ms
>               a: 0x1902A382, 10000 * 159984 bytes     t: 5400 ms
>        speedup: 3.694915
>
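
The "speedup" lines in the quoted numbers appear to be the ratio of the
first "orig" timing to the first timing of each variant (for example
12100 / 6700 = 1.805970 for MMX on the Athlon64). The sketch below
recomputes that ratio for the fastest variant on each machine and adds a
throughput figure derived from the same rows. The struct and function
names are invented for illustration; the timings are copied from the
tables above.

```c
#include <stddef.h>

/* One benchmark row pair: first "orig" timing and first timing of the
 * fastest variant on that CPU, both in milliseconds. */
struct run {
    const char *cpu;
    const char *variant;
    double t_orig_ms;
    double t_variant_ms;
};

static const struct run best_runs[] = {
    { "Athlon64 X2",   "SSE2",  12100.0, 5600.0 },
    { "Core2",         "SSSE3", 15200.0, 2800.0 },
    { "Sempron 140",   "SSE2",   7500.0, 1800.0 },
    { "Xeon (Nocona)", "SSE2",  21800.0, 5900.0 },
};

/* Speedup as reported in the tables: original time over variant time. */
static double speedup(const struct run *r)
{
    return r->t_orig_ms / r->t_variant_ms;
}

/* Throughput in MB/s (10^6 bytes per second) for the first row of each
 * table, which processes 10000 * 160000 bytes. */
static double throughput_mb_s(double t_ms)
{
    const double bytes = 10000.0 * 160000.0;
    return bytes / (t_ms / 1000.0) / 1e6;
}
```

Looping over best_runs and printing speedup() for each entry should
reproduce the figures quoted for those variants.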



