[Zlib-devel] crc32 big/little endian
John Bowler
jbowler at frontiernet.net
Wed Apr 21 19:57:18 EDT 2010
From: Joakim Tjernlund
>gcc has always had a hard time optimizing crc32. I recently discovered that
>-O1 was noticeably faster than -O2 with gcc 4.3.4 in some crc32 tests I was
>doing a while back.
Wow, you are correct. Silly me - I just blindly assumed that -O1 would be slightly worse than -O2 (and this is *true* on ARM gcc 3.4.4, where -O1 performs worst of all but still much better than -O0). Here's an updated BYFOUR table:
buffer    -O3    -Os    -O2    -O1    -O0
    64  18644  19035  18650  17279  40816
   128  17060  17250  17080  17755  36057
   256  16280  16366  16276  15802  34619
   512  15874  15926  15890  14901  33596
  1024  15902  15928  15903  14650  33742
  2048  15722  15710  15699  14311  32548
  4096  15586  15602  15586  14129  33543
  8192  15624  15590  15587  14080  34835
 16384  18162  18146  18149  17126  37775
That's up to a 10% speed improvement over the next best just by using -O1 (the 128-byte row is the one exception), which is horrible.
>One must help gcc by laying out the C code so it matches
>what you want.
I have enormous problems with a compiler-specific approach, particularly what you are doing, which is to perform common subexpression elimination (CSE) for the compiler, something a decent compiler can do itself (at least it can do the optimizations you did). I admit I automatically do CSE while writing code, but you've rearranged a single variable into separate variables in a way that is highly processor-specific. The original code also has considerable loop unrolling, and that's typically bad on an ARM with branch prediction.
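To make the "single variable versus separate variables" distinction concrete, here is a minimal sketch of a four-table little-endian CRC-32 inner loop written both ways. It is an illustration only, not the patch under discussion: the names make_tables, load_le32, crc32_single and crc32_split are hypothetical, and the word is assembled from bytes so the sketch does not depend on host endianness or alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[4][256];

/* Build four lookup tables for the reflected CRC-32 polynomial 0xEDB88320.
   Table k maps a byte to its CRC contribution after k further zero bytes. */
static void make_tables(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[0][n] = c;
    }
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = crc_table[0][n];
        for (int k = 1; k < 4; k++) {
            c = crc_table[0][c & 0xff] ^ (c >> 8);
            crc_table[k][n] = c;
        }
    }
}

/* Assemble a little-endian 32-bit word from four bytes. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Form A: everything goes through the single variable c. */
static uint32_t crc32_single(uint32_t crc, const unsigned char *buf, size_t len)
{
    uint32_t c = ~crc;
    while (len >= 4) {
        c ^= load_le32(buf);
        c = crc_table[3][c & 0xff] ^ crc_table[2][(c >> 8) & 0xff] ^
            crc_table[1][(c >> 16) & 0xff] ^ crc_table[0][c >> 24];
        buf += 4;
        len -= 4;
    }
    while (len--)   /* remaining 0..3 bytes, one at a time */
        c = crc_table[0][(c ^ *buf++) & 0xff] ^ (c >> 8);
    return ~c;
}

/* Form B: the same computation with each lookup in its own temporary,
   i.e. the scheduling is spelled out by hand for the compiler. */
static uint32_t crc32_split(uint32_t crc, const unsigned char *buf, size_t len)
{
    uint32_t c = ~crc;
    while (len >= 4) {
        uint32_t w  = c ^ load_le32(buf);
        uint32_t t3 = crc_table[3][w & 0xff];
        uint32_t t2 = crc_table[2][(w >> 8) & 0xff];
        uint32_t t1 = crc_table[1][(w >> 16) & 0xff];
        uint32_t t0 = crc_table[0][w >> 24];
        c = (t3 ^ t2) ^ (t1 ^ t0);
        buf += 4;
        len -= 4;
    }
    while (len--)
        c = crc_table[0][(c ^ *buf++) & 0xff] ^ (c >> 8);
    return ~c;
}
```

Both forms return the same CRC once make_tables() has been called; whether either is faster depends on the compiler, the optimization level and the processor, which is the point of contention here.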
I applied your patch to crc32.c then re-ran my 'BYFOUR' test on both x86 (Prescott) and ARM. In both cases I found the *new* code to be slower. Here's a relative table for x86 (>100% is slower - percentage is simply new/old):
buffer   -O3   -Os   -O2   -O1   -O0
    64  104%  111%  104%  111%  104%
   128  114%  113%  114%  109%  113%
   256  111%  110%  111%  114%  104%
   512  109%  108%  109%  116%  100%
  1024  106%  106%  106%  115%   97%
  2048  106%  106%  106%  117%  102%
  4096  107%  107%  106%  118%  102%
  8192  107%  107%  107%  118%  114%
 16384  106%  111%  106%  114%  105%
Likewise on ARM:
buffer   -O3   -Os   -O2   -O1   -O0
    64  103%  101%  103%  101%  105%
   128  106%  104%  107%  103%  109%
   256  108%  106%  110%  104%  111%
   512  109%  108%  111%  104%  113%
  1024  109%  108%  112%  105%  113%
  2048  110%  109%  113%  105%  114%
  4096  110%  109%  113%  105%  114%
  8192  110%  109%  113%  105%  114%
 16384  110%  109%  113%  105%  114%
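For anyone wanting to produce a comparable relative table, the new/old percentage can be computed along the following lines. This is a rough sketch and not the 'BYFOUR' test itself: time_crc, relative_percent and the bitwise crc32_bitwise stand-in are hypothetical names, and clock() is assumed to be a good enough timer for the purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*crc_fn)(uint32_t, const unsigned char *, size_t);

/* Stand-in CRC-32 (bit at a time); any function with this signature works. */
static uint32_t crc32_bitwise(uint32_t crc, const unsigned char *buf, size_t len)
{
    uint32_t c = ~crc;
    while (len--) {
        c ^= *buf++;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
    }
    return ~c;
}

/* Run fn over the buffer reps times, chaining the CRC, and return the
   elapsed CPU time in clock ticks. The final CRC is stored through sink
   so the compiler cannot discard the calls. */
static clock_t time_crc(crc_fn fn, const unsigned char *buf, size_t len,
                        int reps, uint32_t *sink)
{
    clock_t start = clock();
    uint32_t crc = 0;
    for (int i = 0; i < reps; i++)
        crc = fn(crc, buf, len);
    *sink = crc;
    return clock() - start;
}

/* new/old as a rounded integer percentage; above 100 means the new code
   is slower. old_time must be non-zero. */
static unsigned relative_percent(unsigned long new_time, unsigned long old_time)
{
    return (unsigned)((new_time * 100 + old_time / 2) / old_time);
}
```

A table cell is then relative_percent(time of the patched function, time of the original function) for one buffer size and one optimization level, with reps chosen large enough that both times are well above the timer resolution.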
So I'm not seeing an improvement anywhere - which is surprising since you tested it (presumably on gcc 4.3.4) and I could reproduce the -O1 problem.
Of course x86 has notorious variability in the relative performance of indexing instructions, and I was compiling here without a specific -march.
John Bowler <jbowler at acm.org>