[Zlib-devel] crc32 big/little endian

Wed Apr 21 19:57:18 EDT 2010

From: Joakim Tjernlund 
>gcc has always had a hard time optimizing crc32. I recently discovered that
>-O1 was noticeable faster than -O2 with gcc 4.3.4 in some crc32 tests I was
>doing a while back.

Wow, you are correct.  Silly me - I just blindly assumed that -O1 would be slightly worse than -O2 (and this is *true* on ARM gcc 3.4.4, where -O1 performs worst of the optimized builds but is still much better than -O0).  Here's an updated BYFOUR table:

buffer	-O3	-Os	-O2	-O1	-O0
64	18644	19035	18650	17279	40816
128	17060	17250	17080	17755	36057
256	16280	16366	16276	15802	34619
512	15874	15926	15890	14901	33596
1024	15902	15928	15903	14650	33742
2048	15722	15710	15699	14311	32548
4096	15586	15602	15586	14129	33543
8192	15624	15590	15587	14080	34835
16384	18162	18146	18149	17126	37775

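That's up to a 10% speed improvement over the next best just by using -O1 (about 10% at 8192 bytes, less at the smaller sizes, and -O1 actually loses at 128), which is horrible.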
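As a quick check on that figure, here is a minimal sketch of the ratio arithmetic. It assumes the table entries are elapsed times, so smaller is better (the unit is not stated in the post), and the name relative_percent is hypothetical, not part of zlib or of the test program.

```c
#include <stdint.h>

/* Relative cost of one measurement against another, as a rounded
 * percentage (new/old * 100).  A result above 100 means "new" is slower,
 * assuming both inputs are elapsed times (an assumption, see above).
 * Returns 0 if old_time is 0, since the ratio is undefined there. */
unsigned relative_percent(uint32_t new_time, uint32_t old_time)
{
    if (old_time == 0)
        return 0;
    /* The 64-bit intermediate avoids overflow in new_time * 100;
     * adding old_time / 2 rounds to the nearest whole percent. */
    return (unsigned)(((uint64_t)new_time * 100u + old_time / 2u) / old_time);
}
```

For the 8192-byte row, relative_percent(14080, 15587) gives 90, i.e. the -O1 figure is about 90% of the -O2 figure. For the 128-byte row, relative_percent(17755, 17060) gives 104, so -O1 is not the best choice at every size.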

>One must help gcc by laying out the C code so it matches
>what you want. 

I have enormous problems with a compiler-specific approach, particularly what you are doing, which is to perform CSE (common subexpression elimination) for the compiler - something a decent compiler can do itself (at least it can do the optimizations you did).  I admit I automatically do CSE while writing code, but you've rearranged a single variable into separate variables in a way that is highly processor-specific.  The original code also has considerable loop unrolling, and that's typically bad on an ARM with branch prediction.
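To make the two coding styles concrete, here is a minimal sketch of a four-bytes-at-a-time CRC-32 step written both ways. This is neither the patch under discussion nor zlib's crc32.c; the names crc_tab, load_le32, crc32_single and crc32_split are hypothetical, and the word is assembled byte by byte so the sketch does not depend on host endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Four 256-entry tables for the reflected CRC-32 polynomial 0xEDB88320. */
static uint32_t crc_tab[4][256];
static int crc_tab_ready = 0;

static void make_tables(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_tab[0][n] = c;
    }
    /* Table k holds the CRC of byte n followed by k zero bytes. */
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = crc_tab[0][n];
        for (int k = 1; k < 4; k++) {
            c = crc_tab[0][c & 0xff] ^ (c >> 8);
            crc_tab[k][n] = c;
        }
    }
    crc_tab_ready = 1;
}

/* Assemble a little-endian 32-bit word from four bytes. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Style 1: one accumulator; the compiler is left to schedule the
 * four shift/mask/index expressions however it likes. */
uint32_t crc32_single(const unsigned char *buf, size_t len)
{
    if (!crc_tab_ready)
        make_tables();
    uint32_t c = 0xFFFFFFFFu;
    for (; len >= 4; buf += 4, len -= 4) {
        c ^= load_le32(buf);
        c = crc_tab[3][c & 0xff] ^ crc_tab[2][(c >> 8) & 0xff] ^
            crc_tab[1][(c >> 16) & 0xff] ^ crc_tab[0][c >> 24];
    }
    for (; len > 0; len--)   /* tail bytes, one at a time */
        c = crc_tab[0][(c ^ *buf++) & 0xff] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}

/* Style 2: the same step with each table index pulled out into its own
 * temporary by hand.  The result is identical; only the layout of the
 * source changes. */
uint32_t crc32_split(const unsigned char *buf, size_t len)
{
    if (!crc_tab_ready)
        make_tables();
    uint32_t c = 0xFFFFFFFFu;
    for (; len >= 4; buf += 4, len -= 4) {
        uint32_t w  = c ^ load_le32(buf);
        uint32_t b0 = w & 0xff;
        uint32_t b1 = (w >> 8) & 0xff;
        uint32_t b2 = (w >> 16) & 0xff;
        uint32_t b3 = w >> 24;
        c = crc_tab[3][b0] ^ crc_tab[2][b1] ^ crc_tab[1][b2] ^ crc_tab[0][b3];
    }
    for (; len > 0; len--)
        c = crc_tab[0][(c ^ *buf++) & 0xff] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}
```

Both functions compute the same checksum, so any speed difference between them comes down to how a particular compiler and processor handle the two layouts, which is the point in dispute.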

I applied your patch to crc32.c then re-ran my 'BYFOUR' test on both x86 (Prescott) and ARM.  In both cases I found the *new* code to be slower.  Here's a relative table for x86 (>100% is slower - percentage is simply new/old):

buffer	-O3	-Os	-O2	-O1	-O0
64	104%	111%	104%	111%	104%
128	114%	113%	114%	109%	113%
256	111%	110%	111%	114%	104%
512	109%	108%	109%	116%	100%
1024	106%	106%	106%	115%	97%
2048	106%	106%	106%	117%	102%
4096	107%	107%	106%	118%	102%
8192	107%	107%	107%	118%	114%
16384	106%	111%	106%	114%	105%

Likewise on ARM:

buffer	-O3	-Os	-O2	-O1	-O0
64	103%	101%	103%	101%	105%
128	106%	104%	107%	103%	109%
256	108%	106%	110%	104%	111%
512	109%	108%	111%	104%	113%
1024	109%	108%	112%	105%	113%
2048	110%	109%	113%	105%	114%
4096	110%	109%	113%	105%	114%
8192	110%	109%	113%	105%	114%
16384	110%	109%	113%	105%	114%

So I'm not seeing an improvement anywhere - which is surprising since you tested it (presumably on gcc 4.3.4) and I could reproduce the -O1 problem.

Of course x86 has notorious variability in the relative performance of indexing instructions, and I was compiling here without a specific -march.

John Bowler <jbowler at acm.org>