[Zlib-devel] crc32 big/little endian

Wed Apr 21 17:08:50 EDT 2010

From: Török Edwin
> Unless someone beats me to it I'll write a short benchmark code and
> report results.

Thanks for the program... I was surprised by the choice of 16384 byte buffer as input to crc32, so I modified the program (attached) to test buffer size, NOBYFOUR and performance on ARM.

The buffer size choice has a major impact on speed on x86 Prescott but optimization levels (so long as optimization is done) only have a small effect:

buffer	-O3	-Os	-O2	-O0
64	18644	19035	18650	40816
128	17060	17250	17080	36057
256	16280	16366	16276	34619
512	15874	15926	15890	33596
1024	15902	15928	15903	33742
2048	15722	15710	15699	32548
4096	15586	15602	15586	33543
8192	15624	15590	15587	34835
16384	18162	18146	18149	37775
text	13473	12481	12293	13746
data	296	296	296	296
bss	16420	16396	16420	16420
total	30189	29173	29709	30462
error	<1%	<1%	<1%	5-10%

That's compiled with gcc 4.3.4.  The 'optimal' buffer size is 4096 or 8192 bytes.  A 16384 byte buffer has an (approximately) 16% speed cost relative to those.  Decreasing to 512 bytes has little effect on speed.
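
For readers without the attachment, here is a minimal sketch of the kind of measurement described above. It is not the attached crc32test.c. The names crc32_update and time_crc are hypothetical, the CRC is a simple bitwise stand-in rather than zlib's crc32, and reusing one buffer of the given size for every call is an assumption about how the test is structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for zlib's crc32(): bitwise CRC-32, no tables. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? 0xEDB88320u ^ (crc >> 1) : crc >> 1;
    }
    return ~crc;
}

/* Push `total` bytes through the CRC in calls of at most `chunk` bytes,
 * reusing a single buffer (an assumption).  Returns the CPU time in
 * microseconds and stores the final CRC in *out so the work cannot be
 * optimized away. */
static double time_crc(size_t chunk, size_t total, uint32_t *out)
{
    unsigned char *buf = malloc(chunk);
    uint32_t crc = 0;
    clock_t start, end;

    assert(chunk > 0 && buf != NULL);
    for (size_t i = 0; i < chunk; i++)          /* arbitrary fill pattern */
        buf[i] = (unsigned char)(i * 31u + 7u);

    start = clock();
    for (size_t done = 0; done < total; done += chunk) {
        size_t n = total - done < chunk ? total - done : chunk;
        crc = crc32_update(crc, buf, n);
    }
    end = clock();

    free(buf);
    *out = crc;
    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}
```

Calling time_crc once per buffer size (64, 128, ... 16384) with the same total, and repeating each call a few times, gives one column of a table like the one above. For real numbers the stand-in should be replaced by zlib's crc32, since the table lookups are part of what is being measured.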

On x86, repeating these experiments with -DNOBYFOUR the times go up by around a factor of 2.5 throughout.
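
To make the BYFOUR/NOBYFOUR distinction concrete, here is a hedged sketch of the two inner loops: one table lookup per input byte versus four lookups per 32-bit word. The function names are hypothetical and this is not zlib's source. In particular, this version assembles each word from individual bytes so that it behaves the same on big and little endian machines, whereas zlib's BYFOUR code reads aligned 32-bit words directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t tab[4][256];

/* Build the byte-wise table (tab[0]) and the three extra tables needed
 * to consume four bytes per step. */
static void make_tables(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        tab[0][n] = c;
    }
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = tab[0][n];
        for (int k = 1; k < 4; k++) {
            c = tab[0][c & 0xff] ^ (c >> 8);
            tab[k][n] = c;
        }
    }
}

/* NOBYFOUR style: one lookup per byte. */
static uint32_t crc_bytewise(uint32_t crc, const unsigned char *p, size_t len)
{
    assert(tab[0][1] != 0);                 /* make_tables() must run first */
    crc = ~crc;
    while (len--)
        crc = tab[0][(crc ^ *p++) & 0xff] ^ (crc >> 8);
    return ~crc;
}

/* BYFOUR style: fold in four bytes, then four independent lookups. */
static uint32_t crc_byfour(uint32_t crc, const unsigned char *p, size_t len)
{
    assert(tab[0][1] != 0);                 /* make_tables() must run first */
    crc = ~crc;
    while (len >= 4) {
        crc ^= (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
        crc = tab[3][crc & 0xff] ^ tab[2][(crc >> 8) & 0xff] ^
              tab[1][(crc >> 16) & 0xff] ^ tab[0][crc >> 24];
        p += 4;
        len -= 4;
    }
    while (len--)                           /* trailing 0-3 bytes */
        crc = tab[0][(crc ^ *p++) & 0xff] ^ (crc >> 8);
    return ~crc;
}
```

Both functions return the same value for the same input. The four-at-a-time version uses four times as much table data (4 KB rather than 1 KB here), which is consistent with the difference in text size between the BYFOUR and NOBYFOUR builds.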

On ARM, however, using gcc 3.4.4 (gcc 4 probably has substantially better ARM support), a different picture emerges.  The buffer size behavior no longer occurs (this is an XScale ARM system running on a LinkSys NSLU - SlugOS), but -O2 now consistently gives the best performance, and the penalty of using NOBYFOUR is almost gone; indeed there is a speed improvement over the BYFOUR code for the 64 byte buffer (50528 us vs 51879 us).  Here are the 'BYFOUR' figures (this is for 10 times less data than on x86):

buffer	-O3	-Os	-O2	-O0
64	53476	53147	51879	142842
128	47443	46498	45408	123773
256	44212	43190	41955	114270
512	42613	41508	40210	109485
1024	41816	40701	39346	107129
2048	41406	40261	38919	105972
4096	41214	40080	38719	105328
8192	41132	39955	38604	105046
16384	41079	39906	38567	104928
text	17651	17427	17471	21083
data	308	308	308	308
bss	16392	16392	16392	16392
total	34351	34127	34171	37783
error	0.2-0.5%	0.2-0.5%	0.2-0.6%	0.1-0.3%

And with -DNOBYFOUR, as a percentage of the above:

buffer	-O3	-Os	-O2	-O0
64	105%	95%	97%	111%
128	114%	105%	107%	125%
256	120%	112%	114%	134%
512	123%	115%	117%	139%
1024	125%	117%	119%	142%
2048	126%	118%	120%	143%
4096	127%	118%	121%	144%
8192	127%	119%	121%	144%
16384	127%	119%	121%	144%
text	8787	8603	8643	10403
data	308	308	308	308
bss	16392	16392	16392	16392
total	25487	25303	25343	27103
error	0.2-0.5%	0.2-0.5%	0.2-0.3%	0.1-0.2%

The best speed on ARM is 38567 us, obtained with -O2, BYFOUR and a 16 Kbyte buffer.  The best NOBYFOUR speed is 21% slower with the same buffer and optimization settings.

Conclusion?  Well, there is no conclusion - the best approach depends on compiler, architecture and, perhaps most telling, the size of the buffers coming in to the crc32 function - something that may be difficult to control.

John Bowler <jbowler at acm.org>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: crc32test.c
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20100421/57a224cd/attachment.c>