[Zlib-devel] crc32 big/little endian
Joakim Tjernlund
joakim.tjernlund at transmode.se
Thu Apr 22 02:42:19 EDT 2010
"John Bowler" <jbowler at frontiernet.net> wrote on 2010/04/22 01:57:18:
>
> From: Joakim Tjernlund
> >gcc has always had a hard time optimizing crc32. I recently discovered that
> >-O1 was noticeably faster than -O2 with gcc 4.3.4 in some crc32 tests I was
> >doing a while back.
>
> Wow, you are correct. Silly me - I just blindly assumed that -O1 would be
> slightly worse than -O2 (and this is *true* on ARM gcc 3.4.4 where -O1
> performs worst of all but still much better than -O0). Here's an updated BYFOUR table:
>
> buffer -O3 -Os -O2 -O1 -O0
> 64 18644 19035 18650 17279 40816
> 128 17060 17250 17080 17755 36057
> 256 16280 16366 16276 15802 34619
> 512 15874 15926 15890 14901 33596
> 1024 15902 15928 15903 14650 33742
> 2048 15722 15710 15699 14311 32548
> 4096 15586 15602 15586 14129 33543
> 8192 15624 15590 15587 14080 34835
> 16384 18162 18146 18149 17126 37775
>
> That's a 10% speed improvement over the next best by using -O1, horrible.
I think this is a gcc 4.3.4 issue only; I have been told that later gcc versions do better.
>
> >One must help gcc by laying out the C code so it matches
> >what you want.
>
> I have enormous problems with a compiler specific approach, particularly what
> you are doing which is to perform CSE for the compiler - something a decent
> compiler can do itself (at least it can do the optimizations you did.) I
> admit I automatically do CSE while writing code, but you've rearranged a
> single variable into separate variables in a way that is highly processor
> specific. The original code also has considerable loop unrolling, and that's
> typically bad on an ARM with branch prediction.
I don't think these are particularly processor specific (except for the unrolling).
Too many times I have seen gcc fail to optimize a
    while (len >= 4) {
        xxxx;
        len -= 4;
    }
into a proper loop where you test against != 0 instead:
    for (len = len >> 2; len; --len)
        xxxx;
This kind of optimization is pretty much generic for any CPU I think.
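
To make the two loop shapes concrete, here is a minimal, self-contained sketch. The names fold_word, sum_while and sum_countdown are hypothetical, and the loop body is only a stand-in for "xxxx" (a rotate-and-xor over one 32-bit word); it is not zlib's CRC step. Whether the second form is actually faster depends on the compiler and target, so treat it as something to measure rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the "xxxx" loop body: fold one 32-bit word
   into a running value. This is NOT zlib's CRC step. */
static uint32_t fold_word(uint32_t acc, const unsigned char *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);            /* alignment-safe 4-byte load */
    return ((acc << 5) | (acc >> 27)) ^ w;
}

/* Form 1: compare len against 4 on every iteration. */
uint32_t sum_while(const unsigned char *buf, size_t len)
{
    uint32_t acc = 0;
    while (len >= 4) {
        acc = fold_word(acc, buf);
        buf += 4;
        len -= 4;
    }
    return acc;
}

/* Form 2: precompute the word count, then count down and test against 0. */
uint32_t sum_countdown(const unsigned char *buf, size_t len)
{
    uint32_t acc = 0;
    size_t n;
    for (n = len >> 2; n; --n) {
        acc = fold_word(acc, buf);
        buf += 4;
    }
    return acc;
}
```

Both functions visit the same len / 4 whole words, so they return the same value for any length. One difference to keep in mind: the first form leaves the tail byte count in len when the loop ends, while the count-down form as written in the mail overwrites len, so any leftover bytes (len & 3) have to be saved before the loop. The sketch sidesteps this by counting in a separate variable n.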
>
> I applied your patch to crc32.c then re-ran my 'BYFOUR' test on both x86
> (Prescott) and ARM. In both cases I found the *new* code to be slower.
> Here's a relative table for x86 (>100% is slower - percentage is simply new/old):
I would think this comes from removing the unrolling. I didn't think that would matter
much; unrolling 32 bytes feels like overkill, though. 8 or 16 should work too.
>
> buffer -O3 -Os -O2 -O1 -O0
> 64 104% 111% 104% 111% 104%
> 128 114% 113% 114% 109% 113%
> 256 111% 110% 111% 114% 104%
> 512 109% 108% 109% 116% 100%
> 1024 106% 106% 106% 115% 97%
> 2048 106% 106% 106% 117% 102%
> 4096 107% 107% 106% 118% 102%
> 8192 107% 107% 107% 118% 114%
> 16384 106% 111% 106% 114% 105%
>
> Likewise on ARM:
>
> buffer -O3 -Os -O2 -O1 -O0
> 64 103% 101% 103% 101% 105%
> 128 106% 104% 107% 103% 109%
> 256 108% 106% 110% 104% 111%
> 512 109% 108% 111% 104% 113%
> 1024 109% 108% 112% 105% 113%
> 2048 110% 109% 113% 105% 114%
> 4096 110% 109% 113% 105% 114%
> 8192 110% 109% 113% 105% 114%
> 16384 110% 109% 113% 105% 114%
>
> So I'm not seeing an improvement anywhere - which is surprising since you
> tested it (presumably on gcc 4.3.4) and I could reproduce the -O1 problem.
Oh, I didn't test this particular implementation. I tested another crc implementation
I worked on a while back. This was just a first shot.
Jocke