[Zlib-devel] crc32 big/little endian

Joakim Tjernlund joakim.tjernlund at transmode.se
Thu Apr 22 02:42:19 EDT 2010


"John Bowler" <jbowler at frontiernet.net> wrote on 2010/04/22 01:57:18:
>
> From: Joakim Tjernlund
> >gcc has always had a hard time optimizing crc32. I recently discovered that
> >-O1 was noticeably faster than -O2 with gcc 4.3.4 in some crc32 tests I was
> >doing a while back.
>
> Wow, you are correct.  Silly me - I just blindly assumed that -O1 would be
> slightly worse than -O2 (and this is *true* on ARM gcc 3.4.4 where -O1
> performs worst of all but still much better than -O0).  Here's an updated BYFOUR table:
>
> buffer   -O3   -Os   -O2   -O1   -O0
> 64   18644   19035   18650   17279   40816
> 128   17060   17250   17080   17755   36057
> 256   16280   16366   16276   15802   34619
> 512   15874   15926   15890   14901   33596
> 1024   15902   15928   15903   14650   33742
> 2048   15722   15710   15699   14311   32548
> 4096   15586   15602   15586   14129   33543
> 8192   15624   15590   15587   14080   34835
> 16384   18162   18146   18149   17126   37775
>
> That's a 10% speed improvement over the next best by using -O1, horrible.

I think this is a gcc 4.3.4 issue only; I've been told later gccs do better.

>
> >One must help gcc by laying out the C code so it matches
> >what you want.
>
> I have enormous problems with a compiler specific approach, particularly what
> you are doing which is to perform CSE for the compiler - something a decent
> compiler can do itself (at least it can do the optimizations you did.)  I
> admit I automatically do CSE while writing code, but you've rearranged a
> single variable into separate variables in a way that is highly processor
> specific.  The original code also has considerable loop unrolling, and that's
> typically bad on an ARM with branch prediction.

I don't think these are particularly processor specific (except for the unrolling).
Too many times I have seen gcc fail to optimize a

    while (len >= 4) {
        xxxx;
        len -= 4;
    }

into a counted loop where the exit test is against != 0 instead:

    for (len = len >> 2; len; --len)
        xxxx;

This kind of optimization is pretty much generic for any CPU, I think.
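To make the two loop shapes concrete, here is a small self-contained sketch. The byte-summing body is a hypothetical stand-in for the real CRC work; the point is only the loop structure, where the second form hoists the division out of the loop and counts down to zero so the compiler can use a cheap decrement-and-branch-if-non-zero idiom:

```c
#include <stddef.h>
#include <stdint.h>

/* Shape 1: subtract 4 and re-test len >= 4 every iteration. */
static uint32_t sum_ge_loop(const unsigned char *buf, size_t len)
{
    uint32_t s = 0;
    while (len >= 4) {
        s += buf[0] + buf[1] + buf[2] + buf[3];
        buf += 4;
        len -= 4;
    }
    return s;
}

/* Shape 2: compute the trip count once, then count down to zero. */
static uint32_t sum_count_loop(const unsigned char *buf, size_t len)
{
    uint32_t s = 0;
    size_t n;
    for (n = len >> 2; n; --n) {
        s += buf[0] + buf[1] + buf[2] + buf[3];
        buf += 4;
    }
    return s;
}
```

Both produce identical results for any buffer whose length is a multiple of 4; a real CRC routine would of course still need a tail loop for the leftover bytes.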

>
> I applied your patch to crc32.c then re-ran my 'BYFOUR' test on both x86
> (Prescott) and ARM.  In both cases I found the *new* code to be slower.
> Here's a relative table for x86 (>100% is slower - percentage is simply new/old):

I would think this comes from removing the unrolling; I didn't think that would matter
much. Unrolling by 32 bytes feels like overkill though; 8 or 16 should work too.
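For illustration, the kind of unrolling under discussion can be sketched with the classic single-table byte-at-a-time CRC-32 (reflected polynomial 0xEDB88320, zlib-style crc32(0, buf, len) convention). This is not the BYFOUR code from the thread, just a minimal example of an inner loop unrolled by 8 with a scalar tail loop:

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t table[256];

/* Build the standard single CRC-32 lookup table. */
static void make_table(void)
{
    for (uint32_t n = 0; n < 256; ++n) {
        uint32_t c = n;
        for (int k = 0; k < 8; ++k)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        table[n] = c;
    }
}

/* Plain byte-at-a-time loop. */
static uint32_t crc32_plain(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}

/* Same computation, inner loop unrolled by 8 bytes. */
static uint32_t crc32_unroll8(uint32_t crc, const unsigned char *buf, size_t len)
{
    size_t n;
    crc = ~crc;
#define DO1 crc = table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8)
    for (n = len >> 3; n; --n) {
        DO1; DO1; DO1; DO1; DO1; DO1; DO1; DO1;
    }
    for (len &= 7; len; --len)
        DO1;
#undef DO1
    return ~crc;
}
```

Both variants must agree bit-for-bit; whether the unrolled one is faster depends on the CPU's branch prediction and I-cache, which is exactly the trade-off being measured above.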

>
> buffer   -O3   -Os   -O2   -O1   -O0
> 64   104%   111%   104%   111%   104%
> 128   114%   113%   114%   109%   113%
> 256   111%   110%   111%   114%   104%
> 512   109%   108%   109%   116%   100%
> 1024   106%   106%   106%   115%   97%
> 2048   106%   106%   106%   117%   102%
> 4096   107%   107%   106%   118%   102%
> 8192   107%   107%   107%   118%   114%
> 16384   106%   111%   106%   114%   105%
>
> Likewise on ARM:
>
> buffer   -O3   -Os   -O2   -O1   -O0
> 64   103%   101%   103%   101%   105%
> 128   106%   104%   107%   103%   109%
> 256   108%   106%   110%   104%   111%
> 512   109%   108%   111%   104%   113%
> 1024   109%   108%   112%   105%   113%
> 2048   110%   109%   113%   105%   114%
> 4096   110%   109%   113%   105%   114%
> 8192   110%   109%   113%   105%   114%
> 16384   110%   109%   113%   105%   114%
>
> So I'm not seeing an improvement anywhere - which is surprising since you
> tested it (presumably on gcc 4.3.4) and I could reproduce the -O1 problem.

Oh, I didn't test this particular implementation; I tested some other crc implementation
I worked on a while back. This was just a first shot.

 Jocke
