[Zlib-devel] [7/8][RFC V3 Patch] Blackfin implementation

Mon May 9 11:09:56 EDT 2011

here's some feedback i got internally:

I spent a little more time on it today, but the code looks pretty well 
optimized.  Actually this is an impressive piece of work.

As long as the input data is cached, this code is going to run pretty 
quickly for longer vector lengths; the inner loop is pretty tight.

   68:  b2 e0 25 00     LSETUP(0x6c<_adler32_vec+0x6c>, 0xb2<_adler32_vec+0xb2>) LC1 = P0;
   6c:  7d 51           R5 = R5 + R7;
   6e:  12 cc 00 c0     DISALGNEXCPT || R1 = [I0 ++ M0] || NOP;
   72:  81 9d 00 00
   76:  18 c4 c0 4d     (R7, R6) = BYTEUNPACK R1:0;
   7a:  81 c8 32 0a     A1 += R6.L * R2.L, A0 += R6.L * R2.H (FU) || R4 = [I2 ++ M0] || R0 = [I3 ++ M0];
   7e:  94 9d 98 9d
   82:  81 c0 33 8e     A1 += R6.H * R3.L, A0 += R6.H * R3.H (FU);
   86:  81 c8 3c 0a     A1 += R7.L * R4.L, A0 += R7.L * R4.H (FU) || R2 = [I2 ++ M0] || R3 = [I3 ++ M0];
   8a:  92 9d 9b 9d
   8e:  81 c0 38 8e     A1 += R7.H * R0.L, A0 += R7.H * R0.H (FU);
   92:  12 cc 00 c0     DISALGNEXCPT || R0 = [I0 ++ M0] || NOP;
   96:  80 9d 00 00
   9a:  18 c4 c0 6d     (R7, R6) = BYTEUNPACK R1:0 (R);
   9e:  81 c8 32 0a     A1 += R6.L * R2.L, A0 += R6.L * R2.H (FU) || R4 = [I2 ++ M0] || R1 = [I3 ++ M0];
   a2:  94 9d 99 9d
   a6:  81 c0 33 8e     A1 += R6.H * R3.L, A0 += R6.H * R3.H (FU);
   aa:  81 c8 3c 0a     A1 += R7.L * R4.L, A0 += R7.L * R4.H (FU) || R2 = [I2 ++ M0] || R3 = [I3 ++ M0];
   ae:  92 9d 9b 9d
   b2:  8d c0 b9 af     R7 = (A1 += R7.H * R1.L), R6 = (A0 += R7.H * R1.H) (FU);
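
For anyone following the paired MACs above, here is a rough C sketch of the block-wise identity that vectorized Adler-32 code commonly relies on: over a block of n bytes, s1 grows by the plain byte sum and s2 grows by n*s1 plus a weighted byte sum. This is not taken from the patch. The function names, the `BLOCK` size of 16 and the `stdint.h` types are assumptions for illustration, and whether vord_e/vord_o hold weights of exactly this form is only a guess from the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u   /* largest prime below 65536 */
#define BLOCK      16u      /* hypothetical block size, not the patch's */

/* Byte-at-a-time form: s1 is the running byte sum, s2 the running sum of s1. */
static uint32_t adler32_bytewise(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER_BASE;
        s2 = (s2 + s1) % ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/* Block form: for a block b[0..n-1],
 *   s1' = s1 + sum(b[i])
 *   s2' = s2 + n*s1 + sum((n - i) * b[i])
 * The two inner sums are independent multiply-accumulates, which is the
 * shape a dual-MAC unit can evaluate in parallel. */
static uint32_t adler32_blockwise(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;
    while (len > 0) {
        uint32_t n = len < BLOCK ? (uint32_t)len : BLOCK;
        uint32_t sum = 0, wsum = 0;
        for (uint32_t i = 0; i < n; i++) {
            sum  += buf[i];             /* weight 1 for s1 */
            wsum += (n - i) * buf[i];   /* descending weights for s2 */
        }
        s2 = (s2 + n * s1 + wsum) % ADLER_BASE;   /* uses s1 from before the block */
        s1 = (s1 + sum) % ADLER_BASE;
        buf += n;
        len -= n;
    }
    return (s2 << 16) | s1;
}
```

Both functions should return the same value for any input; the block form just trades the serial dependency on s1 for table-driven multiplies.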

If the 'buf' (I0) input data is not cached, the speed is going to be 
bound by the speed of the external memory interface.  The inner loop is 
13 instructions with two 32-bit fetches from I0.  Depending on the 
SCLK:CCLK ratio, and the memory type, the core will be waiting around 
for completion of memory fetches.  In that case all this optimization 
work will not be effective.

If zlib has very recently read/written the input data, and write-back 
cache is turned on, and the data block size is less than (say) 16kB then 
the speed up could be realized.  I guess taking benchmarks of zlib is 
what you are doing?

vord_e and vord_o (I2, I3) should ideally be placed in separate cache 
banks, or else this is going to stall a lot.

This section could be slightly optimized, but it is only a cleanup 
loop of a few iterations.  Really not worth the trouble.

  11e:  b2 e0 04 20     LSETUP(0x122<_adler32_vec+0x122>, 0x126<_adler32_vec+0x126>) LC1 = P2;
  122:  08 98           R0 = B[P1++] (Z);
  124:  c7 51           R7 = R7 + R0;
  126:  be 51           R6 = R6 + R7;

I am not sure of the need to set the circular buffer registers for I0?  
Unless it is for added security against bugs/buffer overflows...

    e:  12 32           P2 = R2;      /* len parameter */

   22:  01 34           I0 = R1;      /* buf parameter */
   24:  01 36           B0 = R1;
   26:  4a 32           P1 = P2;

   2c:  21 6c           P1 += 0x4;              /* (  4) */    // Why add four here?  B/c of disaligned fetch???

   34:  61 36           L0 = P1;

Testing
----------

The most important part is testing.  The author should compare results 
with the reference code in a test shell.

Here is the API function.

local noinline uLong adler32_vec(adler, buf, len)
     uLong adler;
     const Bytef *buf;
     uInt len;
{

Some tests should include

a/ changing alignment of buf, to test various combinations.
b/ len: various 0..VNMAX, >> VNMAX
c/ fill buffer with random data, all 0's, all 0xffffffff etc
d/ 'adler' input vary with various input values, 0, 0xffffffff, random

All I can think of for now.
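
A test shell along those lines might look roughly like the sketch below. It is only an outline under stated assumptions: `TEST_VNMAX` uses zlib's NMAX (5552) as a stand-in because the patch's VNMAX value is not shown here, `adler32_standin` is a hypothetical placeholder for the routine under test, the `0x12345678` starting value is arbitrary, and plain `stdint.h` types replace uLong/Bytef/uInt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BASE       65521u   /* Adler-32 modulus */
#define TEST_VNMAX 5552u    /* assumption: zlib's NMAX, standing in for the patch's VNMAX */
#define MAX_ALIGN  8u       /* buffer start offsets 0..7 are tried */
#define MAX_LEN    (4u * TEST_VNMAX + 3u)

typedef uint32_t (*adler32_fn)(uint32_t adler, const unsigned char *buf, size_t len);

/* Reference: one byte at a time, reducing after every step. */
static uint32_t adler32_ref(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu, s2 = (adler >> 16) & 0xffffu;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}

/* Hypothetical stand-in for the routine under test: defers the modulo
 * to the end of each TEST_VNMAX-byte chunk, as scalar zlib does. */
static uint32_t adler32_standin(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu, s2 = (adler >> 16) & 0xffffu;
    while (len > 0) {
        size_t n = len < TEST_VNMAX ? len : TEST_VNMAX;
        len -= n;
        while (n--) {
            s1 += *buf++;
            s2 += s1;
        }
        s1 %= BASE;
        s2 %= BASE;
    }
    return (s2 << 16) | s1;
}

static unsigned char storage[MAX_LEN + MAX_ALIGN];

/* Fill pattern 0: all 0x00, 1: all 0xff, anything else: pseudo-random. */
static void fill(unsigned char *p, size_t n, int pattern)
{
    uint32_t state = 12345u;   /* fixed seed keeps failures reproducible */
    if (pattern == 0) { memset(p, 0x00, n); return; }
    if (pattern == 1) { memset(p, 0xff, n); return; }
    for (size_t i = 0; i < n; i++) {
        state = state * 1664525u + 1013904223u;   /* simple LCG */
        p[i] = (unsigned char)(state >> 24);
    }
}

/* Returns the number of mismatches between candidate and reference,
 * covering a/ alignment, b/ length, c/ fill and d/ starting adler. */
static int compare_adler32(adler32_fn candidate)
{
    static const size_t lens[] = { 0, 1, 2, 3, 4, 5, 7, 8, 15, 16, 17, 31, 32, 33,
        TEST_VNMAX - 1, TEST_VNMAX, TEST_VNMAX + 1, 2 * TEST_VNMAX, MAX_LEN };
    static const uint32_t seeds[] = { 0u, 1u, 0xffffffffu, 0x12345678u };
    int mismatches = 0;

    for (int pattern = 0; pattern < 3; pattern++) {
        fill(storage, sizeof storage, pattern);
        for (size_t off = 0; off < MAX_ALIGN; off++)
            for (size_t i = 0; i < sizeof lens / sizeof lens[0]; i++)
                for (size_t j = 0; j < sizeof seeds / sizeof seeds[0]; j++) {
                    uint32_t want = adler32_ref(seeds[j], storage + off, lens[i]);
                    uint32_t got = candidate(seeds[j], storage + off, lens[i]);
                    if (want != got) {
                        if (mismatches < 10)   /* cap the noise */
                            fprintf(stderr, "pattern %d off %zu len %zu adler %08lx: "
                                    "want %08lx got %08lx\n", pattern, off, lens[i],
                                    (unsigned long)seeds[j], (unsigned long)want,
                                    (unsigned long)got);
                        mismatches++;
                    }
                }
    }
    return mismatches;
}
```

A driver would call `compare_adler32(adler32_standin)` and treat a non-zero return as failure. To exercise the real routine, the prototype would need adapting, since adler32_vec is declared `local` (static) in its file and takes zlib's own types.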