[Zlib-devel] [7/8][RFC V3 Patch] Blackfin implementation
Mike Frysinger
vapier at gentoo.org
Mon May 9 11:09:56 EDT 2011
here's some feedback i got internally:
I spent a little more time on it today, but the code looks pretty well
optimized. Actually this is an impressive piece of work.
As long as the input data is cached, this code is going to run pretty
quickly for longer vector lengths; the inner loop is pretty tight.
68: b2 e0 25 00 LSETUP(0x6c<_adler32_vec+0x6c>, 0xb2<_adler32_vec+0xb2>) LC1 = P0;
6c: 7d 51 R5 = R5 + R7;
6e: 12 cc 00 c0 DISALGNEXCPT || R1 = [I0 ++ M0] || NOP;
72: 81 9d 00 00
76: 18 c4 c0 4d (R7, R6) = BYTEUNPACK R1:0;
7a: 81 c8 32 0a A1 += R6.L * R2.L, A0 += R6.L * R2.H (FU) || R4 = [I2 ++ M0] || R0 = [I3 ++ M0];
7e: 94 9d 98 9d
82: 81 c0 33 8e A1 += R6.H * R3.L, A0 += R6.H * R3.H (FU);
86: 81 c8 3c 0a A1 += R7.L * R4.L, A0 += R7.L * R4.H (FU) || R2 = [I2 ++ M0] || R3 = [I3 ++ M0];
8a: 92 9d 9b 9d
8e: 81 c0 38 8e A1 += R7.H * R0.L, A0 += R7.H * R0.H (FU);
92: 12 cc 00 c0 DISALGNEXCPT || R0 = [I0 ++ M0] || NOP;
96: 80 9d 00 00
9a: 18 c4 c0 6d (R7, R6) = BYTEUNPACK R1:0 (R);
9e: 81 c8 32 0a A1 += R6.L * R2.L, A0 += R6.L * R2.H (FU) || R4 = [I2 ++ M0] || R1 = [I3 ++ M0];
a2: 94 9d 99 9d
a6: 81 c0 33 8e A1 += R6.H * R3.L, A0 += R6.H * R3.H (FU);
aa: 81 c8 3c 0a A1 += R7.L * R4.L, A0 += R7.L * R4.H (FU) || R2 = [I2 ++ M0] || R3 = [I3 ++ M0];
ae: 92 9d 9b 9d
b2: 8d c0 b9 af R7 = (A1 += R7.H * R1.L), R6 = (A0 += R7.H * R1.H) (FU);
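For readers who do not read Blackfin assembly, the paired multiply-accumulates
in the loop above look like the usual blocked form of Adler-32: one
accumulator gathers a plain byte sum, the other a position-weighted sum.
The C below is only a sketch of that general technique, not a transcription
of the patch. BLOCK, the function names, and the weight layout are
assumptions made for illustration; the actual contents of vord_e and vord_o
are not shown in this mail.

```c
#include <stdint.h>
#include <stddef.h>

#define ADLER_BASE 65521u  /* largest prime below 65536 */
#define BLOCK 16u          /* illustrative block size, NOT taken from the patch */

/* Byte-at-a-time Adler-32, used as the comparison baseline. */
static uint32_t adler32_bytewise(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff, s2 = adler >> 16;
    while (len--) {
        s1 = (s1 + *buf++) % ADLER_BASE;
        s2 = (s2 + s1) % ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/* Blocked form. For a block of n bytes b[0..n-1]:
 *   s1' = s1 + sum(b[i])
 *   s2' = s2 + n*s1 + sum((n - i) * b[i])
 * The two sums are what a dual-MAC unit can accumulate in parallel. */
static uint32_t adler32_blocked(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff, s2 = adler >> 16;
    while (len) {
        uint32_t n = len < BLOCK ? (uint32_t)len : BLOCK;
        uint32_t plain = 0, weighted = 0;
        for (uint32_t i = 0; i < n; i++) {
            plain    += buf[i];            /* weight 1 */
            weighted += (n - i) * buf[i];  /* weight n, n-1, ..., 1 */
        }
        s2 = (s2 + n * s1 + weighted) % ADLER_BASE; /* uses the old s1 */
        s1 = (s1 + plain) % ADLER_BASE;
        buf += n;
        len -= n;
    }
    return (s2 << 16) | s1;
}
```

With a 16-byte block the intermediate sums stay far below 2^32, so one
reduction per block is enough in this sketch; a real implementation would
pick its block size (and how often it reduces) to suit the accumulator width.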
If the 'buf' (I0) input data is not cached, the speed is going to be
bound by the speed of the external memory interface. The inner loop is
13 instructions with two 32-bit fetches from I0. Depending on the
SCLK:CCLK ratio and the memory type, the core will be waiting around
for memory fetches to complete. In that case all this optimization
work will not be effective.
If zlib has very recently read/written the input data, and write-back
cache is turned on, and the data block size is less than (say) 16 kB, then
the speedup could be realized. I guess taking benchmarks of zlib is
what you are doing?
vord_e and vord_o (I2, I3) should ideally be placed in separate cache
banks, or else this is going to stall a lot.
This section could be slightly optimized, but it is only a cleanup
loop of a few iterations. Really not worth the trouble.
11e: b2 e0 04 20 LSETUP(0x122<_adler32_vec+0x122>, 0x126<_adler32_vec+0x126>) LC1 = P2;
122: 08 98 R0 = B[P1++] (Z);
124: c7 51 R7 = R7 + R0;
126: be 51 R6 = R6 + R7;
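Read literally, the three loop instructions above are the plain scalar
Adler-32 step. A rough C equivalent, assuming R7 and R6 hold the two running
sums and that the modulo reduction happens somewhere outside this loop
(neither is confirmed by the listing, and adler32_tail is a made-up name):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar tail loop: consume the remaining `len` bytes one at a time.
 * No reduction modulo 65521 is done here; the caller is assumed to do it. */
static void adler32_tail(uint32_t *s1, uint32_t *s2, const uint8_t *p, size_t len)
{
    while (len--) {
        *s1 += *p++;   /* R0 = B[P1++] (Z);  R7 = R7 + R0 */
        *s2 += *s1;    /* R6 = R6 + R7 */
    }
}
```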
I am not sure why the circular buffer registers need to be set for I0.
Unless it is for added security against bugs/buffer overflows...
e: 12 32 P2 = R2; /* len parameter */
22: 01 34 I0 = R1; /* buf parameter */
24: 01 36 B0 = R1;
26: 4a 32 P1 = P2;
2c: 21 6c P1 += 0x4; /* ( 4) */ // Why add four here? B/c of disaligned fetch???
34: 61 36 L0 = P1;
Testing
----------
The important one is testing. The author should compare results with
the reference code in a test shell.
Here is the API function.
local noinline uLong adler32_vec(adler, buf, len)
uLong adler;
const Bytef *buf;
uInt len;
{
Some tests should include
a/ changing the alignment of buf, to test various combinations.
b/ len: various values in 0..VNMAX, and values much larger than VNMAX
c/ filling the buffer with random data, all 0's, all 0xffffffff etc
d/ varying the 'adler' input: 0, 0xffffffff, random
All I can think of for now.
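A minimal sketch of such a test shell follows. Everything in it is a
stand-in: TEST_VNMAX is a placeholder for the patch's real VNMAX,
adler32_candidate is a deferred-modulo C routine standing in for the code
under test, and the reference reduces the incoming 'adler' halves modulo
65521 up front so that out-of-range inputs such as 0xffffffff compare
consistently (an assumption; the real routine may treat them differently).

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define BASE 65521u
#define NMAX 5552u        /* zlib's deferred-modulo chunk size */
#define TEST_VNMAX 4096u  /* placeholder: substitute the patch's real VNMAX */
#define MAX_ALIGN 8u      /* buffer start offsets 0..7 are exercised */

typedef uint32_t (*adler_fn)(uint32_t adler, const uint8_t *buf, size_t len);

/* Reference: byte at a time, reducing after every step. */
static uint32_t adler32_ref(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = (adler & 0xffff) % BASE, s2 = (adler >> 16) % BASE;
    while (len--) {
        s1 = (s1 + *buf++) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}

/* Stand-in for the routine under test: reduces once per NMAX bytes. */
static uint32_t adler32_candidate(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = (adler & 0xffff) % BASE, s2 = (adler >> 16) % BASE;
    while (len) {
        size_t n = len < NMAX ? len : NMAX;
        len -= n;
        while (n--) { s1 += *buf++; s2 += s1; }
        s1 %= BASE;
        s2 %= BASE;
    }
    return (s2 << 16) | s1;
}

/* pattern 0: all zero bytes, 1: all 0xff bytes, 2: pseudo-random bytes */
static void fill(uint8_t *p, size_t n, int pattern)
{
    if (pattern == 0) memset(p, 0x00, n);
    else if (pattern == 1) memset(p, 0xff, n);
    else for (size_t i = 0; i < n; i++) p[i] = (uint8_t)(rand() & 0xff);
}

/* Returns the number of (pattern, alignment, len, adler) cases where
 * fn disagrees with the reference, or ~0u if allocation fails. */
static unsigned count_mismatches(adler_fn fn)
{
    static const size_t lens[] = { 0, 1, 2, 3, 4, 5, 7, 8, 15, 16, 17, 63, 64,
        TEST_VNMAX - 1, TEST_VNMAX, TEST_VNMAX + 1, 4 * TEST_VNMAX + 3 };
    static const uint32_t seeds[] = { 0u, 1u, 0xffffffffu, 0x12345678u };
    const size_t max_len = 4 * TEST_VNMAX + 3;
    uint8_t *mem = malloc(max_len + MAX_ALIGN);
    unsigned bad = 0;

    if (!mem) return ~0u;
    srand(1);  /* fixed seed so any failure is reproducible */
    for (int pattern = 0; pattern < 3; pattern++)
        for (size_t off = 0; off < MAX_ALIGN; off++)
            for (size_t l = 0; l < sizeof lens / sizeof lens[0]; l++) {
                fill(mem + off, lens[l], pattern);
                for (size_t s = 0; s < sizeof seeds / sizeof seeds[0]; s++)
                    if (fn(seeds[s], mem + off, lens[l]) !=
                        adler32_ref(seeds[s], mem + off, lens[l]))
                        bad++;
            }
    free(mem);
    return bad;
}
```

To check the real routine, one would pass a small wrapper around adler32_vec
to count_mismatches in place of adler32_candidate and expect a return of 0.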