[Zlib-devel] [7/6][RFC V2.1 Patch] Blackfin implementation

Jan Seiffert kaffeemonster at googlemail.com
Fri Apr 8 15:21:44 EDT 2011


2011/4/8 Mike Frysinger <vapier at gentoo.org>:
> the focus is on the lsetup part sections, and the main one looks pretty
> concentrated as it is ...
>

After some hours of gardening I had another idea: not doing it SIMD
style, but thinking more like a DSP. We have two MACs, so we get the
multiply for free, so something like:

n = NMAX
for (i = 0; i < NMAX; i++)
  A1 += byte[i] * 1      /* plain sum */
  A0 += byte[i] * n--    /* weighted sum */

only with BYTEUNPACK and two copies of n, to mask the stalls from the n--.

This should fit in the register set, but I have often thought that, and then...
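
For reference, a minimal sketch of the weighted-sum idea in plain C, not
Blackfin assembly. The names adler32_block_mac, adler32_mac and
adler32_bytewise are made up for illustration; BASE and NMAX are the values
zlib's adler32 uses. It only shows how the two accumulators from the loop
above fold back into the usual s1/s2 pair. BYTEUNPACK, the duplicated n and
the actual MAC scheduling are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u /* largest prime below 65536, as in zlib */
#define NMAX 5552u  /* zlib's block size between modulo reductions */

/* Hypothetical helper: fold one block of len <= NMAX bytes into s1/s2. */
static void adler32_block_mac(const uint8_t *buf, size_t len,
                              uint32_t *s1, uint32_t *s2)
{
    uint32_t a1 = 0;             /* A1: sum of byte * 1 */
    uint32_t a0 = 0;             /* A0: sum of byte * n-- */
    uint32_t n = (uint32_t)len;  /* weight of the first byte */
    size_t i;

    assert(len <= NMAX);         /* keeps a0 below 2^32 */
    for (i = 0; i < len; i++) {
        a1 += buf[i];
        a0 += (uint32_t)buf[i] * n;
        n--;
    }
    /* The old s1 is counted once per byte in s2; 64 bit avoids overflow. */
    *s2 = (*s2 + (uint32_t)(((uint64_t)*s1 * len + a0) % BASE)) % BASE;
    *s1 = (*s1 + a1) % BASE;
}

/* Hypothetical wrapper: Adler-32 over a whole buffer, block by block. */
static uint32_t adler32_mac(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    while (len > 0) {
        size_t chunk = len < NMAX ? len : NMAX;
        adler32_block_mac(buf, chunk, &s1, &s2);
        buf += chunk;
        len -= chunk;
    }
    return (s2 << 16) | s1;
}

/* Plain byte-at-a-time version, only here for comparison. */
static uint32_t adler32_bytewise(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;
    size_t i;

    for (i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}
```

With n starting at the block length, the first byte gets the largest weight,
which matches how often it is re-added to s2 in the byte-at-a-time loop.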

> the 2nd one does a lot of byte loads, so i wonder if a few more insns but with
> 16bit (or even 32bit) loads would in practice speed things up.

The second loop is the trailer handling, doing 1 to 4 bytes max. So: it
doesn't matter.

> the core often times is running at 5x or 6x the speed of system devices, so any external
> memory i/o is probably going to dominate the stalls.

Yes, probably.
Unfortunately I don't think the CPU can reorder instructions (esp.
loads) so that the next 4 bytes are fetched while the calculation on
the last 4 bytes is still running.
Or am I wrong?
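
If the core does not reorder loads itself, the usual workaround is to do it
by hand: issue the load for the next word before starting the arithmetic on
the current one. A rough sketch in plain C, assuming a made-up helper
sum_bytes_pipelined that only computes the plain byte sum. Whether this
actually hides the memory latency on Blackfin is the open question here, and
a compiler is free to schedule the loads differently anyway.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte sum with the next 4-byte load issued early. */
static uint32_t sum_bytes_pipelined(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    uint32_t cur = 0, next = 0;
    size_t words = len / 4;
    size_t i;

    if (words > 0)
        memcpy(&cur, buf, 4);               /* prime the pipeline */

    for (i = 0; i < words; i++) {
        if (i + 1 < words)
            memcpy(&next, buf + 4 * (i + 1), 4); /* fetch next word first */

        /* Work on the current word; byte order does not matter for a sum. */
        sum += (cur & 0xffu) + ((cur >> 8) & 0xffu)
             + ((cur >> 16) & 0xffu) + (cur >> 24);

        cur = next;
    }

    /* Trailer: the remaining 0 to 3 bytes, one at a time. */
    for (i = words * 4; i < len; i++)
        sum += buf[i];

    return sum;
}
```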

>  most Blackfin parts have a 16bit bus to external memory, some have 32bit buses, but none have 8bit ;).
>
> i'll try bouncing the code around internally to see if i can get any tips from
> people who spend their days optimizing.  that certainly isnt me.

Someone who eats CPU-pipelines for breakfast is very welcome ;)

> -mike
>

Greetings
Jan


-- 
Murphy's Law of Combat
Rule #3: "Never forget that your weapon was manufactured by the
lowest bidder"



