[Zlib-devel] [7/6][RFC V2.1 Patch] Blackfin implementation
Jan Seiffert
kaffeemonster at googlemail.com
Fri Apr 8 15:21:44 EDT 2011
2011/4/8 Mike Frysinger <vapier at gentoo.org>:
> the focus is on the lsetup part sections, and the main one looks pretty
> concentrated as it is ...
>
After an hour or so of gardening i had another idea: not doing it SIMD
style, but thinking more like a DSP. We have two MACs, so we get the
multiply for free. Something like:
    n = NMAX
    for (i = 0; i < NMAX; i++) {
        A1 += byte[i] * 1;
        A0 += byte[i] * n--;
    }
only with BYTEUNPACK and two copies of n to mask the stalls from the n--.
This should fit in the register set, but i often thought that, and then...
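
A minimal sketch in portable C of why the two-accumulator form gives the same result as the usual byte-at-a-time Adler-32 update. This is not the Blackfin code: the names adler32_ref and adler32_mac are made up for illustration, a1/a0 only stand in for the A1/A0 accumulators, and nothing here models BYTEUNPACK or the real MAC units. Per block of at most NMAX bytes, the plain sum goes into a1 and the sum weighted by a falling n goes into a0; the old s1 contributes blk * s1 to s2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u /* largest prime below 65536 */
#define NMAX 5552   /* longest block for which the 32-bit sums cannot overflow */

/* Reference: classic byte-at-a-time Adler-32 update. */
static uint32_t adler32_ref(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}

/* MAC style: two independent multiply-accumulates per byte (hypothetical helper). */
static uint32_t adler32_mac(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;
    while (len > 0) {
        size_t blk = len < NMAX ? len : NMAX;
        assert(blk <= NMAX);      /* keeps a0 below 2^32 */
        uint32_t n = (uint32_t)blk;
        uint32_t a1 = 0;          /* stands in for A1: sum of byte * 1 */
        uint32_t a0 = 0;          /* stands in for A0: sum of byte * n-- */
        for (size_t i = 0; i < blk; i++) {
            a1 += buf[i] * 1u;
            a0 += buf[i] * n--;
        }
        /* Each of the blk steps would have added the old s1 to s2 once. */
        uint32_t carry = (uint32_t)(((uint64_t)blk * s1) % BASE);
        s2 = (s2 + carry + a0 % BASE) % BASE;
        s1 = (s1 + a1) % BASE;
        buf += blk;
        len -= blk;
    }
    return (s2 << 16) | s1;
}
```

Calling adler32_mac(1, buf, len) should return the same value as adler32_ref(1, buf, len) for any input, since the first byte of a block is counted blk times in s2, the second blk - 1 times, and so on.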
> the 2nd one does a lot of byte loads, so i wonder if a few more insns but with
> 16bit (or even 32bit) loads would in practice speed things up.
The second loop is the trailer handling, doing 1 to 4 bytes max. So: it
doesn't matter.
> the core often times is running at 5x or 6x the speed of system devices, so any external
> memory i/o is probably going to dominate the stalls.
Yes, probably.
Unfortunately i don't think the CPU can reorder instructions (esp.
loads), which would let it fetch the next 4 bytes while doing the
calculation on the last 4 bytes.
Or am i wrong?
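
If the core really does issue in order, the overlap can at least be written out by hand in program order. A rough sketch in plain C, assuming a simple byte sum as the "calculation"; load32, sum4 and byte_sum_pipelined are hypothetical names, and whether this actually hides any memory latency on Blackfin is untested here (a compiler is also free to reschedule it).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load 4 bytes without alignment assumptions (hypothetical helper). */
static uint32_t load32(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof(w));
    return w;
}

/* Sum the four bytes of a word; the result does not depend on endianness. */
static uint32_t sum4(uint32_t w)
{
    return (w & 0xff) + ((w >> 8) & 0xff) + ((w >> 16) & 0xff) + (w >> 24);
}

/* Byte sum with the next load issued before the current word is processed. */
static uint32_t byte_sum_pipelined(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t words = len / 4;
    if (words > 0) {
        uint32_t cur = load32(buf);                  /* prime the pipeline */
        for (size_t i = 1; i < words; i++) {
            uint32_t next = load32(buf + 4 * i);     /* start the next fetch */
            sum += sum4(cur);                        /* work on the previous word */
            cur = next;
        }
        sum += sum4(cur);                            /* drain the last word */
    }
    for (size_t i = words * 4; i < len; i++)         /* leftover bytes */
        sum += buf[i];
    return sum;
}
```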
> most Blackfin parts have a 16bit bus to external memory, some have 32bit buses, but none have 8bit ;).
>
> i'll try bouncing the code around internally to see if i can get any tips from
> people who spend their days optimizing. that certainly isnt me.
Someone who eats CPU-pipelines for breakfast is very welcome ;)
> -mike
>
Greetings
Jan
--
Murphy's Law of Combat
Rule #3: "Never forget that your weapon was manufactured by the
lowest bidder"