[Zlib-devel] Proposed code change
Steve Snyder
swsnyder at insightbb.com
Sat Oct 28 11:04:09 EDT 2006
On Saturday 28 October 2006 10:25 am, Mark Brown wrote:
> On Sat, Oct 28, 2006 at 10:04:02AM -0400, Steve Snyder wrote:
> > The improvement comes from copying the data in multiple bytes rather
> > than one byte at a time. The original code is a looped "*out++ =
> > *in++" meaning 2 memory access and 2 pointer increments for each byte
> > copied. The patch attempts to copy the data in 32-bit blocks,
> > falling back to the original code if the 32-bit copy is not
> > practical.
>
> Have you benchmarked using memcpy() instead? Compilers are often able
> to optimise that and will tend to have an easier job taking advantage
> of whatever platform specific tricks are available.
Yes, I've looked at memcpy. It's not practical given the small lengths of
data dealt with in inflate_fast(). Though my patch isn't x86-centric,
the following explanation is.
Both the Microsoft and GNU compilers can generate memcpy() as either
inline code or as a function call.
The advantage of the inline code is the avoidance of stack handling for
passing parameters. The disadvantage is that it does no alignment
checking, blindly copying (len/4) 32-bit values then the remaining
(len%4) bytes, even if (len%4 == 0). The alignment is important because
performance is our goal and there is a performance penalty for writing
data to addresses that are non-aligned to data size. That's bad. What
is worse is that the inline memcpy() does not check for overlapping
source and destination addresses. That is very very bad.
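To make the overlap hazard concrete, here is a rough sketch of the shape of
code an inlined memcpy() expands to, next to the original byte loop. The
names naive_word_copy() and byte_copy() are made up for illustration and are
not taken from zlib or from any compiler's output. The fixed 4-byte memcpy()
calls are only a portable way to write an unaligned 32-bit load and store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an inlined memcpy(): (len/4) 32-bit moves,
 * then (len%4) single bytes, with no alignment or overlap checks. */
static void naive_word_copy(unsigned char *out, const unsigned char *in,
                            size_t len)
{
    size_t words = len / 4;
    size_t tail = len % 4;

    while (words--) {
        uint32_t w;
        memcpy(&w, in, 4);      /* 32-bit load from the source */
        memcpy(out, &w, 4);     /* 32-bit store to the destination */
        in += 4;
        out += 4;
    }
    while (tail--)
        *out++ = *in++;
}

/* The original one-byte-at-a-time loop. */
static void byte_copy(unsigned char *out, const unsigned char *in,
                      size_t len)
{
    assert(out != NULL && in != NULL);
    while (len--)
        *out++ = *in++;
}
```

With a buffer that starts with 'A' followed by zeros, byte_copy(buf + 1,
buf, 8) fills the next eight bytes with 'A', because each byte written is
read back as source on the next step. naive_word_copy(buf + 1, buf, 8) reads
four source bytes before any of them have been written, so only buf[1]
becomes 'A' and the rest stay zero.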
(I discovered much to my chagrin that inflate_fast() will often have
source and destination addresses that are only separated by a single
byte. That means that writing 16- or 32-bit destination data will step
on the source data. That's why my proposed code change checks the
address ranges before attempting multi-byte copying.)
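A minimal sketch of that kind of guard, assuming the copy is described by a
back-distance and a length. This is not the proposed patch itself;
copy_match() and its parameters are hypothetical names, and the real code may
check the ranges differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes from (out - dist) to out, giving the same result as the
 * byte loop. 32-bit blocks are used only when source and destination are at
 * least 4 bytes apart, so a 4-byte store can never land on bytes that the
 * matching 4-byte load was supposed to see in their updated form. */
static void copy_match(unsigned char *out, size_t dist, size_t len)
{
    const unsigned char *in = out - dist;

    assert(dist > 0);
    if (dist >= 4) {
        while (len >= 4) {
            uint32_t w;
            memcpy(&w, in, 4);  /* unaligned-safe 32-bit load */
            memcpy(out, &w, 4); /* unaligned-safe 32-bit store */
            in += 4;
            out += 4;
            len -= 4;
        }
    }
    /* Short distances and any leftover bytes use the original loop. */
    while (len--)
        *out++ = *in++;
}
```

For example, with "abcd" at the start of a buffer, copy_match(buf + 4, 4, 8)
takes the 32-bit path and extends the buffer to "abcdabcdabcd", while a
distance of 1 falls through to the byte loop and repeats the single source
byte.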
Calling memcpy() as a function is a performance killer because it involves
pushing 3 params onto the stack by inflate_fast() then pulling them off
and moving them into registers within memcpy(). The function (both MS's
and GNU's) is smart enough about x86 architecture to align the addresses,
and to check for overlaps, but imposes a lot of overhead for what is
typically a very small amount of data to copy. There may be specialty
cases in which long data lengths are typically output, but in general use
the data lengths are too small to justify the overhead of calling a
function.
Thus my avoidance of memcpy().