[Zlib-devel] Proposed code change

Sat Oct 28 11:04:09 EDT 2006

On Saturday 28 October 2006 10:25 am, Mark Brown wrote:
> On Sat, Oct 28, 2006 at 10:04:02AM -0400, Steve Snyder wrote:
> > The improvement comes from copying the data in multiple bytes rather
> > than one byte at a time.  The original code is a looped "*out++ =
> > *in++" meaning 2 memory access and 2 pointer increments for each byte
> > copied.  The patch attempts to copy the data in 32-bit blocks,
> > falling back to the original code if the 32-bit copy is not
> > practical.
>
> Have you benchmarked using memcpy() instead?  Compilers are often able
> to optimise that and will tend to have an easier job taking advantage
> of whatever platform specific tricks are available.

Yes, I've looked at memcpy.  It's not practical given the small lengths of 
data dealt with in inflate_fast().  Though my patch isn't x86-centric, 
the following explanation is.

Both the Microsoft and GNU compilers can generate memcpy() as either 
inline code or as a function call.

The advantage of the inline code is that it avoids the stack handling 
needed to pass parameters.  The disadvantage is that it does no alignment 
checking, blindly copying (len/4) 32-bit values and then the remaining 
(len%4) bytes, even if (len%4 == 0).  The alignment is important because 
performance is our goal and there is a performance penalty for writing 
data to addresses that are not aligned to the data size.  That's bad.  
What is worse is that the inline memcpy() does not check for overlapping 
source and destination addresses.  That is very very bad.
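
To make that concrete, here is a rough C sketch of the copy pattern I'm 
describing.  It is not the code either compiler actually emits, and the 
name copy_words_then_bytes() is made up for illustration.  It only 
mirrors the shape: (len/4) 32-bit moves, then (len%4) single bytes, with 
no alignment or overlap checks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an inlined memcpy(): copies len/4 32-bit
   words, then len%4 trailing bytes.  No alignment check, no overlap
   check.  The fixed-size memcpy() calls are only a portable way to
   express one 32-bit load and one 32-bit store in C. */
static void copy_words_then_bytes(unsigned char *dst,
                                  const unsigned char *src, size_t len)
{
    size_t words = len / 4;
    size_t tail = len % 4;

    while (words--) {
        uint32_t w;
        memcpy(&w, src, sizeof w);   /* 32-bit load  */
        memcpy(dst, &w, sizeof w);   /* 32-bit store */
        src += sizeof w;
        dst += sizeof w;
    }
    while (tail--)                   /* remaining len%4 bytes */
        *dst++ = *src++;
}
```

With separate buffers this gives the same result as a byte loop.  With 
dst == src + 1 it does not: each 32-bit load picks up bytes that the 
byte loop would already have overwritten.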

(I discovered much to my chagrin that inflate_fast() will often have 
source and destination addresses that are only separated by a single 
byte.  That means that writing 16- or 32-bit destination data will step 
on the source data.  That's why my proposed code change checks the 
address ranges before attempting multi-byte copying.)
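
A minimal sketch of that kind of range check follows.  This is not the 
patch itself; copy_bytes() and copy_guarded() are names invented for 
this example, and the 4-byte threshold assumes 32-bit copies only.  The 
idea is to use multi-byte copies only when the two pointers are at least 
one word apart, and otherwise fall back to the original byte loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The original "*out++ = *in++" loop.  Safe when out is just ahead of
   in, because each byte is stored before it is read again. */
static void copy_bytes(unsigned char *out, const unsigned char *in,
                       size_t len)
{
    while (len--)
        *out++ = *in++;
}

/* Hypothetical guarded copy: 32-bit chunks only if the addresses are
   at least 4 bytes apart, then the byte loop for whatever is left. */
static void copy_guarded(unsigned char *out, const unsigned char *in,
                         size_t len)
{
    uintptr_t a = (uintptr_t)out;
    uintptr_t b = (uintptr_t)in;
    uintptr_t dist = (a > b) ? a - b : b - a;

    if (dist >= sizeof(uint32_t)) {
        while (len >= sizeof(uint32_t)) {
            uint32_t w;
            memcpy(&w, in, sizeof w);    /* 32-bit load  */
            memcpy(out, &w, sizeof w);   /* 32-bit store */
            in += sizeof w;
            out += sizeof w;
            len -= sizeof w;
        }
    }
    copy_bytes(out, in, len);            /* tail, or the whole copy */
}
```

For example, with a buffer that starts with 'a' and out == in + 1, the 
guard fails and the byte loop replicates the 'a' across the output, 
which is the result inflate_fast() depends on.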

Calling memcpy() as a function is a performance killer because it involves 
inflate_fast() pushing 3 params onto the stack, then memcpy() pulling them 
off and moving them into registers.  The function (both MS's and GNU's) 
is smart enough about x86 architecture to align the addresses, and to 
check for overlaps, but imposes a lot of overhead for what is typically a 
very small amount of data to copy.  There may be special cases in which 
long data lengths are typically output, but in general use the data 
lengths are too small to justify the overhead of calling a function.

Thus my avoidance of memcpy().