[Zlib-devel] [PATCH] deflate.c: identify slide_Pos() for later optimization

Mon Jul 23 20:03:43 EDT 2012

Modern "multimedia" vector (SIMD) hardware instructions can speed up deflate().
For higher-end x86* CPUs the speedup might be 2% to 3% of total CPU time.
On a slower CPU, or where the compiler and instruction decoder together
suffer longer latency after a branch (such as gcc on some PowerPC chips),
the improvement might be 5% to 8%.

The attached patch introduces a new subroutine slide_Pos() in deflate.c
which isolates the operation that is subject to optimization.
The opportunity arises when sliding the window.  The vectors head[]
and prev[] of substring indices are adjusted using saturating subtraction:
the window size is subtracted from each entry, and any entry that would
go below zero becomes zero (NIL) instead.
A very good compiler should be able to recognize and vectorize the operation
from the patched source.  If not, then any compiler that can inline a local
subroutine should give code that is no worse than the unmodified version.
A compiler that does not inline slide_Pos might introduce a penalty
approximately equal to the cost of two internal subroutine calls.
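
For readers without the attachment at hand, here is a minimal sketch of the
kind of scalar loop being isolated.  It is not the code from the patch: the
name slide_pos_sketch, its parameter list, and the forward loop are
illustrative assumptions, and Pos and NIL are redeclared here only so that
the fragment stands alone.

```c
#include <stddef.h>

typedef unsigned short Pos;   /* assumed: 16-bit index type, as in deflate.c */
#define NIL 0                 /* assumed: 0 marks "no entry" */

/* Hypothetical sketch: slide n table entries down by wsize.
 * Entries smaller than wsize would go negative, so they become NIL. */
static void slide_pos_sketch(Pos *p, size_t n, unsigned wsize)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned m = p[i];
        p[i] = (Pos)(m >= wsize ? m - wsize : NIL);
    }
}
```

Under these assumptions the routine would be called once for head[] and once
for prev[] each time the window slides, which is where the figure of two
subroutine calls comes from if the compiler does not inline it.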

If there is interest, then I will follow with assembly-language versions
of slide_Pos for i686/x86_64 (with runtime selection among several variants
according to actual hardware capabilities), PowerPC AltiVec (compile-time
selection) and ARM NEON (compile-time selection).
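
To give an idea of what a vectorized variant might look like, here is a
sketch using SSE2 intrinsics in C rather than assembly language.  It is an
illustration only, not one of the proposed versions: the name
slide_pos_vec_sketch is hypothetical, the choice is made at compile time
through __SSE2__ (no runtime selection is shown), and a scalar loop handles
the tail and any build without SSE2.  The instruction behind _mm_subs_epu16
performs unsigned 16-bit saturating subtraction, which matches the
operation when NIL is zero.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>        /* SSE2 intrinsics */
#endif

typedef unsigned short Pos;   /* assumed: 16-bit index type, as in deflate.c */

/* Hypothetical sketch: subtract wsize from n entries, saturating at 0. */
static void slide_pos_vec_sketch(Pos *p, size_t n, unsigned wsize)
{
    size_t i = 0;

#if defined(__SSE2__)
    /* Broadcast wsize into all eight 16-bit lanes. */
    const __m128i w = _mm_set1_epi16((short)wsize);

    /* Eight entries per iteration; unaligned loads and stores keep the
     * sketch independent of how the tables were allocated. */
    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        v = _mm_subs_epu16(v, w);
        _mm_storeu_si128((__m128i *)(p + i), v);
    }
#endif

    /* Scalar tail, and the whole job when SSE2 is not available. */
    for (; i < n; i++) {
        unsigned m = p[i];
        p[i] = (Pos)(m >= wsize ? m - wsize : 0);
    }
}
```

No branch remains inside the vector loop, which is the property that matters
on CPUs with a long latency after a branch.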

-- 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-slide_Pos-identify-for-future-optimization.patch
Type: text/x-patch
Size: 2352 bytes
Desc: not available
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20120723/dd3573a9/attachment.bin>