[Zlib-devel] Deflate is dead. Long live... (was: Google's version of zlib code)

Sun Mar 3 01:32:23 EST 2013

I can only agree.
One of the problems with Deflate is that it is byte-oriented, and its 32 KiB search window looks somewhat limited nowadays.
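To make the window limit concrete, here is a minimal Python sketch (my own illustration, not taken from the zlib sources). It repeats a block of pseudo-random bytes at a distance larger than 32 KiB, so Deflate cannot point back at the first copy, while LZMA with its default, much larger dictionary can. The block size and the seed are arbitrary assumptions.

```python
import lzma
import random
import zlib

def repeated_block(block_size=40_000, seed=0):
    """Return an incompressible block followed by an exact copy of itself."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    return block + block

data = repeated_block()
deflate_size = len(zlib.compress(data, 9))
lzma_size = len(lzma.compress(data))

# The copy starts 40,000 bytes after the original, which is beyond
# Deflate's 32 KiB window, so zlib has to encode both halves in full.
print("input:", len(data), "deflate:", deflate_size, "lzma:", lzma_size)
```

With a block smaller than the window (say 20,000 bytes) the gap between the two should mostly disappear, since Deflate can then see the first copy too.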
HTML5 recommends only the UTF-8 encoding. In UTF-8, Unicode "characters" (code points) take between 1 and 4 bytes. 4-byte sequences are pretty rare, since those code points lie outside the BMP and cover mostly dead scripts like Egyptian hieroglyphs or rare Han characters, but 3-byte sequences are needed for practically all the scripts used in India and Asia, and that's roughly half of the world's population.
The lead octet is either an ASCII character or a value between 0xC2 and 0xF4, and the continuation bytes are exclusively in the 0x80-0xBF range (only 64 values, 6 bits). This means that the lead bytes and the continuation bytes could benefit from having independent entropy encodings (perhaps separate Huffman tables) or from being predicted based on context.
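A rough way to check the "separate tables" idea is sketched below in Python. It splits a UTF-8 byte string into lead bytes and continuation bytes and compares the order-0 entropy of one shared table against two independent ones. The function names and the sample text are my own assumptions, the input is assumed to be valid UTF-8, and real Huffman tables would add header overhead that this ignores.

```python
import math
from collections import Counter

def split_utf8(data: bytes):
    """Split UTF-8 bytes into (lead bytes, continuation bytes)."""
    lead = bytearray()
    cont = bytearray()
    for b in data:
        if 0x80 <= b <= 0xBF:   # continuation byte: 10xxxxxx
            cont.append(b)
        else:                   # ASCII or the lead byte of a sequence
            lead.append(b)
    return bytes(lead), bytes(cont)

def entropy_bits(data: bytes) -> float:
    """Total order-0 Shannon entropy of data, in bits."""
    counts = Counter(data)
    n = len(data)
    return -sum(c * math.log2(c / n) for c in counts.values())

sample = "<p>नमस्ते दुनिया</p> <p>こんにちは世界</p>".encode("utf-8")
lead, cont = split_utf8(sample)
mixed = entropy_bits(sample)
separate = entropy_bits(lead) + entropy_bits(cont)
print(f"one table: {mixed:.0f} bits, two tables: {separate:.0f} bits")
```

The two-table total can never exceed the one-table total here, because the class of a byte follows from its value. The saving is real only because a decoder already knows from each lead byte how many continuation bytes follow, so no extra bits are needed to say which table applies.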

-------------- next part --------------
A non-text attachment was scrubbed...
Name: UTF-8.png
Type: image/png
Size: 88260 bytes
Desc: not available
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20130303/540c6dd2/attachment.png>
-------------- next part --------------

I've found a few papers about Unicode compression (SCSU, BOCU-1). They tend to show that bzip2 does a better job than gzip whatever the primary text encoding was (UTF-8, UTF-16, ...), but they test against texts in a single language that are usually bigger than typical web files. HTML is more or less a mixed bag of markup (English-tinted) and display text that could be in an entirely different script. SVG files are another beast, since they hold a lot of numeric figures compared to ordinary text.
Anyway, compared to gzip, bzip2 is still slower and needs more memory (which could still be a problem on handhelds), and the fact that it works much better on big chunks of data would introduce a latency problem for on-the-fly compression (mod_deflate). It's practically the same for other compression algorithms like LZMA (.xz).
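Anyone who wants to check the chunk-size effect on their own pages could start from something like the following Python sketch. It only measures compressed sizes (not speed or memory), the markup is a synthetic stand-in I made up, and the 512-byte cut is an arbitrary assumption for "a small chunk".

```python
import bz2
import lzma
import zlib

def compressed_sizes(data: bytes) -> dict:
    """Compressed size in bytes for each codec, at high compression settings."""
    return {
        "deflate": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }

# Synthetic, highly repetitive markup; real pages will behave differently.
page = ('<li><a href="/item">item</a></li>\n' * 50).encode("utf-8")

for label, chunk in (("small", page[:512]), ("whole", page)):
    print(label, len(chunk), compressed_sizes(chunk))
```

Feeding it real HTML or SVG files of different sizes would be more telling than this toy input.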

Perhaps the web needs its own compression algorithm, one that is both fast and efficient without gobbling megabytes of memory: something middle-of-the-road between Deflate and LZMA, with a twist of flexibility to allow very fast, low-latency compression of dynamic content.
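For reference, this is roughly what low-latency compression of dynamic content looks like with Deflate today, sketched with Python's zlib module. It is only a baseline illustration, not a description of how mod_deflate is implemented; the helper name and the chunks are hypothetical.

```python
import zlib

def stream_chunks(chunks):
    """Yield one compressed piece per input chunk, decodable on arrival."""
    comp = zlib.compressobj(6)
    for chunk in chunks:
        # Z_SYNC_FLUSH pushes out all pending output and ends on a byte
        # boundary, so the receiver can decode this chunk immediately.
        yield comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
    yield comp.flush()  # finishes the stream and writes the trailer

chunks = [b"<html><body>", b"<p>dynamic content</p>", b"</body></html>"]
decomp = zlib.decompressobj()
received = [decomp.decompress(piece) for piece in stream_chunks(chunks)]
print(received)
```

Each flush costs a few extra bytes and cuts the encoder's current block short, which is the usual price for sending data as soon as it is produced.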

http://www.unicode.org/faq/compression.html
http://cs.fit.edu/media/TechnicalReports/cs-2002-10.pdf

Regards
-- 
Frédéric Kayser

On 3 March 2013 at 06:07, Mark Adler wrote:

> On Mar 2, 2013, at 7:54 AM, Nelson H. F. Beebe <beebe at math.utah.edu> wrote:
>> 	Google publishes Zopfli as open-source compression algorithm to speed up Web download
>> 	Compression is about 100 times slower than conventional methods but compresses about 5% better,
> 
> That's cool, but it seems like an awful lot of effort for a small improvement.  Perhaps it's time to add a better compression method to HTTP's accept-encoding.
> 
> Mark
> 
> 
> _______________________________________________
> Zlib-devel mailing list
> Zlib-devel at madler.net
> http://mail.madler.net/mailman/listinfo/zlib-devel_madler.net