[Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?

Fri Apr 16 19:07:23 EDT 2010

From: Ralf Junker
>For deflateSetDictionary(), the zlib documentation states that "the 
>dictionary should consist of strings (byte sequences) that are likely to 
>be encountered later in the data to be compressed".

As I understand the code (but I *may* have got this very wrong - Mark, can you confirm this explanation?), the dictionary just behaves as though it were prepended to the data to be compressed.  I.e. the effect is as though the data stream was prefixed by the dictionary, the LZ stream was reset immediately after the dictionary (Z_FULL_FLUSH) and only the part of the stream after this is output.  (Specifically, codes corresponding to the dictionary are not generated/stored.)
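A minimal sketch of that "invisible prefix" behaviour, using Python's zlib module rather than the C API (CPython hands the zdict argument to deflateSetDictionary/inflateSetDictionary).  The dictionary and page contents here are made up for illustration.

```python
import zlib

# Hypothetical boilerplate and page, invented for this example.
dictionary = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
              b'<title></title></head><body></body></html>')
page = (b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>Hi</title>'
        b'</head><body><p>Hello, world</p></body></html>')

# Compressor primed with the dictionary.
comp = zlib.compressobj(level=9, zdict=dictionary)
primed = comp.compress(page) + comp.flush()

# Same data with no dictionary, for comparison.
plain = zlib.compress(page, 9)

# The decompressor must be given the identical dictionary ...
decomp = zlib.decompressobj(zdict=dictionary)
restored = decomp.decompress(primed) + decomp.flush()

# ... because the dictionary bytes themselves are not stored in the stream.
try:
    zlib.decompress(primed)
    needs_dict = False
except zlib.error:
    needs_dict = True

print(len(page), len(plain), len(primed), needs_dict)
```

The primed stream only carries back-references into the dictionary, which is why it cannot be inflated without it.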

So far as I can see the dictionary can only have an effect on the first window-size bytes of the uncompressed data, because after that the dictionary bytes are no longer visible to the compression algorithm.
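One way to check the window claim, again as a Python sketch: put the same boilerplate either at the very start of the data or after more than 32K of unrelated filler, and compare how much the dictionary saves in each case.  The filler (pseudo-random decimal numbers) and the helper name deflated_size are my own inventions for the experiment.

```python
import zlib

def deflated_size(data, zdict=None):
    """Length of the zlib stream for data, optionally primed with zdict."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(data) + comp.flush())

boilerplate = b'<html><head><meta charset="utf-8"><title></title></head><body>'

# Deterministic filler that shares no substrings with the boilerplate and is
# longer than the 32K deflate window.
filler = b",".join(str(i * 7919 % 100003).encode() for i in range(8000))
assert len(filler) > 32768

near = boilerplate + filler   # boilerplate within reach of the dictionary
far = filler + boilerplate    # boilerplate more than a window away from it

near_saving = deflated_size(near) - deflated_size(near, boilerplate)
far_saving = deflated_size(far) - deflated_size(far, boilerplate)
print("saving near start:", near_saving)
print("saving past window:", far_saving)
```

If the explanation is right, the second figure should be no better than zero, since a zlib stream with a preset dictionary also carries a 4-byte dictionary checksum in its header.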

If I've got this all correct there are a few things that should help with HTML:

1) HTML files all start the same way.  The most obvious thing to put, right at the *start* of the dictionary, is that block starting "<html..."; that's an instant saving of 20 or more bytes.
2) The dictionary contents should mimic the layout of an HTML page, with the strings that occur at the end of a page placed at the end of the dictionary (because the end of the dictionary is the last part to slide out of the window.)
3) The HTML should definitely be pre-filtered to remove random stuff like repeated white space.  (This is where XML is easier than HTML - white space is better understood in XML.)
4) Perhaps contrary to expectations it is worth putting in strings that occur once and only once, so long as they occur at the start of the file (<HEAD> and the start of <BODY>), because then they can be completely removed from the deflate data.

Anyway, the easiest way to generate a preset dictionary is to take the application that generates the HTML you compress and get it to generate an *empty* HTML page, then just use that (pre-filtered of course.)
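A rough sketch of that recipe in Python.  The "empty page" below is hypothetical, and the pre-filter is deliberately naive: it collapses every run of white space to one space, which would damage <pre> blocks, so treat it as a placeholder for a real regularizer.

```python
import re
import zlib

def prefilter(html: bytes) -> bytes:
    """Collapse each run of white space to a single space (naive)."""
    return re.sub(rb"\s+", b" ", html).strip()

# Hypothetical empty page, as the generating application might emit it.
empty_page = b"""<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title></title>
  </head>
  <body>
  </body>
</html>
"""

# The preset dictionary is simply the pre-filtered empty page.
dictionary = prefilter(empty_page)

def compress_page(html: bytes) -> bytes:
    """Pre-filter a page, then deflate it against the preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(prefilter(html)) + comp.flush()

def decompress_page(blob: bytes) -> bytes:
    """Inflate a page produced by compress_page (same dictionary needed)."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()
```

Note that the round trip returns the pre-filtered page, not the original bytes, so this only suits cases where the white space really is insignificant.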

If the HTML comes from multiple sources, and if you want to use only one dictionary (I'd use one for HTML and one for CSS), you definitely want to run the HTML (and CSS) through a regularizer/pretty-printer.

The ideal dictionary is a template into which the meat of the HTML page is inserted; every run of characters from the template can then be replaced by a single code in the output.  In fact, if I have this right, the result should be just as efficient as an HTML-specific encoding that maps tags (etc.) to single codes (just so long as each HTML page fits in the window!)
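To get a feel for the template idea, the sketch below (Python again, with an invented template split into head and tail around an invented piece of content) compares three sizes: the whole page with no dictionary, the whole page with the template as dictionary, and the inserted content compressed on its own.

```python
import zlib

def deflated_size(data, zdict=None):
    """Length of the zlib stream for data, optionally primed with zdict."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(data) + comp.flush())

# Hypothetical template: everything before and after the page-specific part.
head = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
        b'<title>Site</title></head><body><div id="main">')
tail = b'</div><div id="footer">Contact us</div></body></html>'
content = b"<h1>News</h1><p>Nothing much happened today.</p>"

template = head + tail           # used as the preset dictionary
page = head + content + tail     # what actually gets compressed

size_plain = deflated_size(page)
size_primed = deflated_size(page, template)
size_content = deflated_size(content)

print("page, no dictionary:", size_plain)
print("page, template as dictionary:", size_primed)
print("content alone:", size_content)
```

If the reasoning above holds, the second figure should sit much closer to the third than to the first, the difference being the two back-references for head and tail plus the dictionary checksum in the header.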

John Bowler <jbowler at acm.org>