[Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?

Fri Apr 16 14:54:39 EDT 2010

For deflateSetDictionary(), the zlib documentation states that "the 
dictionary should consist of strings (byte sequences) that are likely to 
be encountered later in the data to be compressed".
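
For context, here is a minimal sketch of the mechanism I mean. It uses 
Python's zlib binding, where the zdict argument plays the role of 
deflateSetDictionary() and inflateSetDictionary(). The helper names, the 
dictionary bytes and the sample page are made up for illustration only; 
they are not a recommended dictionary.

```python
import zlib


def deflate_with_dictionary(data: bytes, dictionary: bytes) -> bytes:
    """Compress data with a preset dictionary (hypothetical helper)."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()


def inflate_with_dictionary(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress a stream that was produced with the same dictionary."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


# Assumption: a hand-picked dictionary of HTML fragments, for illustration.
html_dictionary = b'</div></td></tr></table><a href="http://<html><head><title>'

# Assumption: a tiny made-up page standing in for "typical data".
sample = (b'<html><head><title>Hi</title></head><body>'
          b'<a href="http://example.org/">x</a></body></html>')

plain = zlib.compress(sample, 9)
primed = deflate_with_dictionary(sample, html_dictionary)

# Compare compressed sizes without and with the preset dictionary.
print(len(sample), len(plain), len(primed))
```

The same dictionary has to be supplied on the decompressing side. Which 
strings belong in it is exactly what I am unsure about.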

To achieve better compression of HTML text, I am looking for 
recommendations on how best to generate a good set of dictionary 
strings from typical data. What kind of "strings" help zlib 
compress best?

* What size of strings? Letter n-grams, single or multiple words, or 
entire sentences?

* Should white space and/or control characters be included?

* Should "real words" be kept separate from "control characters" such 
as <, > and /?

* Does it help to build the strings from only the first n bytes of a 
document, assuming that zlib will catch up with later content automatically?

* Can zlib itself somehow help in creating those strings?

Any suggestions are much appreciated!

Ralf