[Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?
Ralf Junker
ralfjunker at gmx.de
Fri Apr 16 14:54:39 EDT 2010
For deflateSetDictionary(), the zlib documentation states that "the
dictionary should consist of strings (byte sequences) that are likely to
be encountered later in the data to be compressed".
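For reference, the mechanism in question can be sketched with Python's zlib binding, which exposes the preset dictionary through the zdict argument of zlib.compressobj() and zlib.decompressobj(). This is only a minimal sketch: the dictionary contents and the sample document are made up for illustration and are not a recommended dictionary.

```python
import zlib

# Hypothetical dictionary for illustration only. zlib.h advises putting the
# most commonly used strings towards the end of the dictionary.
ZDICT = b'</div></span></a><a href="<div class="<html><head><title></title></head><body>'


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Deflate `data` with `zdict` as the preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Inflate `blob`; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# Made-up sample document.
sample = (b'<html><head><title>Hi</title></head><body>'
          b'<div class="x"><a href="/">home</a></div></body></html>')

plain = zlib.compress(sample, 9)            # no dictionary
primed = compress_with_dict(sample, ZDICT)  # with preset dictionary

assert decompress_with_dict(primed, ZDICT) == sample
print(len(sample), len(plain), len(primed))
```

The dictionary is not stored in the compressed stream, so both sides have to agree on it in advance.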
To achieve better compression of HTML text, I am looking for
recommendations on how best to generate an optimal set of dictionary
strings from typical data. What kind of "strings" help zlib compress
best?
* What size of strings? Letter N-grams, single or multiple words,
entire sentences?
* Should white space and/or control characters be included or not?
* Should "real words" be kept separate from "control characters" like
<>/ etc.?
* Does it help to build the strings from the first n bytes of a document
only, assuming that zlib will catch up with later content automatically?
* Can zlib somehow help in creating those strings?
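One way these questions could be explored empirically is to compress a set of typical documents with each candidate dictionary and compare the total output size. The following is only a rough sketch under that assumption: the helper names (compressed_size, rank_dictionaries), the sample documents and the candidate dictionaries are all hypothetical, and it measures candidates rather than generating them.

```python
import zlib


def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    """Size of `data` after deflate, optionally primed with `zdict`."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def rank_dictionaries(samples, candidates):
    """Return (total_compressed_size, name) pairs, smallest total first."""
    results = []
    for name, zdict in candidates.items():
        total = sum(compressed_size(doc, zdict) for doc in samples)
        results.append((total, name))
    return sorted(results)


# Made-up sample documents standing in for "typical data".
samples = [
    b'<html><head><title>One</title></head><body><p>Hello world</p></body></html>',
    b'<html><head><title>Two</title></head><body><div class="a">Text</div></body></html>',
]

# Hypothetical candidates reflecting the questions above.
candidates = {
    "none": b"",
    "tags_only": b'<p></p><div class="</div><html><head><title></title></head><body>',
    "tags_with_whitespace": b'<p>\n</p>\n<div class="\n</div>\n<html>\n<head>\n<title>',
    "first_64_bytes": samples[0][:64],
}

ranking = rank_dictionaries(samples, candidates)
for total, name in ranking:
    print(name, total)
```

With real data, the candidate dictionaries should be built from documents other than the ones used for measuring, otherwise the comparison is biased in favour of dictionaries copied from the test set.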
Any suggestions are much appreciated!
Ralf