[Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?
Greg Roelofs
newt at pobox.com
Fri Apr 16 21:37:01 EDT 2010
>> Is there any reason you can't just run zlib on some typical HTML
>> files, perhaps concatenated, and dump the strings corresponding to
>> (distance, length) pairs?
> That's exactly what I'd like to do, but I fail to find the zlib API
> function call to dump the strings. Anything I am missing in zlib.h?
Very few libraries provide an API to access their internals; "internal"
is, almost by definition, the antithesis of an API. I was suggesting
adding some printfs at strategic locations in the encode or decode
functions. There may even already be some debug-type stuff in there
that can be enabled with a macro, though I wouldn't count on it.
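
For anyone who would rather not patch zlib itself, the experiment can be roughly approximated outside the library. The sketch below is NOT zlib's matcher (zlib uses hash chains and lazy matching); it is a brute-force greedy LZ77-style scan, written under that assumption, that counts the strings a (distance, length) pair would have copied. The function name `matched_strings` and the sample data are hypothetical, and the scan is only suitable for small samples.

```python
from collections import Counter

WINDOW = 32 * 1024   # deflate's maximum match distance
MIN_MATCH = 3        # deflate's shortest match length
MAX_MATCH = 258      # deflate's longest match length


def matched_strings(data: bytes) -> Counter:
    """Count strings that a greedy LZ77-style scan would emit as matches.

    Brute-force stand-in for illustration; not zlib's actual algorithm.
    """
    counts = Counter()
    pos = 0
    while pos < len(data):
        start = max(0, pos - WINDOW)
        limit = min(MAX_MATCH, len(data) - pos)
        best = 0
        length = MIN_MATCH
        # Grow the candidate until no earlier occurrence exists.
        # The end bound (pos + length - 1) forces the earlier copy to
        # start before pos, while still allowing overlapping matches.
        while length <= limit and data.find(
            data[pos:pos + length], start, pos + length - 1
        ) != -1:
            best = length
            length += 1
        if best:
            counts[data[pos:pos + best]] += 1
            pos += best
        else:
            pos += 1  # literal byte, no match found
    return counts


sample = b"<td>alpha</td><td>beta</td><td>gamma</td>"
counts = matched_strings(sample)
for string, n in counts.most_common():
    print(n, string)
```

Run over a concatenation of typical HTML files, the most frequent and longest entries are candidates for the dictionary, subject to the caveat that the real encoder may choose different matches.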
> What flavor of data dictionary strings does deflateSetDictionary() work
> best with?
As John noted, the dictionary is not much more than a hack to provide
the initial 32 KB sliding window rather than starting from scratch.
It should look similar to the HTML you want to compress, but not
identical; among other things, there's no obvious need for any repeated
elements, although if you have, say, three common flavors of <TABLE ...>
lines, you might want to include all three. It's hard to guess which
strings gzip/zlib will find the most useful without actually trying the
experiment, however. The human eye is good at finding certain types
of patterns, but the code is blind to things like newlines, so it's
not uncommon for it to find better (longer) strings that a human would
miss.
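
As a minimal sketch of what priming the window looks like in practice, here is the same idea through Python's zlib module, whose `zdict` argument sets a preset dictionary on the compressor and decompressor. The dictionary contents, the sample page, and the helper names are made up for illustration; a real dictionary would be built from representative HTML.

```python
import zlib

# Hypothetical dictionary: markup resembling the pages to be compressed.
HTML_DICT = (
    b'<html><head><title></title></head><body>'
    b'<table border="0" cellpadding="0"><tr><td></td></tr></table>'
    b'</body></html>'
)


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Deflate data with the window primed by a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Inflate data; the receiver must supply the very same dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


page = (
    b'<html><head><title>Hi</title></head><body>'
    b'<table border="0" cellpadding="0"><tr><td>x</td></tr></table>'
    b'</body></html>'
)

plain = zlib.compress(page, 9)
primed = compress_with_dict(page, HTML_DICT)

print("no dictionary:  ", len(plain), "bytes")
print("with dictionary:", len(primed), "bytes")
assert decompress_with_dict(primed, HTML_DICT) == page
```

Note that the dictionary is not stored in the compressed stream, so both sides have to agree on it out of band.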
> What kind of compression improvements can I typically expect?
No better than about 32 KB of savings per compressed stream, AFAIK (the
dictionary only stands in for one window's worth of history), and that
assumes you store/transmit your dictionary separately. If your files are all
around that size (give or take a factor of three) and you have to compress
them separately, then you could see some serious benefits. But you'd do even
better simply by
concatenating all of the files into one blob (say, a tar archive) before
compressing.
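
The concatenation point is easy to check with a quick experiment. The sketch below uses synthetic, deliberately similar "files" (an assumption for illustration, as are the helper names) and compares the total size of compressing each one separately against compressing them all as one blob.

```python
import zlib


def separate_total(files: list[bytes]) -> int:
    """Total compressed size when every file gets its own stream."""
    return sum(len(zlib.compress(f, 9)) for f in files)


def concatenated_total(files: list[bytes]) -> int:
    """Compressed size of all files joined into a single blob."""
    return len(zlib.compress(b"".join(files), 9))


# Synthetic stand-ins for a batch of small, similar HTML files.
files = [
    (
        '<html><head><title>Page %d</title></head><body>'
        '<table border="0"><tr><td>row %d</td></tr></table>'
        '</body></html>\n' % (i, i)
    ).encode("ascii")
    for i in range(20)
]

print("separately:  ", separate_total(files), "bytes")
print("concatenated:", concatenated_total(files), "bytes")
```

In the concatenated case each file effectively serves as the dictionary for the ones after it, which is why a preset dictionary mainly pays off when the files must remain independently decompressible.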
This is not a commonly used option; it adds complexity with relatively
little benefit except when compressing a whole bunch of similar, small
files.
Greg