[Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?

Fri Apr 16 16:12:51 EDT 2010

On 16.04.2010 21:29, Greg Roelofs wrote:

>> To achieve better compression of HTML text, I wonder about any
>> recommendations on how an optimal set of dictionary strings is
>> best generated from typical data? What kind of "strings" help
>> zlib to compress best?

> Is there any reason you can't just run zlib on some typical HTML
> files, perhaps concatenated, and dump the strings corresponding to
> (distance, length) pairs?

That's exactly what I'd like to do, but I can't find a zlib API call
that dumps those strings. Am I missing something in zlib.h?

> I suspect that would give you a pretty good idea.  You would want to
> do some statistical analysis on the results (frequencies of
> occurrence, turnover rate, etc.), but there's nothing like actually
> looking at real data to get you started...

I am running my own statistical analysis right now, but so far without
any support from zlib, and I don't know much about its internals.
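
To make the kind of analysis concrete, here is a rough sketch in Python.
It is not anything zlib provides: the function names, the substring length
range and the scoring rule (documents containing the string times its
length) are all assumptions chosen for illustration. It counts substrings
that recur across sample files and concatenates the best ones, placing the
highest-scoring strings last, since zlib.h recommends putting the most
commonly used strings towards the end of the dictionary.

```python
from collections import Counter

def count_substrings(samples, min_len=4, max_len=12):
    """Count, for each substring, how many sample documents contain it."""
    counts = Counter()
    for data in samples:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(data) - n + 1):
                seen.add(data[i:i + n])
        counts.update(seen)  # document frequency, not raw occurrences
    return counts

def build_dictionary(samples, max_size=32768, min_docs=2):
    """Hypothetical heuristic: keep substrings seen in several documents."""
    counts = count_substrings(samples)
    # Assumed score: documents containing the string * string length.
    candidates = [(c * len(s), s) for s, c in counts.items() if c >= min_docs]
    candidates.sort(reverse=True)  # best candidates first
    chosen, size = [], 0
    for _score, s in candidates:
        if any(s in t for t in chosen):
            continue  # already covered by a string we kept
        if size + len(s) > max_size:
            break  # deflate's window is 32 KB, so stop there
        chosen.append(s)
        size += len(s)
    chosen.reverse()  # highest-scoring strings end up last
    return b"".join(chosen)
```

This brute-force counting is only practical for a modest sample set, and it
does not merge overlapping strings, so the result is a starting point rather
than an optimal dictionary.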

What kind of dictionary strings does deflateSetDictionary() work
best with?

What kind of compression improvements can I typically expect?
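
One way to answer this empirically is to compress the same input with and
without a preset dictionary and compare the sizes. The sketch below uses
Python's zlib module, whose zdict argument sets the preset dictionary
(deflateSetDictionary() on the compression side). The helper names, the
sample dictionary and the sample page are made up for illustration. Per
zlib.h, a dictionary is most useful when the data is short and predictable,
so any gain measured on small pages should not be extrapolated to large ones.

```python
import zlib

def deflate_size(data, zdict=None):
    """Return the compressed size of data, optionally with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(data) + comp.flush())

def roundtrip(data, zdict):
    """Compress and decompress with the same dictionary (both sides need it)."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    packed = comp.compress(data) + comp.flush()
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(packed) + decomp.flush()

# Made-up sample dictionary and page, for illustration only.
SAMPLE_DICT = b"<html><head><title></title></head><body></body></html>"
SAMPLE_PAGE = b"<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"

if __name__ == "__main__":
    plain = deflate_size(SAMPLE_PAGE)
    primed = deflate_size(SAMPLE_PAGE, SAMPLE_DICT)
    print("without dictionary:", plain, "bytes")
    print("with dictionary:   ", primed, "bytes")
```

Running this over a representative set of real pages, rather than one toy
string, gives a more honest picture of the improvement.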

Ralf