[Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?

Mark Adler madler at alumni.caltech.edu
Sat Apr 17 11:23:37 EDT 2010


On Apr 16, 2010, at 4:07 PM, John Bowler wrote:
> the dictionary just behaves as though it was prefixed on front of the data to be compressed.
...
> So far as I can see the dictionary can only have an effect on the first window-size bytes of the uncompressed data, because after that the dictionary bytes are no longer visible to the compression algorithm.

Correct.
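
A minimal sketch of the "prefix" behaviour, using Python's standard zlib module rather than the C API. The sample HTML strings and the helper names deflate() and inflate() are made up for illustration; the zdict argument is the module's wrapper around deflateSetDictionary() and inflateSetDictionary().

```python
import zlib

# Hypothetical sample data: a dictionary of boilerplate and a short message
# that shares most of its bytes with that boilerplate.
DICTIONARY = b"<html><head><title></title></head><body></body></html>"
MESSAGE = b"<html><head><title>Vending report</title></head><body>"


def deflate(data, zdict=b""):
    """Compress data, optionally priming the window with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(blob, zdict=b""):
    """Decompress blob; the same dictionary must be supplied if one was used."""
    if zdict:
        decomp = zlib.decompressobj(zdict=zdict)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


without_dict = deflate(MESSAGE)
with_dict = deflate(MESSAGE, DICTIONARY)

print("no dictionary:  ", len(without_dict), "bytes")
print("with dictionary:", len(with_dict), "bytes")
```

The dictionary itself is never transmitted, only matches that point back into it, so the decompressor has to be given exactly the same bytes.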

> 1) HTML files all start the same way.  The most obvious thing to put, right at the *start* of the dictionary, is that block starting "<html..."; that's an instant saving of 20 or more bytes.

Actually, you want to put the strings most likely to be repeated at the end of the dictionary, not the start.  Matches near the end of the dictionary have shorter distances, which are coded in fewer bits.
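
A rough way to see the distance effect, sketched with Python's zlib module. The COMMON string, the 20000-byte random filler, and the helper names are assumptions chosen for the experiment: the same string is placed either at the start or at the end of an otherwise useless dictionary, and the compressed sizes are compared.

```python
import random
import zlib

# Hypothetical "commonly used string" that the data to compress begins with.
COMMON = b'<html><head><meta http-equiv="Content-Type" content="text/html">'


def make_filler(n, seed=0):
    """Deterministic pseudo-random bytes standing in for other dictionary content."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def compressed_size(data, zdict):
    """Size in bytes of data compressed with the given preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(data) + comp.flush())


# Keep the whole dictionary comfortably inside the 32K window.
FILLER = make_filler(20000)

# Common string first: the match sits about 20000 bytes back (long distance).
size_common_first = compressed_size(COMMON, COMMON + FILLER)
# Common string last: the match sits immediately before the data (short distance).
size_common_last = compressed_size(COMMON, FILLER + COMMON)

print("common string at start of dictionary:", size_common_first, "bytes")
print("common string at end of dictionary:  ", size_common_last, "bytes")
```

For a single match the difference is only a few bits, so the saving per file is small; it matters mostly when many small files are compressed against the same dictionary.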

On Apr 16, 2010, at 6:37 PM, Greg Roelofs wrote:
> This is not a commonly used option; it adds complexity with relatively
> little benefit except when compressing a whole bunch of similar, small
> files.

In fact, I know of an application that transmitted vending machine data as text messages and benefited greatly from this.  It used up to the previous 32K of messages (which was many messages) as the dictionary for the next message.  There was a return path for retransmission, so if the receiver lost lock on the dictionary, the message was resent with no dictionary and the process started over.
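
A sketch of how such a rolling-dictionary scheme might look, written with Python's zlib module. This is not that application's code: the DictionaryLink class, the message format, and the reset handling are all hypothetical, and the return path is reduced to both ends calling reset().

```python
import zlib

WINDOW = 32 * 1024  # deflate's maximum window size


class DictionaryLink:
    """One end of a link; both ends keep the same history of past messages."""

    def __init__(self):
        self.history = b""

    def reset(self):
        """Forget the history, e.g. after the receiver lost lock on it."""
        self.history = b""

    def _remember(self, message):
        # Keep only the most recent WINDOW bytes as the next dictionary.
        self.history = (self.history + message)[-WINDOW:]

    def send(self, message):
        """Compress message against the history, then add it to the history."""
        if self.history:
            comp = zlib.compressobj(zdict=self.history)
        else:
            comp = zlib.compressobj()
        packet = comp.compress(message) + comp.flush()
        self._remember(message)
        return packet

    def receive(self, packet):
        """Decompress packet with the same history the sender used."""
        if self.history:
            decomp = zlib.decompressobj(zdict=self.history)
        else:
            decomp = zlib.decompressobj()
        message = decomp.decompress(packet) + decomp.flush()
        self._remember(message)
        return message


# Hypothetical vending machine reports that differ only in a few fields.
messages = [
    b"machine=17 slot=A%d item=cola price=1.25 stock=%d\n" % (i, 20 - i)
    for i in range(5)
]

sender, receiver = DictionaryLink(), DictionaryLink()
packets = [sender.send(m) for m in messages]
received = [receiver.receive(p) for p in packets]

# Simulated loss of lock: both ends start over with no dictionary.
sender.reset()
receiver.reset()
restart_packet = sender.send(messages[0])
restart_message = receiver.receive(restart_packet)

print([len(p) for p in packets], len(restart_packet))
```

The first packet and the packet after a reset carry no dictionary and so are the largest; later packets consist mostly of matches into earlier messages.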

On Apr 17, 2010, at 1:10 AM, Peter Elmer wrote:
> I've been meaning since some time to ask this same question regarding a 
> zlib interface for dictionary discovery.

Actually, you don't need a zlib interface.  You can get the same information from the compressed data itself.  infgen ( http://zlib.net/infgen.c.gz ) will "disassemble" a deflate stream into readable descriptions of the contents.  The matches could perhaps be used to aid in dictionary creation.

Mark




