[Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?

Sat Apr 17 04:10:09 EDT 2010

  Hi,

On Fri, Apr 16, 2010 at 06:37:01PM -0700, Greg Roelofs wrote:
> >> Is there any reason you can't just run zlib on some typical HTML
> >> files, perhaps concatenated, and dump the strings corresponding to
> >> (distance, length) pairs?
> 
> > That's exactly what I'd like to do, but I fail to find the zlib API
> > function call to dump the strings. Anything I am missing in zlib.h?
> 
> Very few libraries provide an API to access their internals; "internal"
> is, almost by definition, the antithesis of an API.  I was suggesting
> adding some printfs at strategic locations in the encode or decode
> functions.  There may even already be some debug-type stuff in there
> that can be enabled with a macro, though I wouldn't count on it.
<...> 
> This is not a commonly used option; it adds complexity with relatively
> little benefit except when compressing a whole bunch of similar, small
> files.

  I've been meaning for some time to ask this same question regarding a 
zlib interface for dictionary discovery. In my field we have a file format 
which is made up of a very large number of very small records, each 
individually compressed with zlib. The compressed records are stored 
together in a file. Later they are read and decompressed individually, with 
random access to the individual records. Before compression, the individual 
records can indeed be fairly small on average: tens to hundreds of kB.

  The data stored in these records is not known in advance; however, groups
of them do in fact contain similar data and are known to be grouped, even 
if they are stored separately. It would not be difficult for us to introduce 
some sort of learning phase into the application "workflow" which produces
these files. The dictionary used for a group of records could also easily be 
stored as just another record type in the same file as a file format 
extension.
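
  For illustration only, a learning phase of this kind might look like the
sketch below. It is a naive frequency count over fixed-length substrings of
some sample records, not what zlib does internally, and the function name
build_dictionary and its parameters are hypothetical. The zlib.h comments
recommend putting the most commonly used strings towards the end of the
dictionary, so the sketch orders its output that way.

```python
from collections import Counter


def build_dictionary(samples, ngram=8, max_size=32768):
    """Naive sketch: collect substrings shared by several sample records.

    samples  -- iterable of bytes objects (uncompressed sample records)
    ngram    -- substring length to count (hypothetical tuning knob)
    max_size -- size cap in bytes; the deflate window is at most 32 KiB
    """
    counts = Counter()
    for record in samples:
        # Count each distinct substring once per record, so that a single
        # very repetitive record does not dominate the statistics.
        seen = {record[i:i + ngram] for i in range(len(record) - ngram + 1)}
        counts.update(seen)

    chosen = []
    size = 0
    for chunk, n in counts.most_common():
        # Stop at substrings seen in only one record, or when full.
        if n < 2 or size + len(chunk) > max_size:
            break
        chosen.append(chunk)
        size += len(chunk)

    # most_common() yields the most frequent first; reverse so that the
    # most frequent substrings end up at the end of the dictionary.
    return b"".join(reversed(chosen))
```

  The resulting bytes object could then be written out as the extra
"dictionary" record for the group. A real implementation would probably
want to merge overlapping substrings rather than store each one separately.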

  So naively I would expect that we would benefit from using dictionaries
and from some mechanism to access the results of compression runs done
without a dictionary in order to prepare one for subsequent compression.
The key element here (that is perhaps a bit non-standard) is that we are 
compressing the file records individually from scratch in order to permit 
subsequent random access without having to decompress the entire file up 
to that point.
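
  As a minimal sketch of that access pattern, the example below compresses
each record as an independent zlib stream against a shared preset
dictionary, using the zdict argument of Python's standard zlib module (which
corresponds to deflateSetDictionary() and inflateSetDictionary() in the C
library). The helper names and the sample data are made up for illustration.

```python
import zlib


def compress_record(record, zdict=None):
    """Compress one record as a self-contained zlib stream."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(record) + comp.flush()


def decompress_record(blob, zdict=None):
    """Decompress one record; zdict must match the one used to compress."""
    if zdict:
        decomp = zlib.decompressobj(zdict=zdict)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    # Hypothetical group dictionary and records, for illustration only.
    group_dict = b"detector=tracker;status=ok;"
    records = [b"detector=tracker;status=ok;run=1",
               b"detector=tracker;status=ok;run=2"]

    blobs = [compress_record(r, group_dict) for r in records]

    # Random access: any record can be read back on its own, given only
    # its compressed bytes and the group dictionary.
    assert decompress_record(blobs[1], group_dict) == records[1]

    for r, b in zip(records, blobs):
        print(len(r), len(compress_record(r)), len(b))
```

  Each stream written this way carries a checksum of the dictionary in its
header and can only be decompressed with that same dictionary, so the
dictionary record has to be kept alongside the data records.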

  Thus some zlib API for this use case would potentially be interesting to 
us. I was also not able to figure out how to do it without modifying the
zlib code myself. (Of course, suggestions as to some better way of doing this
are also quite welcome!)

                                 thanks,
                                   Pete

-------------------------------------------------------------------------
Peter Elmer     E-mail: Peter.Elmer at cern.ch      Phone: +41 (22) 767-4644
Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
-------------------------------------------------------------------------