[Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?
Peter Elmer
Peter.Elmer at cern.ch
Sat Apr 17 04:10:09 EDT 2010
Hi,
On Fri, Apr 16, 2010 at 06:37:01PM -0700, Greg Roelofs wrote:
> >> Is there any reason you can't just run zlib on some typical HTML
> >> files, perhaps concatenated, and dump the strings corresponding to
> >> (distance, length) pairs?
>
> > That's exactly what I'd like to do, but I fail to find the zlib API
> > function call to dump the strings. Anything I am missing in zlib.h?
>
> Very few libraries provide an API to access their internals; "internal"
> is, almost by definition, the antithesis of an API. I was suggesting
> adding some printfs at strategic locations in the encode or decode
> functions. There may even already be some debug-type stuff in there
> that can be enabled with a macro, though I wouldn't count on it.
<...>
> This is not a commonly used option; it adds complexity with relatively
> little benefit except when compressing a whole bunch of similar, small
> files.
I've been meaning for some time to ask this same question regarding a
zlib interface for dictionary discovery. In my field we have a file format
which is made up of a very large number of very small records, each
individually compressed with zlib. The compressed records are stored
together in a file. Later they are read and decompressed individually, with
random access to the individual records. Before compression, the individual
records can indeed be fairly small on average: tens to hundreds of kB.
The data stored in these records is not known in advance; however, groups
of them do in fact contain similar data and are known to be grouped, even
if they are stored separately. It would not be difficult for us to introduce
some sort of learning phase into the application "workflow" which produces
these files. The dictionary used for a group of records could also easily be
stored as just another record type in the same file as a file format
extension.
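To make the "learning phase" idea concrete, here is a rough sketch in Python
(whose standard zlib module wraps the same library). This is only one naive
heuristic, not something zlib provides: the function name build_dictionary,
the n-gram length, and the size cap are assumptions chosen for illustration.
It counts fixed-length substrings across sample records, keeps those seen in
at least two records, and emits the most common ones last, since zlib.h
recommends putting the most commonly used strings toward the end of the
dictionary.

```python
from collections import Counter


def build_dictionary(samples, ngram=8, max_size=32768):
    """Naive dictionary builder (hypothetical helper, not part of zlib).

    samples  -- iterable of bytes objects (uncompressed sample records)
    ngram    -- substring length to count (an arbitrary choice)
    max_size -- cap on dictionary size; deflate's window is 32 kB
    """
    counts = Counter()
    for rec in samples:
        # Count each substring once per record, so that a single very
        # repetitive record does not dominate the statistics.
        seen = {rec[i:i + ngram] for i in range(len(rec) - ngram + 1)}
        counts.update(seen)

    # Keep substrings shared by at least two records, rarest first,
    # so the most common strings end up at the end of the dictionary.
    shared = [g for g, c in sorted(counts.items(), key=lambda kv: kv[1])
              if c >= 2]
    out = b"".join(shared)

    # If too large, keep the tail, which holds the most common strings.
    return out[-max_size:]
```

Overlapping substrings are stored redundantly here, so a real learning phase
would want something smarter; the point is only that the dictionary can be
derived from a sample of a record group and then stored alongside it.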
So naively I would expect that we would benefit from using dictionaries
and from some mechanism to access the results of compression runs done
without a dictionary in order to prepare one for subsequent compression.
The key element here (perhaps a bit non-standard) is that we compress
the file records individually from scratch in order to permit subsequent
random access without having to decompress the entire file up to that
point.
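As a minimal sketch of the usage pattern I have in mind, again in Python: each
record is compressed from scratch with the same preset dictionary, so any one
record can be decompressed on its own. The zdict argument corresponds to
deflateSetDictionary()/inflateSetDictionary() in the C API. The helper names
and the sample field names are made up for illustration.

```python
import zlib


def compress_record(record: bytes, zdict: bytes) -> bytes:
    # A fresh compressor per record: no state is shared between records,
    # only the preset dictionary.
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(record) + c.flush()


def decompress_record(blob: bytes, zdict: bytes) -> bytes:
    # The decompressor must be given the same dictionary that was used
    # when the record was compressed.
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


# Hypothetical dictionary and records for one group of similar records.
group_dict = (b"run_number=;event_number=;detector=calorimeter;"
              b"energy_gev=;momentum_gev=;")
records = [
    b"run_number=1234;event_number=98765;detector=calorimeter;"
    b"energy_gev=45.2;momentum_gev=44.9;",
    b"run_number=1234;event_number=98766;detector=calorimeter;"
    b"energy_gev=12.7;momentum_gev=12.1;",
]

blobs = [compress_record(r, group_dict) for r in records]

# Random access: decompress the second record without touching the first.
second = decompress_record(blobs[1], group_dict)
```

The dictionary itself has to be available at read time, which is why storing
it as an extra record in the same file seems natural.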
Thus some zlib API for this use case would potentially be interesting to
us. I was also not able to figure out how to do it without modifying the
zlib code myself. (Of course, suggestions as to some better way of doing
this are also quite welcome!)
thanks,
Pete
-------------------------------------------------------------------------
Peter Elmer E-mail: Peter.Elmer at cern.ch Phone: +41 (22) 767-4644
Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
-------------------------------------------------------------------------