[Zlib-devel] stored-block size limit?
Mark Adler
madler at alumnus.caltech.edu
Mon Apr 8 23:16:01 EDT 2002
On Wednesday, April 3, 2002, at 07:39 AM, Greg Roelofs wrote:
> It's not clear to me if there's some technical (tuning) reason for this
> connection or if it's simply a matter of using the buffer sizes that
> happen to be available. There's no particular reason why you *couldn't*
> emit a 64K stored block at one of the lower memLevels, is there?
Sure you could emit it, but you might not know if you should. The trick
is that in order to know that the 64K of data could not be compressed to
less than 64K, so you need to be able to compress 64K. If you don't
have enough buffer space to compress 64K, then you can't do that. Small
memLevels limit how many literals can be accumulated, hence limit the
size of a stored block.
The literal buffer size is 2^(memLevel + 6), hence 16,384 literals for
the default memLevel of 8. For incompressible data, almost all of the
bytes will become literals, hence the limit of about 16K on the stored
block size with the default settings.
The other problem I noted is a little more subtle. If the window is
smaller than the literal buffer size, then the beginning of the block of
uncompressed data will be lost off of the bottom of the window before
the deflate block is complete. Then by the time you know whether you
should compress or not, it's too late! You've already lost the start of
the block that you would have copied into the stored block. You can't
reconstruct the uncompressed data from the compressed information in the
buffers, since there may be some matches to strings that are also
already long gone. So you have to emit a fixed or dynamic coded block.
My suggested fix for the second problem is to simply force memLevel to
be no greater than windowBits - 6. This would result in some
degradation in compression when forced, due to smaller dynamic blocks
with more relative overhead for the tree descriptions. But it would
avoid the currently unknown expansion of incompressible data for those
cases.
Or the alternative is to not fix it at all, if reasonable bounds can be
calculated for the expansion of data using dynamic/fixed blocks.
Hmm. Now that I'm writing this, an approach occurs to me. (Stream of
consciousness follows.) Get a bound for fixed blocks, since that's
easier, and since a dynamic block won't be emitted if a fixed block
would be smaller. For fixed blocks, the largest literal is nine bits.
The worst case for length/distance pairs would be for coding only three
bytes. The length code in that case is seven bits, and the worst
distance code is 18 bits. So that's 25 bits to code three bytes.
That's less than 27 bits using the worst literals. So the maximum
expansion is with literals at nine bits per byte, or 12.5% expansion.
In addition there are the three bits to start the fixed block and seven
bits to end it. So the worst fixed block is 9n + 10 bits, where n is
the number of uncompressed bytes in the block. The worst case is all
literals, so n is 2^(memLevel + 6). So the expansion factor is 1.125 +
1.25/n. The smallest n is 128, so the expansion factor is less than
1.125 + 0.01 = 1.135. Viola! It does seem conservative compared to the
observed 5% or so, but at least it is easily calculable.
So it is safe to assume an expansion by no more than 1/7 of the
uncompressed size. Or for a shift-only calculation that's a tad closer,
1/8 + 1/64 of the uncompressed size. So the new zlib function that I
recommended to return the maximum compressed size could return:
n + ((n + 7) >> 3) + ((n + 63) >> 6) + 11
where the 11 covers the zlib wrapper and small stored blocks, and easily
covers small fixed blocks as well.
It doesn't seem terribly onerous to me to have to allocate about 14%
more buffer for the output, so this may be the right solution. Or a
more complicated function could take into account the memLevel and
windowBits settings, and return smaller bounds for combinations where
stored blocks are possible.
mark
p.s. Greg's comment on the zlib technical details page is apparently
correct. I do like to calculate stuff like this.
More information about the Zlib-devel
mailing list