[Zlib-devel] scan_tree & send_tree in trees.c do not follow RFC1951 page 14 guidelines
Frédéric Kayser
f.kayser at free.fr
Fri Apr 5 09:23:19 EDT 2013
Hello,
I have noticed this small phrase in RF1951 page 14:
The code length repeat codes can cross from HLIT + 257 to the
HDIST + 1 code lengths. In other words, all code lengths form
a single sequence of HLIT + HDIST + 258 values.
Since Zlib builds the Huffman header using two separate lists/trees (Literals/End-of-Block/Lengths and Distances) and never reunite them:
in build_bl_tree(s)
/* Determine the bit length frequencies for literal and distance trees */
scan_tree(s, (ct_data *)s->dyn_ltree, s->l_desc.max_code);
scan_tree(s, (ct_data *)s->dyn_dtree, s->d_desc.max_code);
in send_all_trees(s, lcodes, dcodes, blcodes)
send_tree(s, (ct_data *)s->dyn_ltree, lcodes-1); /* literal tree */
send_tree(s, (ct_data *)s->dyn_dtree, dcodes-1); /* distance tree */
it misses a small and rare Huffman table optimization, for instance here are side by side dump snippets of the same file produced by Defdb, the first file has been compressed using Gzip, the second is optimized by Defluff:
[_] 0xFF CL (val: 0) [_] 0xFF CL (val: 0)
[3] EofB CL (val: 7) [3] EofB CL (val: 7)
[3] l_00 CL (val: 5) (length 3) [3] l_00 CL (val: 5) (length 3)
[4] LREP (3 times) [5] LREP (4 times)
[_] l_01 CL (val: 5) (length 4) [_] l_01 CL (val: 5) (length 4)
[_] l_02 CL (val: 5) (length 5) [_] l_02 CL (val: 5) (length 5)
[_] l_03 CL (val: 5) (length 6) [_] l_03 CL (val: 5) (length 6)
[3] d_00 CL (val: 5) (distance 1) [_] d_00 CL (val: 5) (distance 1)
[4] d_01 CL (val: 1) (distance 2) [5] d_01 CL (val: 1) (distance 2)
[3] d_02 CL (val: 5) (distance 3) [3] d_02 CL (val: 5) (distance 3)
[4] d_03 CL (val: 0) (distance 4) [4] d_03 CL (val: 0) (distance 4)
[4] d_04 CL (val: 3) (distances [5-6]) [4] d_04 CL (val: 3) (distances [5-6])
Gzip finds only 4 repetitions of CL 5 recorded as a single CL length followed by a factor 3 repetition (the next CL5 is recorded as a single CL), whereas Defluff finds 5 repetitions (four on the Literals/End-of-Block/Lengths CL "side" and one on the Distances CL "side") recorded as a single CL length followed by a factor 4 repetition.
CL lists/trees as seen by Zlib:
[_] 0xFF CL (val: 0)
[3] EofB CL (val: 7)
[3] l_00 CL (val: 5)
[4] LREP (3 times)
[_] l_01 CL (val: 5) (length 4)
[_] l_02 CL (val: 5) (length 5)
[_] l_03 CL (val: 5) (length 6)
--- artificial border between Literals/End-of-Block/Lengths CL and Distances CL created by the two separate calls to scan_tree ---
[3] d_00 CL (val: 5) (distance 1)
[4] d_01 CL (val: 1) (distance 2)
[3] d_02 CL (val: 5) (distance 3)
[4] d_03 CL (val: 0) (distance 4)
[4] d_04 CL (val: 3) (distances [5-6])
Since there's apparently some tailroom in dyn_ltree I may try to write a patch to introduce a new function scan_trees(s, (ct_data *)s->dyn_ltree, (ct_data *)s->dyn_dtree, lcodes-1, dcodes-1) that will put dyn_dtree right after dyn_ltree before the scan is actually performed this way it would catch repetitions crossing the two lists/trees.
By the way they are other caveats in scan_tree for instance it handles repetitions in a greedy way. If the same CL repeats 10 times it will output: a single CL, a factor 7 repetition, a single CL and a single CL again (1+7+1+1) whereas it could emit a single CL a factor 6 repetition and a factor 3 repetition (1+6+3), I'm not saying that the second option is always better but currently scan_tree does not even try alternate options.
Test files attached.
Best regards
--
Frédéric Kayser
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20130405/52c8fae8/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 3rep_gzip.gz
Type: application/x-gzip
Size: 156 bytes
Desc: not available
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20130405/52c8fae8/attachment.bin>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20130405/52c8fae8/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 4rep_defluff.gz
Type: application/x-gzip
Size: 155 bytes
Desc: not available
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20130405/52c8fae8/attachment-0001.bin>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://madler.net/pipermail/zlib-devel_madler.net/attachments/20130405/52c8fae8/attachment-0002.html>
More information about the Zlib-devel
mailing list