[Zlib-devel] Increased accuracy in guessing strm->data_type

Sun Oct 3 23:25:51 EDT 2004

Intro: I don't want this message to delay the release of v1.2.2.
Items (1) and (2) can be addressed now; item (3) requires some testing,
although the proposed code is very simple. If you feel that item (3)
cannot be sufficiently tested by the next public release, it can be
postponed for the release of v1.2.3.

**

1).
Quote from ChangeLog:
  - Fix trees.c to update strm->data_type (no one ever noticed!)

Perhaps that's because no one ever cared.

Only a fraction of the zlib apps might care about it (essentially, the
applications that process zip files). Apparently, even those ones didn't
care enough by now.

But now, the update of strm->data_type incurs extra computation (little,
but still...), and the apps that do not care about the value of this
field, might care about the performance penalty. I propose to trigger
this computation only of the user explicitly requests it. That is, I
propose to set it to something different that Z_UNKNOWN (for instance,
Z_BINARY) by default. If the user app does not care about it, it may
remain as is. Otherwise, it is possible to change this field to
Z_UNKNOWN after calling deflateInit() or deflateReset(), but before
calling deflate().

Although this is a change from the current behavior, there are no
compatibility concerns, since the current behavior is broken, anyway.

**

2).
Z_ASCII? No, Z_TEXT!

Z_ASCII is a misleading name, because apps that might care about
strm->data_type, are really interested if they are processing a text
file (should they display it as text? should they convert line endings
between LF and CRLF? etc.) Being ASCII vs. being another non-English
alphabet (including the Latin ones that use accented symbols beyond the
strict ASCII set) is not the issue here, so a name like Z_TEXT is more
appropriate than Z_ASCII.

**

If you agree with the considerations (1) and (2), here is a patch that
addresses them:

--- zlib.h~     Fri Sep 10 01:50:14 2004
+++ zlib.h      Sat Oct 02 00:43:00 2004
@@ -97,7 +97,7 @@
     free_func  zfree;   /* used to free the internal state */
     voidpf     opaque;  /* private data object passed to zalloc and zfree */

-    int     data_type;  /* best guess about the data type: ascii or binary */
+    int     data_type;  /* best guess about the data type: text or binary */
     uLong   adler;      /* adler32 value of the uncompressed data */
     uLong   reserved;   /* reserved for future use */
 } z_stream;
@@ -172,7 +172,8 @@
 /* compression strategy; see deflateInit2() below for details */

 #define Z_BINARY   0
-#define Z_ASCII    1
+#define Z_TEXT     1
+#define Z_ASCII    Z_TEXT  /* for compatibility with the old zlib */
 #define Z_UNKNOWN  2
 /* Possible values of the data_type field (though see inflate()) */

@@ -282,10 +283,11 @@
     deflate() sets strm->adler to the adler32 checksum of all input read
   so far (that is, total_in bytes).

-    deflate() may update data_type if it can make a good guess about
-  the input data type (Z_ASCII or Z_BINARY). In doubt, the data is considered
-  binary. This field is only for information purposes and does not affect
-  the compression algorithm in any manner.
+    If strm->data_type is set to Z_UNKNOWN after calling deflateInit() or
+  deflateReset(), deflate() tries to guess the input data type (binary or
+  text), and set strm->data_type to Z_BINARY or Z_TEXT, accordingly. If in
+  doubt, the data is considered binary. This field is only for information
+  purposes and does not affect the compression algorithm in any manner.

     deflate() returns Z_OK if some progress has been made (more input
   processed or more output produced), Z_STREAM_END if all input has been

--- deflate.c~  Tue Feb 24 07:30:54 2004
+++ deflate.c   Sat Oct 02 01:10:00 2004
@@ -1,5 +1,5 @@
 /* deflate.c -- compress data using the deflation algorithm
- * Copyright (C) 1995-2003 Jean-loup Gailly.
+ * Copyright (C) 1995-2004 Jean-loup Gailly.
  * For conditions of distribution and use, see copyright notice in zlib.h
  */

@@ -367,7 +367,11 @@

     strm->total_in = strm->total_out = 0;
     strm->msg = Z_NULL; /* use zfree if we ever allocate msg dynamically */
-    strm->data_type = Z_UNKNOWN;
+
+    /* If you want deflate() to guess the data type, set this to Z_UNKNOWN
+     * after calling deflateInit() or deflateReset().
+     */
+    strm->data_type = Z_BINARY;

     s = (deflate_state *)strm->state;
     s->pending = 0;

**

3).
Although the correctness of guessing the value of strm->data_type can
never be guaranteed, the current guessing scheme, which is the same one
that is used in PKZIP, is particularly dumb.

First, it mislabels binary files as text so many times. In my
experience, while transfering zip files between Windows and Unix, and
asking for the conversion of line endings, I got corrupted extractions
due to mislabeling of some binary files (most notably, PDF and MS-Word
documents) as text.

It is my understanding that other zip users had similar experiences.

I think it's safer to assume that no text stream contains any
non-whitespace control character: if a single non-ws control char is
found, then the whole stream should be labeled as binary.

Second, the current scheme mislabels some important categories of text
files as binary. The scheme works well for English texts, and for
non-English texts written in a Latin script; but it fails on non-Latin
scripts (Greek, Cyrillic, Asian, Hebrew, Arabic, ...)
I agree that guessing on UCS-2 and UCS-4 encodings is a lost cause, but
ISO-8859 extensions, UTF-8, KOI-8, and all the other extensions to the
ASCII alphabet can be handled much better by simply considering the
characters 128-255 as important as the characters 32-127.

Under these considerations, I'm proposing a new algorithm that is not
only more accurate (in my opinion), but also faster:

    if stream is empty then:
        return Z_BINARY;  /* know nothing; assume it's binary */
    for each ch in stream do:
        if iscntrl(ch) and not isspace(ch) then:
            return Z_BINARY;
    return Z_TEXT;

... and here is the implementation:

--- trees.c~	Tue Feb 24 07:36:36 2004
+++ trees.c	Sat Oct 02 00:56:00 2004
@@ -1,5 +1,5 @@
 /* trees.c -- output deflated data using Huffman coding
- * Copyright (C) 1995-2003 Jean-loup Gailly
+ * Copyright (C) 1995-2004 Jean-loup Gailly
  * For conditions of distribution and use, see copyright notice in zlib.h
  */

@@ -152,7 +152,7 @@
                               int blcodes));
 local void compress_block OF((deflate_state *s, ct_data *ltree,
                               ct_data *dtree));
-local void set_data_type  OF((deflate_state *s));
+local int eval_data_type  OF((deflate_state *s));
 local unsigned bi_reverse OF((unsigned value, int length));
 local void bi_windup      OF((deflate_state *s));
 local void bi_flush       OF((deflate_state *s));
@@ -930,8 +930,10 @@
     /* Build the Huffman trees unless a stored block is forced */
     if (s->level > 0) {

-         /* Check if the file is ascii or binary */
-        if (s->strm->data_type == Z_UNKNOWN) set_data_type(s);
+         /* Check if the file is text or binary */
+        if (s->strm->data_type == Z_UNKNOWN)
+            s->strm->data_type =
+                ((stored_len > 0) ? eval_data_type(s) : Z_BINARY);

         /* Construct the literal and distance trees */
         build_tree(s, (tree_desc *)(&(s->l_desc)));
@@ -1117,21 +1119,24 @@
 }

 /*
===========================================================================
- * Set the data type to ASCII or BINARY, using a crude approximation:
- * binary if more than 20% of the bytes are <= 6 or >= 128, ascii otherwise.
- * IN assertion: the fields freq of dyn_ltree are set and the total of all
- * frequencies does not exceed 64K (to fit in an int on 16 bit machines).
+ * Check if the data type is TEXT or BINARY, using a crude approximation:
+ * return Z_TEXT if all symbols are either printable characters (33 to 255)
+ * or white spaces (9 to 13, or 32); return Z_BINARY otherwise.
+ * Note: if the stream is empty, this function returns Z_TEXT, although
+ * Z_BINARY would be more appropriate. This situation is better handled
+ * outside this function.
+ * IN assertion: the fields Freq of dyn_ltree are set.
  */
-local void set_data_type(s)
+local int eval_data_type(s)
     deflate_state *s;
 {
-    int n = 0;
-    unsigned ascii_freq = 0;
-    unsigned bin_freq = 0;
-    while (n < 7)        bin_freq += s->dyn_ltree[n++].Freq;
-    while (n < 128)    ascii_freq += s->dyn_ltree[n++].Freq;
-    while (n < LITERALS) bin_freq += s->dyn_ltree[n++].Freq;
-    s->strm->data_type = bin_freq > (ascii_freq >> 2) ? Z_BINARY : Z_ASCII;
+    int n;
+
+    for (n = 0; n <= 8; n++)
+        if (s->dyn_ltree[n].Freq != 0) return Z_BINARY;
+    for (n = 14; n <= 31; n++)
+        if (s->dyn_ltree[n].Freq != 0) return Z_BINARY;
+    return Z_TEXT;
 }

 /*
===========================================================================

**

As you see, the newer algorithm uses 27 iterations, instead of 256.
If you agree with (3), maybe there is no need to consider (1), because
the speed penalty is near to nothing.

Best regards,
Cosmin