[Zlib-devel] zlib gzopen_w function added

Mon Mar 19 15:47:39 EDT 2012

On 3/17/2012 10:58 AM, Mark Adler wrote:
> On Mar 17, 2012, at 1:20 AM, William A. Rowe Jr. wrote:
>> We would be very happy if zlib APIs spoke UTF-8.  Otherwise this
>> is altogether a misguided effort.
> 
> Bill,
> 
> I will certainly defer to the judgement of the Windows experts here, since I don't use it.  Let me discuss the rationale for the gzopen_w() function.
> 
> zlib's APIs already do speak UTF-8 on everything except Windows.  On Windows, the open() function, which is used by gzopen(), does *not* accept UTF-8.  So the zlib interface as it now stands does not permit passing file names using UTF-8 or UTF-16 on Windows, i.e. no foreign characters at all.  So something needs to be done to support that.

Fair enough.  And I shouldn't be too rash; there are absolutely fantastic
reasons to support _w API entry points for native Unicode applications!
I didn't mean to badmouth such an API.

> The two choices as I understand it are: a) on Windows have gzopen() accept UTF-8 by converting the provided file name to UTF-16 and calling _wopen() instead of open(), or b) leave gzopen() alone and provide a new gzopen()-like function that accepts UTF-16 directly, calling _wopen().  I chose b) because it most closely matches the current interface in Windows, which provides an fopen() and a _wfopen(), where the former accepts a byte string that cannot be UTF-8 and the latter accepts a UTF-16 file name, and because it is the simplest, most direct approach: it requires no conversions and uses the Windows native Unicode character representation (UTF-16).
> 
> If I understand you correctly, you would strongly prefer option a).

So we need to provide the maximum flexibility to everyone, which means
supporting 1. the native code page (no ability to decode alien
characters), which needs to remain the 'default' behavior, and 2. newly
introduced W(ide) API entry points (gzopenW would be the convention on
Windows), carefully treating non-filename byte buffers as bytes, not chars.

And I'd suggest 3. also providing an entry point, an alternate build
flag, or both, to treat all true char data as UTF-8.  This was the APR
project's solution (used by Subversion, httpd, etc.).
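
A minimal sketch of what such a UTF-8 entry point could look like, shown
for the fopen()/_wfopen() pair discussed above rather than for zlib
itself.  The names fopen_utf8 and widen_utf8 are hypothetical and belong
to neither zlib nor APR; a gzopen() counterpart would presumably convert
the path the same way and hand it to gzopen_w().  On non-Windows
platforms the path is passed through unchanged.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to a
 * malloc'ed UTF-16 string.  Returns NULL on invalid input or OOM. */
static wchar_t *widen_utf8(const char *s)
{
    /* First call asks for the required length, including the NUL. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                s, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;
    w = (wchar_t *)malloc((size_t)n * sizeof(wchar_t));
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical entry point: path is UTF-8 on every platform. */
FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    wchar_t *wpath = widen_utf8(path);
    wchar_t *wmode = widen_utf8(mode);
    FILE *f = (wpath && wmode) ? _wfopen(wpath, wmode) : NULL;

    free(wpath);   /* free(NULL) is a no-op */
    free(wmode);
    return f;
#else
    /* Elsewhere file names are plain byte strings; pass through. */
    return fopen(path, mode);
#endif
}
```

The caller keeps a single char-based code path; only the Windows branch
pays for a conversion, and invalid UTF-8 fails the open instead of
silently mapping to some other file name.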

I'm happy to help on this; in fact I wrote one of the few fast utf8/ucs2
translators which didn't open the big fat security hole that the Sun J2SE
exposed.  So yes, proceed with gzopenW, let's assume _stati64 and family
are always used so everything on Windows supports 'LARGE FILES', and all
that is left is a simple method for using the full UTF-8 charset.  The
entire local code page concept is broken anyway for portable applications
such as zip/unzip.  (NFD vs NFC for Unicode is similarly broken but let's
not discuss that evil, now ;-)
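
For illustration, a minimal sketch of a strict UTF-8 to UTF-16
translator of the kind mentioned above.  It is not the APR code; the
name utf8_to_utf16_strict and its interface are hypothetical.  Which
decoder flaw the J2SE remark refers to is not stated in this message;
the sketch assumes the common one, acceptance of overlong
(non-shortest-form) sequences, and also rejects encoded surrogates and
truncated input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 to UTF-16 units.
 * Returns the number of units written, not counting the terminating
 * 0, or -1 on malformed input or if 'cap' units are not enough. */
long utf8_to_utf16_strict(const char *in, uint16_t *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (*s) {
        uint32_t cp;    /* code point being assembled */
        uint32_t min;   /* smallest value legal for this length */
        int extra;      /* continuation bytes still expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; min = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; min = 0x10000; }
        else return -1;               /* stray continuation or bad lead */
        s++;

        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80)  /* also catches a premature NUL */
                return -1;
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }

        if (cp < min) return -1;                      /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;  /* surrogate */
        if (cp > 0x10FFFF) return -1;                 /* out of range */

        if (cp < 0x10000) {
            if (n + 1 >= cap) return -1;  /* keep room for terminator */
            out[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap) return -1;
    out[n] = 0;
    return (long)n;
}
```

With this check in place the two-byte sequence C0 AF, an overlong
spelling of '/', is rejected instead of being decoded to a path
separator after any byte-level filtering has already run.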