[Zlib-devel] Optimizations needed for gzgets() (Zlib version 1.2.3)

Scott_Riley at amat.com Scott_Riley at amat.com
Thu Jan 7 17:15:28 EST 2010


Hi Gilles,

I used your code with some minor modifications, replacing fread() with
gzread() and fseek() with gzseek(). This approach actually made
decompression much slower. Processing a file of 14619 rows took almost
26 seconds with the new approach, whereas gzgets() took about 750
milliseconds. For comparison, processing an uncompressed file with fgets()
and fseek() takes approximately 250 milliseconds.

The only other change I made to your code is that I return a pointer to the
current buffer like gzgets(). The code is below:

Code that takes 26 seconds to read 14619 lines, all of which are
slightly less than 256 bytes long:
---------------------------------------------------------------------------------------------------------------------------------------------
char* ZEXPORT
dau_gzgets_buffered(char *line_buffer, int i_is_size, int nchars,
		    flatfile_t *flatfile_p)
{
	unsigned char tab_in[BUFFER_IN_CACHE_SIZE];
	char *b = line_buffer;
	int read_ascii_in_file = 0;	/* int, not size_t: gzread() returns -1 on error */
	int pos_in_char_line = 0;

	if ((i_is_size != 0) && (nchars == 0))
		return NULL;

	for (;;)
	{
		/* read at most one cache buffer, or nchars-1 bytes if that is smaller */
		int size_to_read_binary = BUFFER_IN_CACHE_SIZE;
		if (i_is_size != 0)
		{
			if ((nchars-1) < BUFFER_IN_CACHE_SIZE)
				size_to_read_binary = nchars-1;
		}

		if (size_to_read_binary > 0)
		{
		//	read_ascii_in_file = fread(&tab_in[0],1,(size_t)size_to_read_binary,f);
			read_ascii_in_file = gzread(flatfile_p->gz_file_p,
						    &tab_in[0], size_to_read_binary);
		}

		/* end of file or read error */
		if (read_ascii_in_file <= 0)
		{
			if (pos_in_char_line == 0)
				return NULL;
			else
			{
				line_buffer[pos_in_char_line] = '\0';
				return b == line_buffer && pos_in_char_line > 0 ? b : NULL;
			}
		}

		if (read_ascii_in_file > 0)
		{
			int i;
			for (i = 0; i < read_ascii_in_file; i++)
			{
				char c;
				c = (((char)tab_in[(i)]));

				/* carriage returns are dropped */
				if (c != 0x0d)
				{
					if (c == '\n')
					{
					//	fseek(f,-1 * (long)(read_ascii_in_file - (i+1)),SEEK_CUR);
						/* push the unread tail of tab_in back by seeking backwards */
						gzseek(flatfile_p->gz_file_p,
						       -1 * (long)(read_ascii_in_file - (i+1)), SEEK_CUR);
						if (i_is_size != 0)
						{
							/* only if you want \n at end of string:
							line_buffer[pos_in_char_line++]='\n';*/
						}
						line_buffer[pos_in_char_line] = '\0';
						/* note: an empty line leaves pos_in_char_line at 0,
						   so NULL is returned for it */
						return b == line_buffer && pos_in_char_line > 0 ? b : NULL;
					}

					/* line buffer is full: push back the rest and return */
					if ((i_is_size == 1) && (pos_in_char_line == (nchars-1)))
					{
					//	fseek(f, -1 * (long)(read_ascii_in_file - i),SEEK_CUR);
						gzseek(flatfile_p->gz_file_p,
						       -1 * (long)(read_ascii_in_file - i), SEEK_CUR);
						line_buffer[pos_in_char_line] = '\0';
						return b == line_buffer && pos_in_char_line > 0 ? b : NULL;
					}

					line_buffer[pos_in_char_line++] = c;
				}
			}
		}
		read_ascii_in_file = 0;
	}

	/* not reached */
	return b == line_buffer && pos_in_char_line > 0 ? b : NULL;
}
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
I would have thought that fewer reads on the file would be faster than
reading the file one character at a time; however, the new code also seeks
back to just after the last '\n' every time.

Do you have any further suggestions?

Mark - Since you have been in the process of optimizing gzgets(), could you
provide some feedback on the issue? Also, if you have a working prototype I
would be very interested in testing it out.

Thank you,
Scott



                                                                          
                                                                          
                                                                          
From: "Gilles Vollant" <info at winimage.com>
Sent by: zlib-devel-bounces at madler.net
Date: 01/07/2010 05:38 AM
To: <zlib-devel at madler.net>
Subject: Re: [Zlib-devel] Optimizations needed for gzgets() (Zlib version 1.2.3)
Please respond to: zlib-devel at madler.net

gzgets calls gzread once for each character.

I suggest you use gzread with a bigger buffer instead (0x100 bytes or more,
for example).


Just to give an idea, here is a buffered gets I wrote for another
application.
There is an af_fseek call you must remove (fseek is not fast in gzio.c);
instead, use a persistent tab_in and read_ascii_in_file.



#define BUFFER_IN_CACHE_SIZE (0x100)

int u_fgets_buffered(char* line,int i_is_size,int size,FILE* f)
{
  unsigned char tab_in[BUFFER_IN_CACHE_SIZE];
  size_t read_ascii_in_file = 0;
  int pos_in_char_line = 0;

  if ((i_is_size != 0) && (size == 0))
    return EOF;

  for (;;)
  {
    int size_to_read_binary = BUFFER_IN_CACHE_SIZE;
    if (i_is_size != 0)
      if ((size-1) < BUFFER_IN_CACHE_SIZE)
        size_to_read_binary = size-1;

    if (size_to_read_binary>0)
      read_ascii_in_file = fread(&tab_in[0],1,(size_t)size_to_read_binary,f);
    if (read_ascii_in_file<=0)
    {
      if (pos_in_char_line == 0)
        return EOF;
      else
      {
        line[pos_in_char_line]='\0';
        return pos_in_char_line;
      }
    }

    if (read_ascii_in_file > 0)
    {
      int i;
      for (i=0;i<(int)read_ascii_in_file;i++)
      {
        char c;
        c = (((char)tab_in[(i)])) ;

        if (c!=0x0d)
        {
          if (c=='\n')
          {
            fseek(f,-1 * (long)(read_ascii_in_file - (i+1)),SEEK_CUR);
            if (i_is_size!=0) {
              /* only if you want \n at end of string :
                 line[pos_in_char_line++]='\n';*/
            }
            line[pos_in_char_line]='\0';
            return pos_in_char_line;
          }

          if ((i_is_size==1) && (pos_in_char_line == (size-1)))
          {
            fseek(f,-1 * (long)(read_ascii_in_file - i),SEEK_CUR);
            line[pos_in_char_line]='\0';
            return pos_in_char_line;
          }

          line[pos_in_char_line++] = c;
        }
      }
    }
    read_ascii_in_file = 0;
  }
}
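
The "persistent tab_in and read_ascii_in_file" idea could be sketched roughly
as follows. This is only an illustration, not code from zlib or from either
application: line_reader, lr_init, lr_gets, lr_read_fn, LR_BUF_SIZE and the
in-memory mem_source are all hypothetical names. The cache lives in a struct
that survives between calls, so the bytes after a '\n' stay in the cache for
the next call and no seek is needed at all. That matters because zlib.h
describes gzseek() on a file opened for reading as emulated and potentially
extremely slow. Unlike the function above, this sketch keeps the '\n' (as
gzgets() does) and does not strip 0x0d.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LR_BUF_SIZE 4096  /* hypothetical cache size; tune as needed */

/* Read callback with the same shape as gzread(): returns the number of
   bytes stored in buf, 0 at end of stream, or a negative value on error. */
typedef int (*lr_read_fn)(void *ctx, void *buf, unsigned len);

/* Hypothetical reader state: the cache outlives each call, so unread
   bytes are kept here rather than recovered with a backward seek. */
typedef struct {
    lr_read_fn read;
    void *ctx;
    unsigned char buf[LR_BUF_SIZE];
    int pos;   /* index of the next unread byte in buf */
    int len;   /* number of valid bytes in buf */
} line_reader;

void lr_init(line_reader *lr, lr_read_fn read, void *ctx)
{
    lr->read = read;
    lr->ctx = ctx;
    lr->pos = 0;
    lr->len = 0;
}

/* Like fgets()/gzgets(): stores at most size-1 characters plus '\0',
   keeps the '\n', and returns line, or NULL if nothing could be read. */
char *lr_gets(line_reader *lr, char *line, int size)
{
    int n = 0;

    assert(lr != NULL && line != NULL);
    if (size <= 0)
        return NULL;
    while (n < size - 1) {
        if (lr->pos == lr->len) {            /* cache empty: refill it */
            int got = lr->read(lr->ctx, lr->buf, LR_BUF_SIZE);
            if (got <= 0)
                break;                       /* end of stream or error */
            lr->pos = 0;
            lr->len = got;
        }
        char c = (char)lr->buf[lr->pos++];
        line[n++] = c;
        if (c == '\n')
            break;
    }
    line[n] = '\0';
    return n > 0 ? line : NULL;
}

/* In-memory source, used only to exercise the reader without zlib. */
typedef struct {
    const char *data;
    size_t len;
    size_t off;
} mem_source;

int mem_read(void *ctx, void *buf, unsigned len)
{
    mem_source *m = ctx;
    size_t left = m->len - m->off;
    size_t n = left < len ? left : len;

    memcpy(buf, m->data + m->off, n);
    m->off += n;
    return (int)n;
}
```

With zlib, the callback would presumably be a thin wrapper that casts ctx to
a gzFile and returns gzread(file, buf, len). One line_reader would then be
kept per open file, next to the gzFile, for as long as the file is being read.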

-----Original Message-----
From: zlib-devel-bounces at madler.net [mailto:zlib-devel-bounces at madler.net]
On Behalf Of Enrico Weigelt
Sent: Thursday, January 7, 2010 01:32
To: zlib-devel at madler.net
Subject: Re: [Zlib-devel] Optimizations needed for gzgets() (Zlib version
1.2.3)

Scott_Riley at amat.com wrote:

> I think you hit the nail right on the head. The current application we have
> integrated with the Zlib library requires processing files one line at a
> time; therefore we need to use gzgets() for compressed files since we use
> fgets() for uncompressed files.

hmm, maybe you could call zcat as subprocess and assign its output
to a FILE ? ;-o
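
A minimal sketch of that idea, assuming a POSIX system where popen() is
available. The function name count_lines_from_command is hypothetical, and
counting lines is just a stand-in for whatever per-line processing the
application does; the point is that the command's output arrives as an
ordinary FILE that plain fgets() can read.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Run `command` through the shell and count the lines it writes to
   stdout, reading them with plain fgets(). Returns -1 on failure. */
long count_lines_from_command(const char *command)
{
    char line[512];
    long count = 0;
    int at_line_start = 1;   /* next chunk begins a new line */
    FILE *fp;

    assert(command != NULL);
    fp = popen(command, "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, (int)sizeof line, fp) != NULL) {
        size_t n = strlen(line);

        if (at_line_start)
            count++;
        /* a chunk without a trailing '\n' means the line continues */
        at_line_start = (n > 0 && line[n - 1] == '\n');
    }
    if (pclose(fp) == -1)
        return -1;
    return count;
}
```

A call such as count_lines_from_command("zcat data.gz") would then read the
decompressed lines of a (hypothetical) data.gz. File names that come from
untrusted input would need proper shell quoting before being placed in the
command string.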

BTW: if you would consider rewriting the app on top of some VFS layer, you
could use libmvfs - it doesn't have zlib support right now, but I'm
going to add it soon anyway.




