Date: Sun, 22 Aug 2010 13:04:00 -0400 (EDT) From: Garrett Wollman <wollman@hergotha.csail.mit.edu> To: ed@80386.nl Cc: current@freebsd.org Subject: Re: why GNU grep is fast Message-ID: <201008221704.o7MH40pV088795@hergotha.csail.mit.edu> In-Reply-To: <20100822163644.GU2978@hoeg.nl> References: <20100822163644.GU2978@hoeg.nl> <201008210231.o7L2VRvI031700@ducky.net>
next in thread | previous in thread | raw e-mail | index | archive | help
In article <20100822163644.GU2978@hoeg.nl> you write: >I think that implementing a simple fgrep boils down to mmap()ing a file >and calling memmem() on the mapping to search for the input string. Of >course this relies on having an efficient memmem() implementation, for >example using one of the algorithms mentioned in this thread. It's actually more complicated than that, because you have to ensure that you are not matching the middle of a multibyte character, when the current locale specifies a character set with a multibyte encoding. Likewise when searching for the newlines that delimit the matched line. (I'm not sure whether FreeBSD supports any character encodings that would be ambiguous in that way.) I don't think this was considered an issue when Mike Haertel was developing GNU grep. It seems reasonable to implement BMG or some other fast search in memmem(). Note that if you can't (or don't want to) mmap the whole file at once, you'll need special handling for the boundary conditions -- both at the string search level and at the search for line delimiters. This is much easier in the fgrep case, obviously, since the length of the query puts a finite upper bound on the amount of the old buffer you need to keep -- with regexps you really need your regexp engine to be able to report its matching state, or else limit your input to strictly conforming POSIX text files (i.e., line lengths limited to {LINE_MAX}). -GAWollman
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201008221704.o7MH40pV088795>