Date: Thu, 29 Jul 1999 21:16:53 +0900 From: "Daniel C. Sobral" <dcs@newsguy.com> To: Tim Vanderhoek <vanderh@ecf.utoronto.ca> Cc: James Howard <howardjp@wam.umd.edu>, freebsd-hackers@FreeBSD.ORG Subject: Re: replacing grep(1) Message-ID: <37A04635.59E74FF6@newsguy.com> References: <Pine.GSO.4.10.9907272344440.19477-100000@rac9.wam.umd.edu> <379F3701.F79E10DF@newsguy.com> <19990728191531.A318@mad>
next in thread | previous in thread | raw e-mail | index | archive | help
Tim Vanderhoek wrote: > > On Thu, Jul 29, 1999 at 01:59:45AM +0900, Daniel C. Sobral wrote: > > > > Sorry, but a simplistic analysis like that just won't cut for grep. > > The algorithmic complexity is highly relevant here. Try this: > > Algorithmic complexity!?! Yup. > It's a freaking grep application. There is no freaking algorithmic > complexity. Pattern matching is one of the prime examples of algorithmic complexity. You can add complexity very trivially. > At least not outside of our regex library, anyways. I had not looked at the source, so I didn't know exactly how the application did it's stuff. Now I did, and I'll comment. Let's say the number of patterns is N, and the total number of characters to be examined is S. Let's call the "unmodified" complexity C, just for the sake of simplifying comparision using a dangerous simplification. First, the new grep uses fgetln(). fgetln() searches for a new line. So each character is examined (at least) twice. That's C+S*read already. GNU Grep uses mmap() (or read(), but not in FreeBSD), so it doesn't incur in this additional complexity. Also, fgetln() will copy the line buffer from time to time, though that's not a simple computation, and probably of little significance. In addition to that, the new grep copies the fgrepln() result each time. Add S*copy to C. Next, the new grep tests the lines against each pattern separately! GNU grep doesn't. That's just *outside* the regexp library. Now, whether the complexity is inside or outside the regexp library, I don't care. It's complexity all the same. So it *must* be factored in. > The test you > suggested doesn't show anything about that algorithmic complexity, > though. Yeah? Try to back that with the results of the tests I suggested. > If we have a slow regex library, though, I would consider that a > separate problem from a slow grep. If the f*cking grep is f*cking slow, I don't give a f*ck where the problem is located! It just *IS*. GNU grep uses gnu regexp library, the new grep uses our own. If changing greps means changing to a library whose algorithm complexity is greater, then that *DOES* count against the change. For instance, a quick browse over GNU greps shows the gnu regexp library can factor in multiple patterns. That is not being done by the new grep. Does our regexp library support that? Now, here is a quick and dirty fix for the repeated malloc()/free(). Notice that this is what fgetln() does, in fact. I'm afraid, though, that's this is not anywhere near what would be needed by far to put the new grep anywhere near the league of GNU grep. I like the idea of a readable code, I like the idea of a BSD license, but it would be damn silly to replace a clearly superior grep, and that's where the thing stands right now. --- util.c.orig Thu Jul 29 19:14:17 1999 +++ util.c Thu Jul 29 20:49:16 1999 @@ -107,6 +107,8 @@ ln.file = fn; ln.line_no = 0; + ln.bufsize = 81; /* Magical constants, yeah! */ + ln.dat = grep_malloc(81); linesqueued = 0; if (Bflag > 0) @@ -115,11 +117,14 @@ ln.off = grep_tell(); if ((tmp = grep_getln(&ln.len)) == NULL) break; - ln.dat = grep_malloc(ln.len + 1); + if (ln.bufsize < ln.len + 1) + ln.dat = grep_realloc(ln.dat, ln.len + 1); memcpy(ln.dat, tmp, ln.len); - ln.dat[ln.len] = 0; if (ln.len > 0 && ln.dat[ln.len - 1] == '\n') ln.dat[--ln.len] = 0; + else + ln.dat[ln.len] = 0; + ln.line_no++; z = tail; @@ -127,9 +132,9 @@ enqueue(&ln); linesqueued++; } - free(ln.dat); c += t; } + free(ln.dat); if (Bflag > 0) clearqueue(); grep_close(); --- grep.h.orig Thu Jul 29 20:47:52 1999 +++ grep.h Thu Jul 29 20:48:34 1999 @@ -35,6 +35,7 @@ typedef struct { size_t len; + size_t bufsize; int line_no; int off; char *file; -- Daniel C. Sobral (8-DCS) dcs@newsguy.com dcs@freebsd.org "Is it true that you're a millionaire's son who never worked a day in your life?" "Yeah, I guess so." "Lemme tell you, son, you ain't missed a thing." To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?37A04635.59E74FF6>