From owner-freebsd-hackers Fri Jul 30 7:47:38 1999 Delivered-To: freebsd-hackers@freebsd.org Received: from peach.ocn.ne.jp (peach.ocn.ne.jp [210.145.254.87]) by hub.freebsd.org (Postfix) with ESMTP id C47D0156D7 for ; Fri, 30 Jul 1999 07:47:27 -0700 (PDT) (envelope-from dcs@newsguy.com) Received: from newsguy.com by peach.ocn.ne.jp (8.9.1a/OCN) id XAA07181; Fri, 30 Jul 1999 23:46:58 +0900 (JST) Message-ID: <37A1AF27.C5B9DB7D@newsguy.com> Date: Fri, 30 Jul 1999 22:56:55 +0900 From: "Daniel C. Sobral" X-Mailer: Mozilla 4.6 [en] (Win98; I) X-Accept-Language: en,pt-BR,ja MIME-Version: 1.0 To: Tim Vanderhoek Cc: freebsd-hackers@FreeBSD.ORG Subject: Re: replacing grep(1) References: <379F3701.F79E10DF@newsguy.com> <19990728191531.A318@mad> <37A04635.59E74FF6@newsguy.com> <19990729182229.E24296@mad> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Transfer-Encoding: 7bit Sender: owner-freebsd-hackers@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Tim Vanderhoek wrote: > > I'm sorry. I've read your message and have decided that you're wrong. Not that you did bother to counter the points I made. You only comment on the one thing I said was probably insignificant. Are you taking your clues from me? :-) > Outside of the regexp library, algorithmic complexity is not a factor > here. It would take a beanbag to write anything other than an O(N) > algorithm. I said that I did not care whether the thing is inside or outside the regexp library. And a N*search+N*copy, as opposed to N*search, *is* relevant. And that N*copy is outside regexp. And, just for the reference, GNU Grep uses a dfa to identify likely matches before letting gnuregexp work. > The proposed grep is slow, very slow, and I've sent a long message to > James outlining how to make it much faster, but algorithmic complexity > is not an issue. So you say without having checked. > > > The test you suggested doesn't show anything about that algorithmic > > > complexity, though. > > > > Yeah? Try to back that with the results of the tests I suggested. > > No, it's not even worth my time. > > Now look. You've gotten me so upset I actually went and did a simple > test. The test showed I'm right and you're wrong. Catting X number > of copies of /etc/termcap into longfile causes the time grep uses > to pass longfile searching for all occurrences of "printer" causes > it to use an extra 0.03 seconds for every repetition of /etc/termcap > in longfile. > > Gee, linear complexity wrt to file length. Who could've guessed!? That does not *begin* to cover the cases I outlined. > What'ya bet GNU grep also exhibits linear complexity? :) > > Admit it, you jumped in with some bullshit about complexity when had > you taken the time to look into what James meant when he said "it now > spends 50% of its time in procline()" you would have kept quiet, > realizing that he was talking about a constant factor in the > complexity analysis, an subject where comments such as "it now spends > 50% of its time in procline()" are relevent. Ok, here is the _DATA_ backing my "bullshit". First table: searching for non-existent patterns Tests: 1) grep -e 123 world.build 2) grep -e 123456 world.build 3) grep -e 123 124 world.build 4) grep -e 123 456 world.build These were made with GNU grep, the version 0.9 of the new grep, and that version with the patch I sent previously (this later was non-intended -- only after completing the test I realized the executable was the one with my patches). Each test was repeated five times after both the executable and the target file were cached. I show here the averages of the line "real" for time. The user and sys values were actually more interesting, but with much greater deviation. :-) GNU grep New grep Patched grep 1) 0.09945s 0.4460s 0.3870s 2) 0.07225s 0.4424s 0.3894s 3) 0.12200s 0.6352s 0.5814s 4) 0.18240s 0.6364s 0.5796s One can clearly see that GNU grep has a much better complexity in the cases of longer patterns or multiple patterns with common prefix. It also shows that the new grep spends a lot of time in an activity not related to the search itself, since it does multiple patterns by calling regexec() multiple times, but 2:1 is not the proportion you see up there. Also, the patch I introduced to eliminate N*(malloc()+free()), N being the number of lines searched, significantly reduces that overhead (overhead as in, *beyond* the time spent in regexec()). Second table: searching for existing patterns Tests: 1) grep -e net world.build >/dev/null 2) grep -e netipx world.build >/dev/null 3) grep -e netinet world.build >/dev/null 4) grep -e netinet -e netipx world.build >/dev/null GNU grep New grep 1) 0.10750s 0.57060s 2) 0.07575s 0.46375s 3) 0.07416s 0.46700s 4) 0.09950s 0.67440s Though these tests involve more factors because each has a different number of matches, it again shows very clearly that the new grep has increased complexity in the case of multiple patterns. See there, cases 1 and 4. The latter has *less* matches than the former. Third table: non-existing pattern on different sized files Tests: 1) grep 123 world.build 2) grep 123 world.build.2 (two times world.build) 3) grep 123 world.build.3 (three times world.build) 4) grep 123 world.build.4 (four times world.build) GNU grep New grep 1) 0.09600s 0.44750s 2) 0.16425s 0.89075s 3) 0.24760s 1.30850s 4) 0.31833s 1.75900s Linear, it would seem... but, alas, this is to be expected. Grep searches inside lines, and the above does not increase the size of a line, only the number of them. Still, it's a relief that the new grep does not have a worse performance in this most simple test. Fourth table: non-existing patterns on files with different line sizes. Tests: 1) grep abc line10 2) grep abc line20 3) grep 124 line10 4) grep 124 line20 Line10 and line20 have line sizes of 10 and 20, respectively. They are both around 1 Mb in size, and same size. The lines are 1234567890 and 12345678901234567890, so the first pattern is immediatly ignored, and the second has two matches before failing. GNU grep New grep 1) 0.03600s 0.41916s 2) 0.03520s 0.25828s 3) 0.03642s 0.43400s 4) 0.03783s 0.27442s These are very interesting numbers. First, it shows that new grep performs *better* when the line size decreases! This can only mean that a lot of the time spent on *each* line is spent *outside* regexec(). Just FYI, O(M+N) is a greater complexity than O(M). It would seem that GNU grep is superior in the case of partial matches without a full match too, but the standard deviation for the GNU grep case is too big for the relative differences in the GNU grep case have any meaning. Next time you "think" the complexity doesn't matter, do it with non-FreeBSD code. -- Daniel C. Sobral (8-DCS) dcs@newsguy.com dcs@freebsd.org "Is it true that you're a millionaire's son who never worked a day in your life?" "Yeah, I guess so." "Lemme tell you, son, you ain't missed a thing." To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message