From owner-freebsd-hackers  Fri Jul 30  7:47:38 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from peach.ocn.ne.jp (peach.ocn.ne.jp [210.145.254.87])
	by hub.freebsd.org (Postfix) with ESMTP id C47D0156D7
	for <freebsd-hackers@FreeBSD.ORG>; Fri, 30 Jul 1999 07:47:27 -0700 (PDT)
	(envelope-from dcs@newsguy.com)
Received: from newsguy.com by peach.ocn.ne.jp (8.9.1a/OCN) id XAA07181; Fri, 30 Jul 1999 23:46:58 +0900 (JST)
Message-ID: <37A1AF27.C5B9DB7D@newsguy.com>
Date: Fri, 30 Jul 1999 22:56:55 +0900
From: "Daniel C. Sobral" <dcs@newsguy.com>
X-Mailer: Mozilla 4.6 [en] (Win98; I)
X-Accept-Language: en,pt-BR,ja
MIME-Version: 1.0
To: Tim Vanderhoek <vanderh@ecf.utoronto.ca>
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: replacing grep(1)
References: <Pine.GSO.4.10.9907272344440.19477-100000@rac9.wam.umd.edu> <379F3701.F79E10DF@newsguy.com> <19990728191531.A318@mad> <37A04635.59E74FF6@newsguy.com> <19990729182229.E24296@mad>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Tim Vanderhoek wrote:
> 
> I'm sorry.  I've read your message and have decided that you're wrong.

Not that you did bother to counter the points I made. You only
comment on the one thing I said was probably insignificant. Are you
taking your clues from me? :-)

> Outside of the regexp library, algorithmic complexity is not a factor
> here.  It would take a beanbag to write anything other than an O(N)
> algorithm.

I said that I did not care whether the thing is inside or outside
the regexp library. And a N*search+N*copy, as opposed to N*search,
*is* relevant. And that N*copy is outside regexp.

And, just for the reference, GNU Grep uses a dfa to identify likely
matches before letting gnuregexp work.

> The proposed grep is slow, very slow, and I've sent a long message to
> James outlining how to make it much faster, but algorithmic complexity
> is not an issue.

So you say without having checked.

> > > The test you suggested doesn't show anything about that algorithmic
> > > complexity, though.
> >
> > Yeah? Try to back that with the results of the tests I suggested.
> 
> No, it's not even worth my time.
> 
> Now look.  You've gotten me so upset I actually went and did a simple
> test.  The test showed I'm right and you're wrong.  Catting X number
> of copies of /etc/termcap into longfile causes the time grep uses
> to pass longfile searching for all occurrences of "printer" causes
> it to use an extra 0.03 seconds for every repetition of /etc/termcap
> in longfile.
> 
> Gee, linear complexity wrt to file length.  Who could've guessed!?

That does not *begin* to cover the cases I outlined.

> What'ya bet GNU grep also exhibits linear complexity?  :)
> 
> Admit it, you jumped in with some bullshit about complexity when had
> you taken the time to look into what James meant when he said "it now
> spends 50% of its time in procline()" you would have kept quiet,
> realizing that he was talking about a constant factor in the
> complexity analysis, an subject where comments such as "it now spends
> 50% of its time in procline()" are relevent.

Ok, here is the _DATA_ backing my "bullshit".

First table: searching for non-existent patterns

Tests:
1) grep -e 123 world.build
2) grep -e 123456 world.build
3) grep -e 123 124 world.build
4) grep -e 123 456 world.build

These were made with GNU grep, the version 0.9 of the new grep, and
that version with the patch I sent previously (this later was
non-intended -- only after completing the test I realized the
executable was the one with my patches).

Each test was repeated five times after both the executable and the
target file were cached. I show here the averages of the line "real"
for time. The user and sys values were actually more interesting,
but with much greater deviation. :-)

	GNU grep	New grep	Patched grep
1)	0.09945s	0.4460s		0.3870s
2)	0.07225s	0.4424s		0.3894s
3)	0.12200s	0.6352s		0.5814s
4)	0.18240s	0.6364s		0.5796s

One can clearly see that GNU grep has a much better complexity in
the cases of longer patterns or multiple patterns with common
prefix.

It also shows that the new grep spends a lot of time in an activity
not related to the search itself, since it does multiple patterns by
calling regexec() multiple times, but 2:1 is not the proportion you
see up there. Also, the patch I introduced to eliminate
N*(malloc()+free()), N being the number of lines searched,
significantly reduces that overhead (overhead as in, *beyond* the
time spent in regexec()).

Second table: searching for existing patterns

Tests:
1) grep -e net world.build >/dev/null
2) grep -e netipx world.build >/dev/null
3) grep -e netinet world.build >/dev/null
4) grep -e netinet -e netipx world.build >/dev/null

	GNU grep	New grep
1)	0.10750s	0.57060s
2)	0.07575s	0.46375s
3)	0.07416s	0.46700s
4)	0.09950s	0.67440s

Though these tests involve more factors because each has a different
number of matches, it again shows very clearly that the new grep has
increased complexity in the case of multiple patterns. See there,
cases 1 and 4. The latter has *less* matches than the former.

Third table: non-existing pattern on different sized files

Tests:
1) grep 123 world.build
2) grep 123 world.build.2 (two times world.build)
3) grep 123 world.build.3 (three times world.build)
4) grep 123 world.build.4 (four times world.build)

	GNU grep	New grep
1)	0.09600s	0.44750s
2)	0.16425s	0.89075s
3)	0.24760s	1.30850s
4)	0.31833s	1.75900s

Linear, it would seem... but, alas, this is to be expected. Grep
searches inside lines, and the above does not increase the size of a
line, only the number of them. Still, it's a relief that the new
grep does not have a worse performance in this most simple test.

Fourth table: non-existing patterns on files with different line
sizes.

Tests:
1) grep abc line10
2) grep abc line20
3) grep 124 line10
4) grep 124 line20

Line10 and line20 have line sizes of 10 and 20, respectively. They
are both around 1 Mb in size, and same size. The lines are
1234567890 and 12345678901234567890, so the first pattern is
immediatly ignored, and the second has two matches before failing.

	GNU grep	New grep
1)	0.03600s	0.41916s
2)	0.03520s	0.25828s
3)	0.03642s	0.43400s
4)	0.03783s	0.27442s

These are very interesting numbers. First, it shows that new grep
performs *better* when the line size decreases! This can only mean
that a lot of the time spent on *each* line is spent *outside*
regexec(). Just FYI, O(M+N) is a greater complexity than O(M).

It would seem that GNU grep is superior in the case of partial
matches without a full match too, but the standard deviation for the
GNU grep case is too big for the relative differences in the GNU
grep case have any meaning.

Next time you "think" the complexity doesn't matter, do it with
non-FreeBSD code.

--
Daniel C. Sobral			(8-DCS)
dcs@newsguy.com
dcs@freebsd.org

	"Is it true that you're a millionaire's son who never worked a day
in your life?"
	"Yeah, I guess so."
	"Lemme tell you, son, you ain't missed a thing."


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message