From: Mel Flynn
To: freebsd-hackers@freebsd.org
Cc: Gabor Kovesdan
Date: Tue, 3 Nov 2009 23:14:45 +0100
Subject: Re: Issue with grep -i (on i386 only?)

On Tuesday 03 November 2009 22:19:05 Gabor Kovesdan wrote:
> Mel Flynn wrote:
> > Hi,
> >
> > attached a little test script for grep's -i performance. I tried a few
> > different machines and the 64-bit 7.2 machine I could steal doesn't seem
> > to be affected and outperforms pcregrep.
>
> Note that pcregrep isn't POSIX regex, so it's not a good basis for
> comparison.
> PCRE provides a POSIX-compliant interface to deal with
> Perl-compatible regex for those who are already familiar with the
> former, but it's still Perl regex and not POSIX! That's why some people
> get confused when PCRE comes up.

I realize this, but for the case in question it does not matter. Both
'regexes' should do the same in PCRE and POSIX. I provided the comparison to
show that the 'problem of case insensitive comparison' is solvable, at the
very least for the simple case.

> > On i386 machines, grep -i is significantly slower:
> > i386, 7.2-STABLE of Sep 8, load averages: 0.00, 0.02, 0.00,
> > Mem: 336M Active, 442M Inact, 217M Wired, 38M Cache, 112M Buf, 198M Free
> > dev.cpu.0.freq: 2992 (Intel P-IV HTT enabled)
> > 16Meg file result:
> > =>>> 16777216
> > =>>> fgrep
> >     0.04 real     0.02 user     0.01 sys
> >     0.04 real     0.03 user     0.01 sys
> > =>>> pcregrep
> >     0.21 real     0.19 user     0.02 sys
> >     0.21 real     0.20 user     0.00 sys
> > =>>> grep
> >     0.04 real     0.02 user     0.01 sys << not -i
> >     3.64 real     3.61 user     0.01 sys << -i
>
> It's an interesting observation, I have never heard of this.
>
> > So it looks to me that, while there is a problem with case insensitive
> > comparison, just rewriting the expression is an optimization grep could
> > perform.
> > Either way, with the new text tools being written (done?) is this problem
> > being attacked, not fixable due to specifications or not considered an
> > issue? Any PR's needed / I missed? Patches to try?
> >
> > [And it just occurred to me bsdgrep is in ports]:
> > =>>> bsdgrep
> >     0.93 real     0.74 user     0.00 sys
> >     4.80 real     4.33 user     0.02 sys
> >     4.97 real     4.34 user     0.01 sys
> >
> > So here the optimization does not fly.
>
> Unfortunately, this is the most important issue with BSDL texttools. In
> the grep case, the BSDL version is ready and feature-complete but the
> performance isn't quite satisfying.
> The main reason for this is that GNU grep
> uses a lot of shortcuts, which results in bloated code (8000 LOC),
> while BSDL grep keeps everything simple and straightforward (1500 LOC).
> IMO, the desired solution would be to keep grep small and get a modern
> regex library for FreeBSD, which performs well. Pushing regex
> optimizations into grep is a bad idea because it not only makes the code
> bloated, but other regex users won't benefit from the optimization, so the
> problem should be fixed at its roots. And the current regex library we
> have is old, slow and doesn't support wchar at all.

With this kind of difference, I don't really care who performs the
optimization, but it seems that multiple options at the same character spot
are not handled very well, with an extra penalty for "case insensitive".
Why this isn't present on my 64-bit machine is a bit of a mystery to me, but
since almost no time is spent in sys, I can't blame it on the kernel.

> Btw, do you mind if I include your script into the BSD grep
> distribution? I already planned to write something like this for future
> testing.

Consider it public domain.
-- 
Mel