From: Wayne Sierke
To: Walt Pawley
Cc: Oliver Fromme, freebsd-questions@freebsd.org
Subject: Re: sed/awk, instead of Perl
Date: Mon, 25 Aug 2008 00:56:54 +0930
Message-Id: <1219591614.49053.142.camel@predator-ii.buffyverse>

On Sat, 2008-08-23 at 15:16 -0700, Walt Pawley wrote:
> At 10:01 AM +0100 8/23/08, Matthew Seaman wrote:
> >Walt Pawley wrote:
> >>
> >> At the risk of beating this to death, I just happened to
> >> stumble on a real world example of why one might want to use
> >> Perl for sed-ly stuff.
> >> ... snip ...
> >> wump$ ls -l Desktop/klog
> >> -rw-r--r-- 1 wump 1001 52753322 22 Aug 16:37 Desktop/klog
> >> wump$ time sed "s/ .*//" Desktop/klog > kadr1
> >>
> >> real 0m10.800s
> >> user 0m10.580s
> >> sys 0m0.250s
> >> wump$ time perl -pe 's/ .*//' Desktop/klog > kadr2
> >>
> >> real 0m0.975s
> >> user 0m0.700s
> >> sys 0m0.270s
> >> wump$ cmp kadr1 kadr2
> >> wump$
> >>
> >> Why disparity in execution speed? ...
> >
> >Careful now. Have you accounted for the effect of the klog file
> >being cached in VM rather than having to be read afresh from disk?
> >It makes a very big difference in how fast it is processed.
>
> No, I hadn't done any such accounting. So, wrote a little script
> you can surmise from the following output:
>
> wump$ sh -v spdtst
> time perl -pe 's/ .*//' Desktop/klog > /dev/null
>
> real 0m0.961s
> user 0m0.740s
> sys 0m0.230s
> time sed "s/ .*//" Desktop/klog > /dev/null
>
> real 0m10.506s
> user 0m10.270s
> sys 0m0.250s
> time awk '{print $1}' Desktop/klog > /dev/null
>
> real 0m2.333s
> user 0m2.140s
> sys 0m0.180s
> time sed "s/ .*//" Desktop/klog > /dev/null
>
> real 0m10.489s
> user 0m10.250s
> sys 0m0.230s
> time perl -pe 's/ .*//' Desktop/klog > /dev/null
>
> real 0m0.799s
> user 0m0.580s
> sys 0m0.220s
>

I see similar results on all four systems I tried here - an order of
magnitude difference between perl (fastest) and sed, with awk slightly
slower than perl. All are running perl 5.8.8. I did a handful of manual
runs and took the most consistent-looking results. The source file was
a 62MB apache log with 232k records. Interestingly, an Ubuntu system
exhibited a similar difference between perl and sed, but its awk was
slightly faster than perl.

> >In order to get meaningful data for this sort of test you should
> >do a dummy run or two of each command in fairly quick succession,
> >and then repeat your test runs a number of times and look at the
> >average and standard deviation of the execution times. ...
>
> Yeah, Hoyle would like that.
> But for me, I think the results
> are clear enough without all the messing with statistical
> computations. 10 to 1 or better is good enough for me to think
> there's some major difference. That said, it would appear that
> caching can make a difference - which is why I put the Perl
> invocation first ... so it would be running without the benefit
> of caching. But I don't believe I was entirely successful in
> that effort. The very first time I ran this, which was also the
> very first time in a whole day that the klog file had been
> accessed, the first Perl invocation took about 2 seconds of
> real time and still only 0.7 seconds of user time. I don't
> believe caching explains the execution speed disparity.
>
> It was mentioned that this function is made for awk, so I tried
> that as well. It is also evidently not as quick as Perl at
> doing the job. The time shown above is quite consistent with a
> number of other runs I've tried with awk.
>

Keep in mind that awk, while producing a comparable result, likely uses
quite a different parsing strategy. While the comparison is interesting
for this particular test-case, different circumstances could produce
very different results.

> I suspect a real Perl internals maven could explain this. I
> have some ideas but they're conjecture. Perhaps some effort to
> improve execution efficiency in sed and awk would not be wasted?

My conjecture is this: perl's own regular expression engine (perl does
not use the separate pcre library, which only imitates its syntax) most
likely has good optimisation for the "ends with .*" part of the pattern
(vs sed).

While the result is certainly interesting and perhaps surprising[1], it
is for a single, simple pattern, which is far too little to draw much in
the way of conclusions from - except perhaps that extracting the first
field from a data source with many records can possibly be effected
more rapidly with perl or awk than with sed.
Nevertheless, I've always dismissed perl as being "heavy and slow" on
the strength of anecdotal "evidence", and the results you found are a
pertinent reminder that assumptions like that are never worth much.

Wayne

[1] particularly in light of studies such as this one:
http://swtch.com/~rsc/regexp/regexp1.html
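
For anyone who wants to follow Matthew's suggestion (a dummy run or two
of each command, then repeated runs with the average and standard
deviation), a minimal Python sketch might look like the one below. The
names time_command, report, COMMANDS and LOG, the warm-up and run
counts, and the log path are illustrative assumptions rather than
anything from the thread. It measures wall-clock time only, so unlike
time(1) it does not separate user from sys time.

```python
import statistics
import subprocess
import time


def time_command(argv, warmups=2, runs=5):
    """Run argv repeatedly; return (mean, stdev) of wall-clock seconds.

    Output is discarded, like "> /dev/null" in the thread. Warm-up runs
    are executed but not timed, so every timed run sees the input file
    in the same cache state.
    """
    def run_once():
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        return time.perf_counter() - start

    for _ in range(warmups):
        run_once()
    samples = [run_once() for _ in range(runs)]
    # stdev needs at least two samples; report 0.0 for a single run.
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.mean(samples), stdev


# Hypothetical log path; the three commands are the ones timed in the thread.
LOG = "Desktop/klog"
COMMANDS = {
    "perl": ["perl", "-pe", "s/ .*//", LOG],
    "sed": ["sed", "s/ .*//", LOG],
    "awk": ["awk", "{print $1}", LOG],
}


def report(commands):
    """Print the mean and standard deviation for each named command."""
    for name, argv in commands.items():
        mean, stdev = time_command(argv)
        print(f"{name:5s} mean {mean:.3f}s  stdev {stdev:.3f}s")
```

Calling report(COMMANDS) on a machine that has the log file would print
one line per tool. A standard deviation that is small next to the gap
between the means would support Walt's point that a 10 to 1 difference
is not a caching artefact.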