Message-ID: <497A1BEE.7070709@FreeBSD.org>
Date: Fri, 23 Jan 2009 11:35:10 -0800
From: Doug Barton
To: Oliver Fromme
Cc: Yoshihiro Ota, freebsd-hackers@FreeBSD.ORG, xistence@0x58.com, cperciva@FreeBSD.ORG
In-Reply-To: <200901231109.n0NB933k069163@lurza.secnetix.de>
Subject: Re: freebsd-update's install_verify routine excessive stating

Oliver Fromme wrote:
> Yoshihiro Ota wrote:
> > Oliver Fromme wrote:
> > > It would be much better to generate two lists:
> > > - The list of hashes, as already done ("filelist")
> > > - A list of gzipped files present, stripped to the hash:
> > >
> > >    (cd files; echo *.gz) |
> > >    tr ' ' '\n' |
> > >    sed 's/\.gz$//' > filespresent
> > >
> > > Note we use "echo" instead of "ls", in order to avoid the
> > > kern.argmax limit.  64000 files would certainly exceed that
> > > limit.  Also note that the output is already sorted because
> > > the shell sorts wildcard expansions.
> > >
> > > Now that we have those two files, we can use comm(1) to
> > > find out whether there are any hashes in filelist that are
> > > not in filespresent:
> > >
> > >    if [ -n "$(comm -23 filelist filespresent)" ]; then
> > >            echo -n "Update files missing -- "
> > >            ...
> > >    fi
> > >
> > > That solution scales much better because no shell loop is
> > > required at all.
> >
> > This will probably be the fastest.
>
> Are you sure?  I'm not.

I'd put money on this being faster for a lot of reasons. test is a
builtin in our /bin/sh, so there is no exec involved for 'test -f',
but going out to disk for 64k files on an individual basis should
definitely be slower than getting the file list in one shot.

There's no doubt that the current routine is not efficient. The cat
should be eliminated, the following is equivalent:

	cut -f 2,7 -d '|' $@ |

(quoting the $@ won't make a difference here). I haven't seen the
files we're talking about, but I can't help thinking that cut | grep |
cut could be streamlined.

> Only a benchmark can answer that.

Agreed, when making changes like this you should always benchmark
them. I did a lot of that when working on portmaster 2.0 which is why
I have some familiarity with this issue.

> > awk -F "|" '
> >   $2 ~ /^f/{required[$7]=$7; count++}
> >   END{FS="[/.]";
> >     while("find files -name *.gz" | getline>0)
> >       if($2 in required)
> >         if(--count<=0)
> >           exit(0);
> >     exit(count)}' "$@"
>
> I think this awk solution is more difficult to read and
> understand, which means that it is also more prone to
> introduce errors.

I agree, but I have only passing familiarity with awk, so to someone
who knows awk this might look like "hello world." :)

Doug

-- 

    This .signature sanitized for your protection
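
For anyone who wants to try the comm(1) approach quoted above, here is a minimal sketch of it wrapped in a function. The name missing_hashes and its two arguments are hypothetical, chosen for illustration only, and are not part of freebsd-update. The sketch assumes the hashes contain no whitespace or glob characters, and it sorts both lists explicitly (an addition to the quoted snippet) so that comm(1) sees consistently ordered input regardless of locale.

```shell
# Sketch only: print every hash from a list that has no <hash>.gz file.
# missing_hashes is a hypothetical helper, not taken from freebsd-update.
#   $1: file containing one hash per line
#   $2: directory expected to hold <hash>.gz files
missing_hashes() {
    wanted=$(mktemp) || return 1
    present=$(mktemp) || { rm -f "$wanted"; return 1; }

    # comm(1) needs sorted input; force the C locale on both sides.
    LC_ALL=C sort -u "$1" > "$wanted"

    # echo instead of ls, as in the quoted proposal, so the expanded
    # glob is never passed to an external command's argument list.
    (cd "$2" && echo *.gz) |
        tr ' ' '\n' |
        sed 's/\.gz$//' |
        LC_ALL=C sort > "$present"

    # -23 keeps only the lines unique to the first file, that is,
    # the hashes that are wanted but not present.
    LC_ALL=C comm -23 "$wanted" "$present"

    rm -f "$wanted" "$present"
}
```

Called as `missing_hashes filelist files`, it prints nothing when every hash has a matching .gz file, so the caller can test the output with `[ -n "$(...)" ]` as in the quoted snippet. Whether this beats the awk version is still something only a benchmark can settle.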