Date:      Fri, 23 Jan 2009 11:35:10 -0800
From:      Doug Barton <dougb@FreeBSD.org>
To:        Oliver Fromme <olli@lurza.secnetix.de>
Cc:        Yoshihiro Ota <ota@j.email.ne.jp>, freebsd-hackers@FreeBSD.ORG, xistence@0x58.com, cperciva@FreeBSD.ORG
Subject:   Re: freebsd-update's install_verify routine excessive stating
Message-ID:  <497A1BEE.7070709@FreeBSD.org>
In-Reply-To: <200901231109.n0NB933k069163@lurza.secnetix.de>
References:  <200901231109.n0NB933k069163@lurza.secnetix.de>

Oliver Fromme wrote:
> Yoshihiro Ota wrote:
>  > Oliver Fromme wrote:
>  > > It would be much better to generate two lists:
>  > >  - The list of hashes, as already done ("filelist")
>  > >  - A list of gzipped files present, stripped to the hash:
>  > > 
>  > >    (cd files; echo *.gz) |
>  > >    tr ' ' '\n' |
>  > >    sed 's/\.gz$//' > filespresent
>  > > 
>  > > Note we use "echo" instead of "ls", in order to avoid the
>  > > kern.argmax limit.  64000 files would certainly exceed that
>  > > limit.  Also note that the output is already sorted because
>  > > the shell sorts wildcard expansions.
>  > > 
>  > > Now that we have those two files, we can use comm(1) to
>  > > find out whether there are any hashes in filelist that are
>  > > not in filespresent:
>  > > 
>  > >    if [ -n "$(comm -23 filelist filespresent)" ]; then
>  > >            echo -n "Update files missing -- "
>  > >            ...
>  > >    fi
>  > > 
>  > > That solution scales much better because no shell loop is
>  > > required at all.
>  > 
>  > This will probably be the fastest.
> 
> Are you sure?  I'm not.

I'd put money on this being faster for a lot of reasons. test is a
builtin in our /bin/sh, so there is no exec involved for 'test -f',
but going out to disk for 64k files on an individual basis should
definitely be slower than getting the file list in one shot.
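
For illustration, the two approaches being compared could be sketched
roughly as below. This is not the freebsd-update code: the function
names are made up for the example, and only the "filelist" and "files/"
layout is taken from the thread. The one-shot variant lists the
directory with ls (no wildcard, so no kern.argmax concern) and sorts
both inputs under LC_ALL=C so that comm(1) sees a consistent order.

```shell
#!/bin/sh
# Sketch: two ways to report hashes in ./filelist that have no
# matching ./files/<hash>.gz. Function names are hypothetical.

# Per-file check: one filesystem lookup per hash, using the shell's
# builtin test ([ -f ... ]).
missing_loop() {
    while read -r h; do
        [ -f "files/$h.gz" ] || echo "$h"
    done < filelist
}

# One-shot check: read the directory once, strip the .gz suffix,
# then let comm(1) print the lines found only in filelist.
# Side effect: writes ./filespresent, as in the quoted proposal.
missing_comm() {
    ls files | sed -n 's/\.gz$//p' | LC_ALL=C sort > filespresent
    LC_ALL=C sort filelist | LC_ALL=C comm -23 - filespresent
}
```

Run either function from the directory that holds filelist and files/;
both print nothing when every listed file is present.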

There's no doubt that the current routine is not efficient. The cat
should be eliminated; the following is equivalent:

cut -f 2,7 -d '|' $@ |

(quoting the $@ won't make a difference here).

I haven't seen the files we're talking about, but I can't help
thinking that cut | grep | cut could be streamlined.
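
One possible shape for that, as a sketch only: a single awk pass that
keeps lines whose second field starts with "f" and prints the seventh.
The field positions are an assumption taken from the awk script quoted
further down, the trailing sort -u is assumed so that the result can be
fed to comm(1), and hash_list is a hypothetical name.

```shell
#!/bin/sh
# Sketch: one awk process in place of cut | grep | cut.
# Assumes '|'-separated index lines with the entry type in field 2
# and the hash in field 7 (an assumption, not verified against the
# real index files).
hash_list() {
    awk -F '|' '$2 ~ /^f/ { print $7 }' "$@" | sort -u
}
```

Called as hash_list "$@" with the index files as arguments, it writes
the unique hashes to standard output.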

> Only a benchmark can answer that. 

Agreed, when making changes like this you should always benchmark
them. I did a lot of that when working on portmaster 2.0, which is why
I have some familiarity with this issue.

>  > awk -F "|" '
>  >   $2 ~ /^f/{required[$7]=$7; count++}
>  >   END{FS="[/.]";
>  >    while("find files -name *.gz" | getline>0)
>  >     if($2 in required)
>  >      if(--count<=0)
>  >       exit(0);
>  >   exit(count)}' "$@"
> 
> I think this awk solution is more difficult to read and
> understand, which means that it is also more prone to
> introduce errors. 

I agree, but I have only passing familiarity with awk, so to someone
who knows awk this might look like "hello world." :)

Doug

-- 

    This .signature sanitized for your protection



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?497A1BEE.7070709>