Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 22 Jan 2009 20:38:19 -0500
From:      Yoshihiro Ota <ota@j.email.ne.jp>
To:        Oliver Fromme <olli@lurza.secnetix.de>
Cc:        freebsd-hackers@FreeBSD.ORG, xistence@0x58.com, cperciva@FreeBSD.ORG
Subject:   Re: freebsd-update's install_verify routine excessive stating
Message-ID:  <20090122203819.585fb35f.ota@j.email.ne.jp>
In-Reply-To: <200901221217.n0MCHfY3086653@lurza.secnetix.de>
References:  <F671A531-2093-4CCC-B2AB-9339B19FECC4@0x58.com> <200901221217.n0MCHfY3086653@lurza.secnetix.de>

next in thread | previous in thread | raw e-mail | index | archive | help
Hi.

It's interesting. 

On Thu, 22 Jan 2009 13:17:41 +0100 (CET)
Oliver Fromme <olli@lurza.secnetix.de> wrote:

> Hi,
> 
> 
> So I would suggest to replace the whole pipe with this:
> 
>    awk -F "|" '$2 ~ /^f/ {print $2}' "$@" |
>    sort -u > filelist
> 
> It would be much better to generate two lists:
>  - The list of hashes, as already done ("filelist")
>  - A list of gzipped files present, stripped to the hash:
> 
>    (cd files; echo *.gz) |
>    tr ' ' '\n' |
>    sed 's/\.gz$//' > filespresent
> 
> Note we use "echo" instead of "ls", in order to avoid the
> kern.argmax limit.  64000 files would certainly exceed that
> limit.  Also note that the output is already sorted because
> the shell sorts wildcard expansions.
> 
> Now that we have those two files, we can use comm(1) to
> find out whether there are any hashes in filelist that are
> not in filespresent:
> 
>    if [ -n "$(comm -23 filelist filespresent)" ]; then
>            echo -n "Update files missing -- "
>            ...
>    fi
> 
> That solution scales much better because no shell loop is
> required at all.

This will probably be the fastest.

awk -F "|" '
  $2 ~ /^f/{required[$7]=$7; count++}
  END{FS="[/.]";
   while("find files -name *.gz" | getline>0)
    if($2 in required)
     if(--count<=0)
      exit(0);
  exit(count)}' "$@"

It verifies entries using hashtable instead of sort.
Therefore, it is O(n+m) instead of O(n*log(n))+O(m*log(m)) in theory.
I am not well aware how good awk's associate array is, though.

Regards,
Hiro



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20090122203819.585fb35f.ota>