Date: Fri, 23 Jan 2009 12:09:03 +0100 (CET) From: Oliver Fromme <olli@lurza.secnetix.de> To: ota@j.email.ne.jp (Yoshihiro Ota) Cc: freebsd-hackers@FreeBSD.ORG, xistence@0x58.com, cperciva@FreeBSD.ORG Subject: Re: freebsd-update's install_verify routine excessive stating Message-ID: <200901231109.n0NB933k069163@lurza.secnetix.de> In-Reply-To: <20090122203819.585fb35f.ota@j.email.ne.jp>
next in thread | previous in thread | raw e-mail | index | archive | help
Yoshihiro Ota wrote:
 > Oliver Fromme wrote:
 > > It would be much better to generate two lists:
 > >  - The list of hashes, as already done ("filelist")
 > >  - A list of gzipped files present, stripped to the hash:
 > > 
 > >    (cd files; echo *.gz) |
 > >    tr ' ' '\n' |
 > >    sed 's/\.gz$//' > filespresent
 > > 
 > > Note we use "echo" instead of "ls", in order to avoid the
 > > kern.argmax limit.  64000 files would certainly exceed that
 > > limit.  Also note that the output is already sorted because
 > > the shell sorts wildcard expansions.
 > > 
 > > Now that we have those two files, we can use comm(1) to
 > > find out whether there are any hashes in filelist that are
 > > not in filespresent:
 > > 
 > >    if [ -n "$(comm -23 filelist filespresent)" ]; then
 > >            echo -n "Update files missing -- "
 > >            ...
 > >    fi
 > > 
 > > That solution scales much better because no shell loop is
 > > required at all.
 > 
 > This will probably be the fastest.
Are you sure?  I'm not.
Only a benchmark can answer that.  See below.
 > awk -F "|" '
 >   $2 ~ /^f/{required[$7]=$7; count++}
 >   END{FS="[/.]";
 >    while("find files -name *.gz" | getline>0)
 >     if($2 in required)
 >      if(--count<=0)
 >       exit(0);
 >   exit(count)}' "$@"
I think this awk solution is more difficult to read and
understand, which means that it is also more prone to
introduce errors.  In fact, there are several problems
already:
First, you didn't quote the *.gz wildcard, so it will
fail if there happens to be a file "*.gz" in the current
directory.
Second, the script makes assumptions about the format of
the hash strings, e.g. that they don't contain dots.
(Currently they only contain hex digits, AFAICT, but what
if the format changes in the future.)
Third, exit(count) is a bad idea, because exit codes are
truncated to 8 bits.  If 256 files happen to be missing,
the script will seem to exit successfully.
All these flaws could be fixed, of course, but it will
introduce more complexity.  The shell code using sort(1)
and comm(1) doesn't have those flaws and appears more
robust.
 > It verifies entries using hashtable instead of sort.
 > Therefore, it is O(n+m) instead of O(n*log(n))+O(m*log(m)) in theory.
 > I am not well aware how good awk's associate array is, though.
It is pretty simple.  It's a hash list that starts with
50 buckets.  The number of buckets is doubled (and all
entries re-hashed!) when the average number of elements
per bucket exceeds 2.  The hash function is very simple,
it's just "hash = hash * 31 + c" for every character c
in the string, the result is modulo the current number
of buckets.  Note that freebsd-update uses SHA256 hashes
which are fairly long (64 characters).
BTW, you can easily make benchmarks.  The following command
will create 64000 files of the format "%64x.gz".  Be sure
to have enough free inodes on your file system ("df -i").
jot -rw "%08x" 64000 0 2000000000 | sed 's/.*/&&&&&&&&.gz/' | xargs touch
This took 2 minutes on my notebook (3 years old).  I also
noticed that the first 47000 inodes were created very
quickly (about 5 seconds), but then the speed dropped down
suddenly to about 150 inodes per second for the rest of
the two minutes.  CPU was 100% system according to top.
Removing the files is equally slow.
So there seems to be a limit of about 47000 inodes per
directory -- any more than that and it gets horribly
inefficient.  Therefore it would probably be a good idea
to change freebsd-update to create subdirectories and
distribute the files among them.  For example, make 16
subdirectories [0-9a-f] and put the files into them
according to the first digit of the hash.  This will
probably boost performance noticeably.
Sorting those 64000 files is really *not* an issue.  The
difference between "ls" and "ls -f" is only 0.2 seconds on
my notebook.  When using "ls -f | sort", sort(1) uses only
0.12 seconds of CPU time.  This is negligible.
Next I made a simple test with awk within that directory:
awk 'BEGIN{while("find . -name \\*.gz" | getline>0)required[$1]=$1}'
That script (which does only half of the required work)
takes 4 seconds, reproducible.  So I didn't bother to
write and test the full awk solution.
Bottom line:  The awk solution is less robust, less readable,
and much slower.
Best regards
   Oliver
-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd
"The most important decision in [programming] language design
concerns what is to be left out."  --  Niklaus Wirth
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200901231109.n0NB933k069163>
