From owner-freebsd-hackers@FreeBSD.ORG Fri Jan 23 11:09:08 2009 Return-Path: Delivered-To: freebsd-hackers@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 85F0A1065673; Fri, 23 Jan 2009 11:09:08 +0000 (UTC) (envelope-from olli@lurza.secnetix.de) Received: from lurza.secnetix.de (unknown [IPv6:2a01:170:102f::2]) by mx1.freebsd.org (Postfix) with ESMTP id 048668FC22; Fri, 23 Jan 2009 11:09:07 +0000 (UTC) (envelope-from olli@lurza.secnetix.de) Received: from lurza.secnetix.de (localhost [127.0.0.1]) by lurza.secnetix.de (8.14.3/8.14.3) with ESMTP id n0NB94Jn069165; Fri, 23 Jan 2009 12:09:05 +0100 (CET) (envelope-from oliver.fromme@secnetix.de) Received: (from olli@localhost) by lurza.secnetix.de (8.14.3/8.14.3/Submit) id n0NB933k069163; Fri, 23 Jan 2009 12:09:03 +0100 (CET) (envelope-from olli) From: Oliver Fromme Message-Id: <200901231109.n0NB933k069163@lurza.secnetix.de> To: ota@j.email.ne.jp (Yoshihiro Ota) Date: Fri, 23 Jan 2009 12:09:03 +0100 (CET) In-Reply-To: <20090122203819.585fb35f.ota@j.email.ne.jp> X-Mailer: ELM [version 2.5 PL8] MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.1.2 (lurza.secnetix.de [127.0.0.1]); Fri, 23 Jan 2009 12:09:05 +0100 (CET) Cc: freebsd-hackers@FreeBSD.ORG, xistence@0x58.com, cperciva@FreeBSD.ORG Subject: Re: freebsd-update's install_verify routine excessive stating X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 23 Jan 2009 11:09:09 -0000 Yoshihiro Ota wrote: > Oliver Fromme wrote: > > It would be much better to generate two lists: > > - The list of hashes, as already done ("filelist") > > - A list of gzipped files present, stripped to the hash: > > > > (cd files; echo *.gz) | > > tr ' ' '\n' | > > sed 's/\.gz$//' > filespresent > > > > Note we use "echo" instead of "ls", in order to avoid the > > kern.argmax limit. 64000 files would certainly exceed that > > limit. Also note that the output is already sorted because > > the shell sorts wildcard expansions. > > > > Now that we have those two files, we can use comm(1) to > > find out whether there are any hashes in filelist that are > > not in filespresent: > > > > if [ -n "$(comm -23 filelist filespresent)" ]; then > > echo -n "Update files missing -- " > > ... > > fi > > > > That solution scales much better because no shell loop is > > required at all. > > This will probably be the fastest. Are you sure? I'm not. Only a benchmark can answer that. See below. > awk -F "|" ' > $2 ~ /^f/{required[$7]=$7; count++} > END{FS="[/.]"; > while("find files -name *.gz" | getline>0) > if($2 in required) > if(--count<=0) > exit(0); > exit(count)}' "$@" I think this awk solution is more difficult to read and understand, which means that it is also more prone to introduce errors. In fact, there are several problems already: First, you didn't quote the *.gz wildcard, so it will fail if there happens to be a file "*.gz" in the current directory. Second, the script makes assumptions about the format of the hash strings, e.g. that they don't contain dots. (Currently they only contain hex digits, AFAICT, but what if the format changes in the future.) Third, exit(count) is a bad idea, because exit codes are truncated to 8 bits. If 256 files happen to be missing, the script will seem to exit successfully. All these flaws could be fixed, of course, but it will introduce more complexity. The shell code using sort(1) and comm(1) doesn't have those flaws and appears more robust. > It verifies entries using hashtable instead of sort. > Therefore, it is O(n+m) instead of O(n*log(n))+O(m*log(m)) in theory. > I am not well aware how good awk's associate array is, though. It is pretty simple. It's a hash list that starts with 50 buckets. The number of buckets is doubled (and all entries re-hashed!) when the average number of elements per bucket exceeds 2. The hash function is very simple, it's just "hash = hash * 31 + c" for every character c in the string, the result is modulo the current number of buckets. Note that freebsd-update uses SHA256 hashes which are fairly long (64 characters). BTW, you can easily make benchmarks. The following command will create 64000 files of the format "%64x.gz". Be sure to have enough free inodes on your file system ("df -i"). jot -rw "%08x" 64000 0 2000000000 | sed 's/.*/&&&&&&&&.gz/' | xargs touch This took 2 minutes on my notebook (3 years old). I also noticed that the first 47000 inodes were created very quickly (about 5 seconds), but then the speed dropped down suddenly to about 150 inodes per second for the rest of the two minutes. CPU was 100% system according to top. Removing the files is equally slow. So there seems to be a limit of about 47000 inodes per directory -- any more than that and it gets horribly inefficient. Therefore it would probably be a good idea to change freebsd-update to create subdirectories and distribute the files among them. For example, make 16 subdirectories [0-9a-f] and put the files into them according to the first digit of the hash. This will probably boost performance noticeably. Sorting those 64000 files is really *not* an issue. The difference between "ls" and "ls -f" is only 0.2 seconds on my notebook. When using "ls -f | sort", sort(1) uses only 0.12 seconds of CPU time. This is negligible. Next I made a simple test with awk within that directory: awk 'BEGIN{while("find . -name \\*.gz" | getline>0)required[$1]=$1}' That script (which does only half of the required work) takes 4 seconds, reproducible. So I didn't bother to write and test the full awk solution. Bottom line: The awk solution is less robust, less readable, and much slower. Best regards Oliver -- Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M. Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung: secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün- chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd "The most important decision in [programming] language design concerns what is to be left out." -- Niklaus Wirth