Date: Sun, 15 Aug 2004 01:32:35 +0200
From: Erik Trulsson
To: "Paul A. Hoadley"
cc: freebsd-questions@freebsd.org
Subject: Re: find -exec surprisingly slow
Message-ID: <20040814233234.GA56333@falcon.midgard.homeip.net>
In-Reply-To: <20040814230143.GB8610@grover.logicsquad.net>
References: <20040814230143.GB8610@grover.logicsquad.net>

On Sun, Aug 15, 2004 at 08:31:43AM +0930, Paul A. Hoadley wrote:
> Hello,
>
> I'm in the process of cleaning a Maildir full of spam.
> It has somewhere in the vicinity of 400K files in it.  I started
> running this yesterday:
>
>   find . -atime +1 -exec mv {} /home/paulh/tmp/spam/sne/ \;
>
> It's been running for well over 12 hours.  It certainly is
> working---the spams are slowly moving to their new home---but it is
> taking a long time.  It's a very modest system, running 4.8-R on a
> P2-350.  I assume this is all overhead for spawning a shell and
> running mv 400K times.

I wouldn't make that assumption.  The overhead of starting new
processes is probably only a relatively small part of the time.

You seem to have missed the fact that operations on very large
directories (and a directory with 400K files in it certainly
qualifies) are simply slow.

A directory is essentially just a list of the names of all the files
in it and their i-nodes.  To find a given file in a directory (e.g. in
order to create, delete or rename it) the system needs to do a linear
search through all the entries in the directory.  For directories
containing a large number of files this can take some time.

If you have the UFS_DIRHASH kernel option enabled (which I believe is
the default since 4.5-R) then the system will keep a bunch of hash
tables in memory to avoid having to search through the whole directory
every time.  There is, however, an upper limit on how much memory will
be used for such hash tables (2MB by default), and if this limit is
exceeded (which it probably is in your case) things will slow down
again.

The effect of the UFS_DIRHASH option is that instead of directory
operations starting to slow down after a few thousand files in the
same directory, you can have a few tens of thousands of files before
operations become noticeably slower.

I am quite certain that if those 400K files had been divided into 40
directories, each with 10K files in it, things would have been much
faster.

> Is there a better way to move all files based on some characteristic
> of their date stamp?
> Maybe separating the find and the move, piping it through xargs?
> It's mostly done now, but I will know better for next time.

Reducing the number of processes spawned will certainly help some, but
a better idea is to not have so many files in a single directory - that
is just asking for trouble.

-- 
Erik Trulsson
ertr1013@student.uu.se
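
For reference, the find-plus-xargs idea raised in the quoted question
could look roughly like the sketch below.  It is only an illustration,
not something from the original thread: the function name
move_old_files is made up, the -type f test is an added assumption (the
original command had none), and it relies on find -print0 and xargs -0,
which BSD and GNU tools provide but which older systems may lack.  It
also assumes the destination lies outside the tree being searched.

```shell
#!/bin/sh
# move_old_files SRC DEST [DAYS]   (hypothetical helper, for illustration)
# Move regular files under SRC whose access time is more than DAYS days
# old (default 1) into DEST, running mv once per batch of file names
# instead of once per file.  Assumes DEST is outside SRC.
move_old_files() {
    src=$1
    dest=$2
    days=${3:-1}

    mkdir -p "$dest" || return 1

    # -print0 / -0 pass names NUL-separated, so names containing spaces
    # or newlines survive.  xargs appends each batch of names after
    # "$dest"; the inner shell takes the destination as its first
    # argument, shifts it away, and moves the rest in one mv call.
    # The argument-count check covers xargs implementations that run
    # the command even when find produced no names.
    find "$src" -type f -atime +"$days" -print0 |
        xargs -0 sh -c 'dest=$1; shift; [ "$#" -eq 0 ] || mv -- "$@" "$dest"/' sh "$dest"
}
```

With the paths from the quoted command, a call would be
"move_old_files . /home/paulh/tmp/spam/sne 1".  This only cuts down the
number of mv processes; each rename still has to look up its entry in
the 400K-file directory, so the directory-size cost described above
remains.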