Date: Sun, 15 Aug 2004 01:32:35 +0200
From: Erik Trulsson <ertr1013@student.uu.se>
To: "Paul A. Hoadley" <paulh@logicsquad.net>
Cc: freebsd-questions@freebsd.org
Subject: Re: find -exec surprisingly slow
Message-ID: <20040814233234.GA56333@falcon.midgard.homeip.net>
In-Reply-To: <20040814230143.GB8610@grover.logicsquad.net>
References: <20040814230143.GB8610@grover.logicsquad.net>
On Sun, Aug 15, 2004 at 08:31:43AM +0930, Paul A. Hoadley wrote:
> Hello,
>
> I'm in the process of cleaning a Maildir full of spam.  It has
> somewhere in the vicinity of 400K files in it.  I started running
> this yesterday:
>
> find . -atime +1 -exec mv {} /home/paulh/tmp/spam/sne/ \;
>
> It's been running for well over 12 hours.  It certainly is
> working---the spams are slowly moving to their new home---but it is
> taking a long time.  It's a very modest system, running 4.8-R on a
> P2-350.  I assume this is all overhead for spawning a shell and
> running mv 400K times.

I wouldn't make that assumption.  The overhead of starting new
processes is probably only a relatively small part of the time.  You
seem to have missed the fact that operations on very large directories
(and a directory with 400K files in it certainly qualifies) are simply
slow.

A directory is essentially just a list of the names of all the files
in it and their i-nodes.  To find a given file in a directory (e.g. in
order to create, delete or rename it) the system needs to do a linear
search through all the entries in the directory.  For directories
containing large numbers of files this can take some time.

If you have the UFS_DIRHASH kernel option enabled (which I believe is
the default since 4.5-R) then the system will keep a bunch of hash
tables in memory to avoid having to search through the whole directory
every time.  There is, however, an upper limit to how much memory will
be used for such hash tables (2MB by default), and if this limit is
exceeded (which it probably is in your case) things will slow down
again.  The effect of the UFS_DIRHASH option is that instead of
directory operations starting to slow down after a few thousand files
in the same directory, you can have a few tens of thousands of files
before operations become noticeably slower.
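As for the process-spawning part: the original command runs one mv per
file because of the `\;` terminator.  A sketch of the batched
alternative (my illustration, using throwaway directories created with
mktemp): find's POSIX `-exec ... {} +` terminator packs as many
pathnames as fit into each invocation, and since mv wants the
destination last, a small sh wrapper reorders the arguments.

```shell
set -e
workdir=$(mktemp -d)
mkdir "$workdir/maildir" "$workdir/spam"

# Create 50 dummy "spam" files to stand in for the real Maildir.
i=0
while [ $i -lt 50 ]; do
    touch "$workdir/maildir/msg.$i"
    i=$((i + 1))
done

# "-exec ... {} +" batches many pathnames into one command, so mv is
# spawned once per batch instead of once per file.  The sh wrapper
# receives the destination as $1, shifts it off, and puts it last.
find "$workdir/maildir" -type f \
    -exec sh -c 'dest=$1; shift; mv "$@" "$dest"' sh "$workdir/spam" {} +
```

An equivalent pipeline through `find ... -print0 | xargs -0` works
too, though the `+` form avoids the extra pipe and quoting concerns.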
I am quite certain that if those 400K files had been divided into 40
directories, each with 10K files in it, things would have been much
faster.

> Is there a better way to move all files based
> on some characteristic of their date stamp?  Maybe separating the
> find and the move, piping it through xargs?  It's mostly done now,
> but I will know better for next time.

Reducing the number of processes spawned will certainly help some, but
a better idea is not to have so many files in a single directory -
that is just asking for trouble.

-- 
<Insert your favourite quote here.>
Erik Trulsson
ertr1013@student.uu.se
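One common way to keep per-directory counts small, as the reply
suggests, is to fan a flat directory out into fixed buckets.  A sketch
(my illustration, not from the mail; the bucket count and the use of
cksum as a cheap stable hash are arbitrary choices):

```shell
set -e
base=$(mktemp -d)
mkdir "$base/flat"

# Create 40 dummy files in one flat directory.
i=0
while [ $i -lt 40 ]; do
    touch "$base/flat/mail.$i"
    i=$((i + 1))
done

# Distribute each file into one of $buckets subdirectories, chosen by
# a checksum of its name, so no single directory grows unboundedly.
buckets=4
for f in "$base/flat"/*; do
    name=$(basename "$f")
    # cksum prints "checksum bytecount"; keep the checksum field.
    sum=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    dir="$base/split/$((sum % buckets))"
    mkdir -p "$dir"
    mv "$f" "$dir/"
done
```

With a layout like this, each lookup scans (or hashes over) a
directory a fraction of the original size, which is exactly the effect
the reply attributes to splitting 400K files across 40 directories.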