From: Ken Marx
Date: Thu, 23 Oct 2003 16:53:39 -0700
To: Kirk McKusick
cc: freebsd-fs@freebsd.org, cburrell@vicor.com, julian@vicor-nb.com,
    davep@vicor.com, VicPE@aol.com, jpl@vicor.com, gluk@ptci.ru,
    jrh@vicor.com, Julian Elischer
Subject: Re: 4.8 ffs_dirpref problem
Message-ID: <3F986A03.2050809@vicor.com>
In-Reply-To: <200310231946.h9NJkQeN007683@beastie.mckusick.com>

Ok, thanks, Kirk. Re newfs'ing and re-doing our test is on the todo list.
Probably an overnight thing.
Meanwhile we did a bit more digging and may have found an anomaly.

We escaped to ddb a few times while the performance was bad to see
what a typical stack looked like:

--- interrupt, eip = 0xc01d9af4, esp = 0xcfe24bf8, ebp = 0xcfe24c04 ---
gbincore(cf3c6d00,1d090040,cfe24ca8,401,0) at gbincore+0x34
getblk(cf3c6d00,1d090040,1000,0,0) at getblk+0x80
bread(cf3c6d00,1d090040,1000,0,cfe24ca8) at bread+0x27
ffs_alloccg(c21eaf00,1d09,0,800) at ffs_alloccg+0x70
ffs_hashalloc(c21eaf00,1908,6420008,800,c026f110) at ffs_hashalloc+0x8c
ffs_alloc(c21eaf00,0,6420008,800,c1f93080) at ffs_alloc+0xc9
ffs_balloc(cfe24e2c,cfc9da40,c203bd80,20001,cfccfde0) at ffs_balloc+0x46a
ffs_write(cfe24e64,c203bd80,cf9934e0,41b,c03695a0) at ffs_write+0x319
vn_write(c203bd80,cfe24ed4,c1f93080,0,cf9934e0) at vn_write+0x15e
dofilewrite(cf9934e0,c203bd80,4,809d200,41b) at dofilewrite+0xc1
write(cf9934e0,cfe24f80,41b,809d200,0) at write+0x3b
---------------

So, the alloccg logic needs to get the cg block. It goes through getblk,
which in turn checks whether the block is already in an in-memory hash
table via the lookup routine, gbincore.

Julian had the thought that perhaps there was something funny about
this hash table, possibly with respect to cg blocks. So we hacked in a
few routines to histogram how often each bucket was searched, and the
'average depth' of the bucket. (This crude average is the running sum
of depths found over all searches of the bucket, divided by the number
of times the bucket was searched.)

We found that block numbers really spike at bucket 250, and that the
avg-depth of that bucket is 10-100 times that of any other over the
total of 1023 buckets in the hash:

bh[247]: freq=1863, avgdepth = 1
bh[248]: freq=1860, avgdepth = 1
bh[249]: freq=1777, avgdepth = 1
bh[250]: freq=969100, avgdepth = 440
bh[251]: freq=1595, avgdepth = 12
bh[252]: freq=1437, avgdepth = 1

To verify that these were cg block lookups, we did a similar histogram
of hash indexes for the actual bread() calls in ffs_alloccg.
That is, the bucket that would be hashed for

	(ip->i_devvp, fsbtodb(fs, cgtod(fs, cg)))

We got similar, corroborating results:

bh[248]: freq=0
bh[249]: freq=0
bh[250]: freq=662387
bh[251]: freq=0
bh[252]: freq=40
bh[253]: freq=0

It appears that lookups for cg blocks (which are probably in memory
already) tend to be more costly than necessary(?).

So a better-tuned file system may well help. But is it also possible
that tuning wouldn't be needed if the hash table were more evenly
distributed?

We can dump the block list for the anomalous hash table bucket if you
wish, and/or provide any other info you'd like. Any suggestions are
welcome, for that matter.

Maybe we'll hack in a new hashing function just for kicks to see what
happens...

Thanks again for your time!
k

Kirk McKusick wrote:
> Date: Thu, 23 Oct 2003 11:08:02 -0700
> From: Ken Marx
> To: Julian Elischer
> CC: mckusick@mckusick.com, cburrell@vicor.com, davep@vicor.com,
>     freebsd-fs@freebsd.org, gluk@ptci.ru, jpl@vicor.com,
>     jrh@vicor.com, julian@vicor-nb.com, VicPE@aol.com
> Subject: Re: 4.8 ffs_dirpref problem
> X-ASK-Info: Confirmed by User
>
> Thanks for the reply,
>
> We actually *did* try -s 4096 yesterday (not quite what you
> suggested) with spotty results: Sometimes it seemed to go
> more quickly, but often not.
>
> Let me clarify our test: We have a 1.5gb tar file from our
> production raid that fairly represents the distribution of
> data. We hit the performance problem when we get to dirs
> with lots of small-ish files. But, as Julian mentioned,
> we typically have many flavors of file sizes and populations.
>
> Admittedly, our untar'ing test isn't necessarily representative
> of what happens in production - we were just trying to fill
> the disk and recreate the problem here. We *did* at least
> hit a noticeable problem, and we believe it's the same
> behavior that's hitting production.
>
> I just tried your exact suggested settings on an fs that
> was already 96% full, and still experienced the very sluggish
> behavior on exactly the same type of files/dirs.
>
> Our untar typically takes around 60-100 sec of system time
> when things are going ok; 300-1000+ sec when the sluggishness
> occurs. This time tends to increase as we get closer to
> 99%. Sometimes as high as 4000+ secs.
>
> I wasn't clear from your mail if I should newfs the entire
> fs and start over, or if I could have expected the settings
> to make a difference for any NEW data.
>
> I can do this latter if you think it's required. The test
> will then take several hours to run since we need at least
> 85% disk usage to start seeing the problem.
>
> Thanks!
> k
>
> Unfortunately, I do believe that you will need to start over from
> scratch with a newfs. The problem is that by the time you are at
> 85% full with the old parameters, the directory structure is already
> too "dense" forcing you to search far and wide for more inodes. If
> you start from the beginning with a large filesperdir then your
> directory structure will expand across more of the disk which
> should approximate the old algorithm.
>
> Kirk McKusick
>

--
Ken Marx, kmarx@vicor-nb.com
It's an orthogonal issue to leverage our critical resources and focus hard
to resolve the market forces.
- http://www.bigshed.com/cgi-bin/speak.cgi
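
The per-bucket histogram described in the message above (search count
plus a crude average chain depth) can be sketched in user-space C as
follows. This is an illustration under stated assumptions, not the
routines that were actually hacked into the kernel. Everything here is
hypothetical: the names (bh, hash_index, note_insert, note_search,
simulate), the power-of-two table size, and in particular the additive
hash, which is not claimed to be the 4.8 buffer hash. The cg block
layout (a stride of 0x10000 plus an offset of 0x40) is only inferred
from the single trace pair cg 0x1d09 / block 0x1d090040.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS  1024u     /* assumed power-of-two table size */
#define CG_STRIDE 0x10000u  /* inferred from cg 0x1d09 -> block 0x1d090040 */
#define CG_OFFSET 0x40u     /* inferred from the same trace line */

struct bucket_stats {
	unsigned long chain_len;  /* buffers currently chained in this bucket */
	unsigned long searches;   /* times this bucket was searched */
	unsigned long depth_sum;  /* running sum of chain lengths seen per search */
};

static struct bucket_stats bh[NBUCKETS];

/* Hypothetical additive hash; NOT claimed to be the kernel's function. */
static unsigned
hash_index(uintptr_t vp, uint32_t blkno)
{
	return (unsigned)(((vp >> 7) + blkno) & (NBUCKETS - 1));
}

/* Account for a buffer being added to its bucket's chain. */
static void
note_insert(uintptr_t vp, uint32_t blkno)
{
	bh[hash_index(vp, blkno)].chain_len++;
}

/* Account for one lookup, charging the full chain length as its depth. */
static void
note_search(uintptr_t vp, uint32_t blkno)
{
	struct bucket_stats *b = &bh[hash_index(vp, blkno)];

	b->searches++;
	b->depth_sum += b->chain_len;
}

/* Crude average: sum of depths over all searches / number of searches. */
static unsigned long
avg_depth(unsigned idx)
{
	assert(idx < NBUCKETS);
	return bh[idx].searches ? bh[idx].depth_sum / bh[idx].searches : 0;
}

/* Block address of a cylinder group's cg block under the inferred layout. */
static uint32_t
cg_block(uint32_t cg)
{
	return cg * CG_STRIDE + CG_OFFSET;
}

/*
 * Fill the table with ncg cg blocks and ndata consecutive data blocks,
 * look each cg block up once, and return the most-searched bucket.
 */
static unsigned
simulate(uintptr_t vp, uint32_t ncg, uint32_t ndata)
{
	unsigned hot = 0;
	uint32_t i;

	for (i = 0; i < NBUCKETS; i++)
		bh[i] = (struct bucket_stats){0, 0, 0};
	for (i = 0; i < ncg; i++)
		note_insert(vp, cg_block(i));
	for (i = 0; i < ndata; i++)
		note_insert(vp, i);   /* hypothetical consecutive data blocks */
	for (i = 0; i < ncg; i++)
		note_search(vp, cg_block(i));
	for (i = 1; i < NBUCKETS; i++)
		if (bh[i].searches > bh[hot].searches)
			hot = i;
	return hot;
}

/* Print buckets from..to in the same shape as the histogram in the message. */
static void
dump(unsigned from, unsigned to)
{
	for (unsigned i = from; i <= to && i < NBUCKETS; i++)
		printf("bh[%u]: freq=%lu, avgdepth = %lu\n",
		    i, bh[i].searches, avg_depth(i));
}
```

Calling simulate() with the vnode address from the trace, 500 cylinder
groups, and 2048 data blocks, then dump() around the returned index,
shows every cg lookup landing in one bucket. Under this assumed hash,
a stride that is a multiple of the table size contributes nothing to
the index, so all cg blocks on a device collide. The bucket number
depends on the assumed hash and table size, so it need not match the
250 reported above.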