Date:      Thu, 30 Oct 2003 11:07:20 -0800
From:      Ken Marx <kmarx@vicor.com>
To:        Don Lewis <truckman@FreeBSD.org>
Cc:        mckusick@beastie.mckusick.com
Subject:   Re: 4.8 ffs_dirpref problem
Message-ID:  <3FA16168.2010209@vicor.com>
In-Reply-To: <200310300641.h9U6fWeF031328@gw.catspoiler.org>
References:  <200310300641.h9U6fWeF031328@gw.catspoiler.org>



Don Lewis wrote:
> On 29 Oct, Ken Marx wrote:
> 
>>Don Lewis wrote:
> 
> 
>>>I think the real problem is the following code in ffs_dirpref():
>>>
>>>        avgifree = fs->fs_cstotal.cs_nifree / fs->fs_ncg;
>>>        avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
>>>        avgndir = fs->fs_cstotal.cs_ndir / fs->fs_ncg;
>>>[snip]
>>>        maxndir = min(avgndir + fs->fs_ipg / 16, fs->fs_ipg);
>>>        minifree = avgifree - fs->fs_ipg / 4;
>>>        if (minifree < 0)
>>>                minifree = 0;
>>>        minbfree = avgbfree - fs->fs_fpg / fs->fs_frag / 4;
>>>        if (minbfree < 0)
>>>                minbfree = 0;
>>>[snip]
>>>        prefcg = ino_to_cg(fs, pip->i_number);
>>>        for (cg = prefcg; cg < fs->fs_ncg; cg++)
>>>                if (fs->fs_cs(fs, cg).cs_ndir < maxndir &&
>>>                    fs->fs_cs(fs, cg).cs_nifree >= minifree &&
>>>                    fs->fs_cs(fs, cg).cs_nbfree >= minbfree) {
>>>                        if (fs->fs_contigdirs[cg] < maxcontigdirs)
>>>                                return ((ino_t)(fs->fs_ipg * cg));
>>>                }
>>>        for (cg = 0; cg < prefcg; cg++)
>>>                if (fs->fs_cs(fs, cg).cs_ndir < maxndir &&
>>>                    fs->fs_cs(fs, cg).cs_nifree >= minifree &&
>>>                    fs->fs_cs(fs, cg).cs_nbfree >= minbfree) {
>>>                        if (fs->fs_contigdirs[cg] < maxcontigdirs)
>>>                                return ((ino_t)(fs->fs_ipg * cg));
>>>                }
>>>
>>>If the file system is more than 75% full, minbfree will be zero, which
>>>will allow new directories to be created in cylinder groups that have no
>>>free blocks for either the directory itself, or for any files created in
>>>that directory.  If this happens, allocating the blocks for the
>>>directory and its files will require ffs_alloc() to do an expensive
>>>search across the cylinder groups for each block.  It looks to me like
>>>minbfree needs to equal, or at least be a lot closer to, avgbfree.
> 
> 
> Actually, I think the expensive search will only happen for the first
> block in each file (and the other blocks will be allocated in the same
> cylinder group), but if you are creating tons of files that are only one
> block long ...
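
(To make the 75% point concrete - a back-of-the-envelope sketch with
made-up per-cg numbers, not values measured on our filesystem:)

#include <stdio.h>

/*
 * Say a cg has fs_fpg = 16384 fragments at fs_frag = 8 frags/block,
 * i.e. 2048 full blocks.  At ~97% full the average cg has only about
 * 61 free blocks left.  (Both numbers are assumptions.)
 */
int
main(void)
{
	int fpg = 16384, frag = 8;	/* assumed fs_fpg, fs_frag */
	int avgbfree = 61;		/* ~3% of 2048 blocks free */
	int minbfree;

	minbfree = avgbfree - fpg / frag / 4;	/* 61 - 512 = -451 */
	if (minbfree < 0)
		minbfree = 0;
	/* prints 0, so even a cg with zero free blocks passes the
	   cs_nbfree >= minbfree test */
	printf("minbfree = %d\n", minbfree);
	return (0);
}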
> 
> 
>>>A similar situation exists with minifree.  Please note that the fallback
>>>algorithm uses the condition:
>>>	fs->fs_cs(fs, cg).cs_nifree >= avgifree
>>>
>>>
>>>
>>
>>Interesting. We (Vicor) will defer to experts here, but are very willing to
>>test anything you come up with.
> 
> 
> You might try the lightly tested patch below.  It tweaks the dirpref
> algorithm so that cylinder groups with free space >= 75% of the average
> free space and free inodes >= 75% of the average number of free inodes
> are candidates for allocating the directory.  It will not choose a
> cylinder group that does not have at least one free block and one free
> inode.
> 
> It also decreases maxcontigdirs as the free space decreases so that a
> cluster of directories is less likely to cause the cylinder group to
> overflow.  I think it would be better to tune maxcontigdirs individually
> for each cylinder group, based on the free space in that cylinder group,
> but that is more complex ...
> 
> Index: sys/ufs/ffs/ffs_alloc.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_alloc.c,v
> retrieving revision 1.64.2.2
> diff -u -r1.64.2.2 ffs_alloc.c
> --- sys/ufs/ffs/ffs_alloc.c	21 Sep 2001 19:15:21 -0000	1.64.2.2
> +++ sys/ufs/ffs/ffs_alloc.c	30 Oct 2003 06:01:38 -0000
> @@ -696,18 +696,18 @@
>  	 * optimal allocation of a directory inode.
>  	 */
>  	maxndir = min(avgndir + fs->fs_ipg / 16, fs->fs_ipg);
> -	minifree = avgifree - fs->fs_ipg / 4;
> -	if (minifree < 0)
> -		minifree = 0;
> -	minbfree = avgbfree - fs->fs_fpg / fs->fs_frag / 4;
> -	if (minbfree < 0)
> -		minbfree = 0;
> +	minifree = avgifree - avgifree / 4;
> +	if (minifree < 1)
> +		minifree = 1;
> +	minbfree = avgbfree - avgbfree / 4;
> +	if (minbfree < 1)
> +		minbfree = 1;
>  	cgsize = fs->fs_fsize * fs->fs_fpg;
>  	dirsize = fs->fs_avgfilesize * fs->fs_avgfpdir;
>  	curdirsize = avgndir ? (cgsize - avgbfree * fs->fs_bsize) / avgndir : 0;
>  	if (dirsize < curdirsize)
>  		dirsize = curdirsize;
> -	maxcontigdirs = min(cgsize / dirsize, 255);
> +	maxcontigdirs = min((avgbfree * fs->fs_bsize) / dirsize, 255);
>  	if (fs->fs_avgfpdir > 0)
>  		maxcontigdirs = min(maxcontigdirs,
>  				    fs->fs_ipg / fs->fs_avgfpdir);
> 
> 

Thanks Don,

re:
...
> cylinder group), but if you are creating tons of files that are only one
> block long ...

Not terribly scientific, but when our test bogs down, it's often
in a directory with 6400 1-block files. So, your comment seems plausible.

Anyway - I just tested your patch. Again: unloaded system, repeatedly
untarring a 1.5gb file, starting at 97% capacity, with:
	
	tunefs: average file size: (-f)                            49152
	tunefs: average number of files in a directory: (-s)       1500
	...
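
For what it's worth, plugging those tunefs numbers into your patched
maxcontigdirs computation gives the following - a sketch only; the 16k
block size and the ~61 free blocks per average cg are guesses on our
part, and it ignores the curdirsize and ipg/avgfpdir adjustments:

#include <stdio.h>

int
main(void)
{
	long avgfilesize = 49152, avgfpdir = 1500;	/* tunefs -f, -s  */
	long bsize = 16384, avgbfree = 61;		/* assumed values */
	long dirsize, maxcontigdirs;

	dirsize = avgfilesize * avgfpdir;	/* 73728000, ~70MB expected
						   per directory cluster  */
	maxcontigdirs = (avgbfree * bsize) / dirsize;	/* ~1MB free -> 0 */
	if (maxcontigdirs > 255)
		maxcontigdirs = 255;
	printf("dirsize=%ld maxcontigdirs=%ld\n", dirsize, maxcontigdirs);
	return (0);
}

If we're reading that right, at this fill level the contigdirs check
basically never passes and dirpref falls through to the
cs_nifree >= avgifree backstop.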

Takes about 74 system secs per 1.5gb untar:
-------------------------------------------
/dev/da0s1e 558889580 497843972 16334442    97% 6858407 63316311   10%   /raid
      119.23 real         1.28 user        73.09 sys
/dev/da0s1e 558889580 499371100 14807314    97% 6879445 63295273   10%   /raid
      111.69 real         1.32 user        73.65 sys
/dev/da0s1e 558889580 500898228 13280186    97% 6900483 63274235   10%   /raid
      116.67 real         1.44 user        74.19 sys
/dev/da0s1e 558889580 502425356 11753058    98% 6921521 63253197   10%   /raid
      114.73 real         1.25 user        75.01 sys
/dev/da0s1e 558889580 503952484 10225930    98% 6942559 63232159   10%   /raid
      116.95 real         1.30 user        74.10 sys
/dev/da0s1e 558889580 505479614 8698800    98% 6963597 63211121   10%   /raid
      115.29 real         1.39 user        74.25 sys
/dev/da0s1e 558889580 507006742 7171672    99% 6984635 63190083   10%   /raid
      114.01 real         1.16 user        74.04 sys
/dev/da0s1e 558889580 508533870 5644544    99% 7005673 63169045   10%   /raid
      119.95 real         1.32 user        75.05 sys
/dev/da0s1e 558889580 510060998 4117416    99% 7026711 63148007   10%   /raid
      114.89 real         1.33 user        74.66 sys
/dev/da0s1e 558889580 511588126 2590288    99% 7047749 63126969   10%   /raid
      114.91 real         1.58 user        74.64 sys
/dev/da0s1e 558889580 513115254 1063160   100% 7068787 63105931   10%   /raid
tot:     1161.06 real        13.45 user       742.89 sys

Compares pretty favorably to our naive, retro 4.4 dirpref hack
(sketched below, after these timings), which averages in the
mid-to-high 60s of system seconds:
--------------------------------------------------------------------
/dev/da0s1e 558889580 497843952 16334462    97% 6858406 63316312   10%   /raid
      110.19 real         1.42 user        65.54 sys
/dev/da0s1e 558889580 499371080 14807334    97% 6879444 63295274   10%   /raid
      105.47 real         1.47 user        65.09 sys
/dev/da0s1e 558889580 500898208 13280206    97% 6900482 63274236   10%   /raid
      110.17 real         1.48 user        64.98 sys
/dev/da0s1e 558889580 502425336 11753078    98% 6921520 63253198   10%   /raid
      131.88 real         1.49 user        71.20 sys
/dev/da0s1e 558889580 503952464 10225950    98% 6942558 63232160   10%   /raid
      111.61 real         1.62 user        67.47 sys
/dev/da0s1e 558889580 505479594 8698820    98% 6963596 63211122   10%   /raid
      131.36 real         1.67 user        90.79 sys
/dev/da0s1e 558889580 507006722 7171692    99% 6984634 63190084   10%   /raid
      115.34 real         1.49 user        65.61 sys
/dev/da0s1e 558889580 508533850 5644564    99% 7005672 63169046   10%   /raid
      110.26 real         1.39 user        65.26 sys
/dev/da0s1e 558889580 510060978 4117436    99% 7026710 63148008   10%   /raid
      116.15 real         1.51 user        65.47 sys
/dev/da0s1e 558889580 511588106 2590308    99% 7047748 63126970   10%   /raid
      112.74 real         1.37 user        65.01 sys
/dev/da0s1e 558889580 513115234 1063180   100% 7068786 63105932   10%   /raid
     1158.36 real        15.01 user       686.57 sys
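
(For reference, the "retro 4.4 dirpref hack" above is essentially the
old allocator policy - pick the cg with the fewest directories among
those with at least the average number of free inodes.  Roughly, from
memory, uncompiled, and with a made-up function name:)

static ino_t
ffs_dirpref_44(struct fs *fs)
{
	int cg, mincg, minndir, avgifree;

	avgifree = fs->fs_cstotal.cs_nifree / fs->fs_ncg;
	mincg = 0;
	minndir = fs->fs_ipg;
	for (cg = 0; cg < fs->fs_ncg; cg++)
		if (fs->fs_cs(fs, cg).cs_ndir < minndir &&
		    fs->fs_cs(fs, cg).cs_nifree >= avgifree) {
			mincg = cg;
			minndir = fs->fs_cs(fs, cg).cs_ndir;
		}
	return ((ino_t)(fs->fs_ipg * mincg));
}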

Without either, we'd expect timings of 5-20 minutes when things are
going poorly.

Happy to test further if you have tweaks to your patch or
things you'd like us to test in particular. E.g., load,
newfs, etc.
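
And if the per-cylinder-group maxcontigdirs tuning you mentioned ever
becomes worth trying, here's roughly how we read the idea - just a
sketch along the lines of your patch, not compiled against the kernel,
helper name made up:

/*
 * Sketch only: compute maxcontigdirs from a candidate cg's own free
 * block count instead of the filesystem-wide average.  The two scan
 * loops in ffs_dirpref() would then compare fs->fs_contigdirs[cg]
 * against this per-cg value.
 */
static int
maxcontigdirs_for_cg(struct fs *fs, int cg, int dirsize)
{
	int maxcontigdirs;

	maxcontigdirs = min((fs->fs_cs(fs, cg).cs_nbfree * fs->fs_bsize) /
	    dirsize, 255);
	if (fs->fs_avgfpdir > 0)
		maxcontigdirs = min(maxcontigdirs,
		    fs->fs_ipg / fs->fs_avgfpdir);
	return (maxcontigdirs);
}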

k.
-- 
Ken Marx, kmarx@vicor-nb.com
As a company we must not put the cart before the horse and set up weekly 
meetings on the solution space.
		- http://www.bigshed.com/cgi-bin/speak.cgi


