Date:      Mon, 17 Nov 2003 13:27:29 -0800 (PST)
From:      Don Lewis <truckman@FreeBSD.org>
To:        kmarx@vicor.com
Cc:        mckusick@beastie.mckusick.com
Subject:   Re: 4.8 ffs_dirpref problem
Message-ID:  <200311172127.hAHLRTeF088888@gw.catspoiler.org>
In-Reply-To: <200311170331.hAH3VleF086693@gw.catspoiler.org>

On 16 Nov, Don Lewis wrote:
> On 16 Nov, Don Lewis wrote:
> 
>>> I'm somewhat tempted to change the calculation to:
>>> 	min(avgbfree, max(1, (avgbfree - avgbfree/4), (dirsize/fs->fs_bsize)))
>>> where the last term works out to 4500 with your tunefs parameters.
>> 
>> I tried a variation of this on my -CURRENT box and it benchmarked
>> consistently worse.  I've got a "spare" 10 GB partition to which I first
>> copied my /usr/ports/packages, and then filled by repeatedly tarring
>> my /usr/ports tree over to it.  The partition was 100% full, including
>> the reserve space, after four iterations.
> 
> I just looked again, and it is more than 100% full, but only slightly
> into the reserve space.
> 
>> With minbfree set to max((avgbfree - avgbfree/4), 1) here are two
>> iterations (the fifth line of timing data is for the 'rm -rf' command):
>> 
>>      1310.47 real         5.48 user       141.90 sys
>>      1336.78 real         5.62 user       152.27 sys
>>      1368.84 real         6.02 user       151.75 sys
>>      1359.70 real         5.55 user       154.01 sys
>>       423.44 real         2.25 user       107.26 sys
>> 
>>      1300.56 real         5.65 user       148.82 sys
>>      1372.20 real         5.79 user       152.25 sys
>>      1359.01 real         6.03 user       152.63 sys
>>      1380.90 real         5.31 user       153.71 sys
>>       437.22 real         2.20 user       105.61 sys
>> 
>> With minbfree set to
>>  max(min(max(avgbfree - avgbfree / 4, dirsize / fs->fs_bsize),
>>          avgbfree), 1)
>> I get the following:
>> 
>>      1314.61 real         5.66 user       175.43 sys
>>      1350.40 real         6.12 user       179.15 sys
>>      1386.86 real         6.32 user       179.12 sys
>>      1418.60 real         5.74 user       181.64 sys
>>       508.67 real         2.67 user       119.66 sys
>> 
>>      1361.19 real         5.97 user       176.94 sys
>>      1327.63 real         5.72 user       179.60 sys
>>      1376.16 real         6.33 user       179.72 sys
>>      1356.47 real         6.07 user       180.24 sys
>>       462.67 real         2.30 user       119.18 sys
>> 
>> I'm using the newfs defaults, but dirsize is recalculated as the
>> filesystem fills if the appropriate value is larger than what is
>> calculated from the parameters set by newfs.
> 
> I filled up the file system again with the 
>   minbfree = max((avgbfree - avgbfree/4), 1)
> version of the code.
> 
> Based on the output of df and dumpfs, I calculate:
> 	avgfilesize = 18K
> 	curdirsize = 83K
> 	avgbfree = 864
> 	avgifree = 14631
> 
> What surprises me is the poor distribution of free space across the
> cylinder groups in the file system.  I now suspect the culprit is
> minifree.  The current code calculates minifree as 75% of avgifree, or
> about 10973.  There are some cylinder groups that are less than half
> full (capacity is 11761 blocks/group) in this filesystem, but their free
> inode counts are near the 10K minifree limit.  It looks like the free
> inode count should be de-emphasized if the filesystem will run out of
> blocks before it runs out of inodes, and vice-versa if inodes are likely
> to be exhausted first.  I now suspect that the other version of the
> minbfree code was more likely to bail out because it could not find any
> cylinder groups that met both selection criteria and used the fallback
> code, which probably selected the cylinder groups that were already full
> but had a large number of free inodes.  Something to ponder ...

I ran another test with minifree set to a small value, which effectively
removed it from the cylinder group selection criteria.  I used
  max(min(max(avgbfree - avgbfree / 4, dirsize / fs->fs_bsize),
          avgbfree), 1)
for minbfree.  The results were similar to the previous
  max((avgbfree - avgbfree/4), 1)
tests.

     1337.34 real         5.69 user       150.63 sys
     1323.58 real         5.87 user       157.96 sys
     1347.14 real         5.52 user       159.77 sys
     1361.57 real         5.37 user       160.50 sys
      419.49 real         2.52 user       114.75 sys

     1344.53 real         5.47 user       157.03 sys
     1326.97 real         4.77 user       151.57 sys
     1322.67 real         4.69 user       153.00 sys
     1367.49 real         5.91 user       160.45 sys
      409.95 real         2.59 user       114.20 sys

     1330.93 real         5.37 user       156.93 sys
     1374.03 real         5.59 user       159.14 sys
     1367.17 real         5.41 user       160.84 sys
     1318.14 real         5.50 user       159.75 sys
      411.94 real         2.22 user       114.86 sys
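
For reference, the two minbfree expressions compared in these tests can be
written as small C helpers.  This is only a sketch, not the kernel code:
the function names (max64, min64, minbfree_simple, minbfree_dirsize) and
the use of int64_t are assumptions made for illustration; only the
expressions themselves come from this thread.

```c
#include <stdint.h>

/* Hypothetical helpers; the kernel has its own min/max routines. */
static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }
static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }

/* Simple variant: max((avgbfree - avgbfree/4), 1) */
int64_t minbfree_simple(int64_t avgbfree)
{
	return max64(avgbfree - avgbfree / 4, 1);
}

/*
 * dirsize-aware variant:
 *   max(min(max(avgbfree - avgbfree / 4, dirsize / bsize), avgbfree), 1)
 * bsize stands in for fs->fs_bsize.
 */
int64_t minbfree_dirsize(int64_t avgbfree, int64_t dirsize, int64_t bsize)
{
	int64_t want = max64(avgbfree - avgbfree / 4, dirsize / bsize);

	return max64(min64(want, avgbfree), 1);
}
```

With avgbfree = 864, both helpers return 648 unless dirsize / bsize exceeds
648, in which case the second one is clamped to at most avgbfree.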

I took a snapshot of the cylinder group state at about 75% full as well
as at 100%.  Even at 75%, there are a number of cylinder groups that are
totally full.  I think that one of the problems is that the dirpref
allocator lingers too long on a given cylinder group.  It should
probably move to a new cylinder group before the old one is totally
full, somewhere around the minfree reserve level.  Also, as the file
system fills and a large number of the cylinder groups are totally
filled, the average free space per cylinder group will be quite small,
so the dirpref code will consider cylinder groups with only a small
amount of free space as candidates even though there may be other
cylinder groups that are nearly empty that would be better choices.
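
To illustrate how the threshold shrinks as the file system fills, here is a
rough sketch using the dumpfs totals from this message (191340 free blocks
at 75%, 51875 at 100%, 60 cylinder groups listed in each snapshot).  The
helper names are hypothetical, and the ">=" comparison is an assumption
about how a group is tested against the threshold.

```c
#include <stdint.h>

/* Average free blocks per cylinder group (integer division). */
int64_t avg_per_group(int64_t total_nbfree, int64_t ncg)
{
	return ncg > 0 ? total_nbfree / ncg : 0;
}

/* 75% of the average, but never less than 1. */
int64_t bfree_threshold(int64_t avgbfree)
{
	int64_t t = avgbfree - avgbfree / 4;

	return t > 1 ? t : 1;
}

/* Assumed candidate test: enough free blocks relative to the threshold. */
int passes_bfree(int64_t cg_nbfree, int64_t threshold)
{
	return cg_nbfree >= threshold;
}
```

At 75% full the average is 3189 blocks per group and the threshold is 2392,
so a group with 805 free blocks is rejected.  At 100% the average drops to
864 and the threshold to 648, so the same group becomes a candidate.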

Snapshot at about 75% full:
dumpfs /dev/da0s2a | grep nbfree
nbfree  191340  ndir    94629   nifree  994237  nffree  1232
cs[].cs_(nbfree,ndir,nifree,nffree):
nbfree  7256    ndir    1976    nifree  14679   nffree  5
nbfree  7592    ndir    1976    nifree  14853   nffree  7
nbfree  35      ndir    663     nifree  20677   nffree  32
nbfree  5992    ndir    35      nifree  23096   nffree  3
nbfree  0       ndir    2965    nifree  10371   nffree  29
nbfree  0       ndir    2465    nifree  12592   nffree  83
nbfree  38      ndir    2463    nifree  12630   nffree  39
nbfree  115     ndir    2461    nifree  12736   nffree  44
nbfree  45      ndir    2462    nifree  12440   nffree  31
nbfree  16      ndir    2461    nifree  12778   nffree  36
nbfree  644     ndir    408     nifree  21729   nffree  56
nbfree  65      ndir    2966    nifree  10759   nffree  58
nbfree  2516    ndir    2462    nifree  12452   nffree  1
nbfree  2859    ndir    2964    nifree  10626   nffree  7
nbfree  723     ndir    2964    nifree  10517   nffree  18
nbfree  2678    ndir    2967    nifree  10184   nffree  24
nbfree  4279    ndir    2983    nifree  10730   nffree  0
nbfree  0       ndir    2982    nifree  10215   nffree  40
nbfree  0       ndir    549     nifree  20947   nffree  44
nbfree  0       ndir    0       nifree  23552   nffree  10
nbfree  0       ndir    724     nifree  20416   nffree  16
nbfree  38      ndir    0       nifree  23552   nffree  67
nbfree  0       ndir    1200    nifree  17872   nffree  12
nbfree  0       ndir    2963    nifree  10769   nffree  7
nbfree  0       ndir    2963    nifree  10506   nffree  17
nbfree  0       ndir    0       nifree  23552   nffree  17
nbfree  0       ndir    2963    nifree  10765   nffree  4
nbfree  2       ndir    2963    nifree  10240   nffree  18
nbfree  4266    ndir    2983    nifree  10137   nffree  1
nbfree  9442    ndir    2982    nifree  10321   nffree  0
nbfree  9415    ndir    2963    nifree  10476   nffree  4
nbfree  10594   ndir    1194    nifree  18382   nffree  4
nbfree  2       ndir    0       nifree  23552   nffree  39
nbfree  8212    ndir    3050    nifree  10268   nffree  1
nbfree  10508   ndir    1288    nifree  17943   nffree  6
nbfree  1       ndir    0       nifree  23552   nffree  4
nbfree  11381   ndir    0       nifree  23552   nffree  0
nbfree  11391   ndir    0       nifree  23552   nffree  0
nbfree  0       ndir    2       nifree  23321   nffree  51
nbfree  0       ndir    0       nifree  23552   nffree  18
nbfree  7902    ndir    40      nifree  22960   nffree  3
nbfree  91      ndir    0       nifree  23552   nffree  46
nbfree  7862    ndir    0       nifree  23552   nffree  0
nbfree  8433    ndir    0       nifree  23552   nffree  0
nbfree  9341    ndir    0       nifree  23552   nffree  0
nbfree  5       ndir    0       nifree  23552   nffree  17
nbfree  8880    ndir    0       nifree  23552   nffree  0
nbfree  11      ndir    1958    nifree  14708   nffree  58
nbfree  12      ndir    1962    nifree  15043   nffree  54
nbfree  2151    ndir    1957    nifree  14900   nffree  20
nbfree  40      ndir    1958    nifree  15136   nffree  29
nbfree  5764    ndir    1957    nifree  14470   nffree  31
nbfree  6517    ndir    1959    nifree  15192   nffree  1
nbfree  8163    ndir    1976    nifree  14941   nffree  6
nbfree  4107    ndir    1956    nifree  15229   nffree  8
nbfree  3       ndir    1975    nifree  14289   nffree  37
nbfree  0       ndir    1974    nifree  15026   nffree  18
nbfree  6475    ndir    1976    nifree  14747   nffree  7
nbfree  0       ndir    1974    nifree  14882   nffree  43
nbfree  5200    ndir    1975    nifree  14912   nffree  1

Snapshot at 100% full:
dumpfs /dev/da0s2a | grep nbfree
nbfree  51875   ndir    120875  nifree  877882  nffree  1443
cs[].cs_(nbfree,ndir,nifree,nffree):
nbfree  3167    ndir    2963    nifree  10330   nffree  6
nbfree  3583    ndir    2982    nifree  10562   nffree  4
nbfree  52      ndir    663     nifree  20677   nffree  39
nbfree  4265    ndir    2982    nifree  10131   nffree  0
nbfree  4185    ndir    2982    nifree  10340   nffree  7
nbfree  9       ndir    2465    nifree  12592   nffree  60
nbfree  2       ndir    2463    nifree  12630   nffree  34
nbfree  1642    ndir    2461    nifree  12736   nffree  19
nbfree  38      ndir    2462    nifree  12440   nffree  31
nbfree  3008    ndir    2461    nifree  12778   nffree  36
nbfree  0       ndir    633     nifree  20564   nffree  42
nbfree  0       ndir    2963    nifree  10778   nffree  22
nbfree  0       ndir    2460    nifree  12459   nffree  12
nbfree  0       ndir    2963    nifree  10667   nffree  7
nbfree  0       ndir    2963    nifree  10491   nffree  3
nbfree  51      ndir    2963    nifree  10626   nffree  35
nbfree  0       ndir    2963    nifree  10547   nffree  18
nbfree  2       ndir    2963    nifree  10673   nffree  38
nbfree  0       ndir    549     nifree  20947   nffree  40
nbfree  0       ndir    0       nifree  23552   nffree  11
nbfree  3       ndir    0       nifree  23552   nffree  0
nbfree  87      ndir    0       nifree  23552   nffree  51
nbfree  0       ndir    1319    nifree  17311   nffree  5
nbfree  30      ndir    2963    nifree  10498   nffree  17
nbfree  4586    ndir    2983    nifree  10062   nffree  2
nbfree  0       ndir    0       nifree  23552   nffree  19
nbfree  9401    ndir    388     nifree  21774   nffree  5
nbfree  2       ndir    3473    nifree  8167    nffree  113
nbfree  103     ndir    3470    nifree  8345    nffree  28
nbfree  395     ndir    3471    nifree  7913    nffree  64
nbfree  1       ndir    3467    nifree  8476    nffree  5
nbfree  1690    ndir    3486    nifree  8049    nffree  7
nbfree  5065    ndir    3486    nifree  8302    nffree  2
nbfree  5762    ndir    3485    nifree  8214    nffree  4
nbfree  5       ndir    3472    nifree  8363    nffree  9
nbfree  0       ndir    2356    nifree  13130   nffree  33
nbfree  0       ndir    0       nifree  23552   nffree  6
nbfree  0       ndir    0       nifree  23552   nffree  11
nbfree  0       ndir    2       nifree  23321   nffree  51
nbfree  0       ndir    0       nifree  23552   nffree  18
nbfree  0       ndir    40      nifree  22960   nffree  6
nbfree  6       ndir    0       nifree  23552   nffree  48
nbfree  0       ndir    0       nifree  23552   nffree  51
nbfree  506     ndir    0       nifree  23552   nffree  22
nbfree  0       ndir    2965    nifree  10371   nffree  52
nbfree  0       ndir    0       nifree  23552   nffree  17
nbfree  139     ndir    2969    nifree  10603   nffree  63
nbfree  0       ndir    1958    nifree  14708   nffree  43
nbfree  37      ndir    1962    nifree  15043   nffree  57
nbfree  237     ndir    1957    nifree  14900   nffree  17
nbfree  0       ndir    1958    nifree  15136   nffree  21
nbfree  0       ndir    2964    nifree  10118   nffree  12
nbfree  805     ndir    3005    nifree  10331   nffree  6
nbfree  561     ndir    2964    nifree  10525   nffree  10
nbfree  5       ndir    2199    nifree  14133   nffree  19
nbfree  0       ndir    1975    nifree  14289   nffree  25
nbfree  2       ndir    1974    nifree  15026   nffree  11
nbfree  2437    ndir    2923    nifree  10441   nffree  5
nbfree  4       ndir    1974    nifree  14882   nffree  36
nbfree  2       ndir    2963    nifree  10451   nffree  8


I think it would work better if dirpref were converted to a two pass
algorithm.  The first pass would only consider those cylinder groups
that had more than minfree space.  If this first pass failed, the second
pass would look at all cylinder groups.
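
A minimal sketch of what such a two pass search might look like follows.
Everything here is an assumption for illustration: struct cg_summary is a
simplified stand-in for the per-group summary, reserve_blocks is one
possible reading of "more than minfree space" (the per-group share of the
minfree reserve), and both passes are assumed to apply the same minbfree
and minifree test.  It is not the existing dirpref code.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical per-group summary. */
struct cg_summary {
	int64_t nbfree;		/* free blocks */
	int64_t nifree;		/* free inodes */
};

/* Assumed selection test shared by both passes. */
static int cg_ok(const struct cg_summary *c, int64_t minbfree,
    int64_t minifree)
{
	return c->nbfree >= minbfree && c->nifree >= minifree;
}

/*
 * Pass 1 only accepts groups whose free blocks exceed reserve_blocks.
 * Pass 2 drops that restriction and looks at all groups.
 * Returns the chosen group index, or -1 if no group qualifies.
 */
int pick_cg_two_pass(const struct cg_summary *cs, size_t ncg, size_t start,
    int64_t reserve_blocks, int64_t minbfree, int64_t minifree)
{
	for (int pass = 1; pass <= 2; pass++) {
		for (size_t i = 0; i < ncg; i++) {
			size_t cg = (start + i) % ncg;

			if (pass == 1 && cs[cg].nbfree <= reserve_blocks)
				continue;
			if (cg_ok(&cs[cg], minbfree, minifree))
				return (int)cg;
		}
	}
	return -1;
}
```

For example, with a reserve of 940 blocks per group (roughly 8% of 11761,
assuming an 8% minfree) and minbfree = 648, a group with 805 free blocks is
skipped in the first pass whenever some other group has more than 940 free
blocks, and is only chosen if the first pass finds nothing.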

Another change that I suspect would help: rather than comparing cylinder
groups against minbfree and minifree directly, calculate how many
directories (each containing avgfilesperdir files of size avgfilesize)
each group could hold, and then calculate the average and minimum
threshold values of that count.
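
One way to sketch that per-group metric is shown below.  The function name
dirs_that_fit is hypothetical, and the details are assumptions: it counts
one extra inode for the directory itself, ignores fragments (nffree) and
the blocks used by the directory, and takes whichever of blocks or inodes
runs out first.

```c
#include <stdint.h>

/*
 * Estimate how many "average" directories a cylinder group could hold,
 * limited by whichever resource (blocks or inodes) is exhausted first.
 */
int64_t dirs_that_fit(int64_t nbfree, int64_t nifree, int64_t bsize,
    int64_t avgfilesize, int64_t avgfilesperdir)
{
	if (avgfilesize <= 0 || avgfilesperdir <= 0)
		return 0;

	int64_t dir_bytes = avgfilesize * avgfilesperdir;
	int64_t dir_inodes = avgfilesperdir + 1;	/* files + the directory */
	int64_t by_blocks = (nbfree * bsize) / dir_bytes;
	int64_t by_inodes = nifree / dir_inodes;

	return by_blocks < by_inodes ? by_blocks : by_inodes;
}
```

With this measure a group that has 23552 free inodes but no free blocks
scores zero, so a large free inode count alone no longer makes a full
group look attractive.  The 16K block size and 4 files per directory used
when trying this out are illustrative values, not measurements.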

It would be an interesting project to write a filesystem allocation
simulator to test different allocation algorithms without having to bang
on physical disks.


