Date: Mon, 17 Nov 2003 13:27:29 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: kmarx@vicor.com
Cc: mckusick@beastie.mckusick.com
Subject: Re: 4.8 ffs_dirpref problem
Message-ID: <200311172127.hAHLRTeF088888@gw.catspoiler.org>
In-Reply-To: <200311170331.hAH3VleF086693@gw.catspoiler.org>

On 16 Nov, Don Lewis wrote:
> On 16 Nov, Don Lewis wrote:
>
>>> I'm somewhat tempted to change the calculation to:
>>>   min(avgbfree, max(1, (avgbfree - avgbfree/4), (dirsize/fs->fs_bsize)))
>>> where the last term works out to 4500 with your tunefs parameters.
>>
>> I tried a variation of this on my -CURRENT box and it benchmarked
>> consistently worse.  I've got a "spare" 10 GB partition which I first
>> copied my /usr/ports/packages to, and then filled by repeatedly tarring
>> my /usr/ports tree over to it.  The partition was 100% full, including
>> the reserve space, after four iterations.
>
> I just looked again, and it is more than 100% full, but only slightly
> into the reserve space.
>
>> With minbfree set to max((avgbfree - avgbfree/4), 1) here are two
>> iterations (the fifth line of timing data is for the 'rm -rf' command):
>>
>>   1310.47 real   5.48 user   141.90 sys
>>   1336.78 real   5.62 user   152.27 sys
>>   1368.84 real   6.02 user   151.75 sys
>>   1359.70 real   5.55 user   154.01 sys
>>    423.44 real   2.25 user   107.26 sys
>>
>>   1300.56 real   5.65 user   148.82 sys
>>   1372.20 real   5.79 user   152.25 sys
>>   1359.01 real   6.03 user   152.63 sys
>>   1380.90 real   5.31 user   153.71 sys
>>    437.22 real   2.20 user   105.61 sys
>>
>> With minbfree set to
>>   max(min(max(avgbfree - avgbfree / 4, dirsize / fs->fs_bsize),
>>       avgbfree), 1)
>> I get the following:
>>
>>   1314.61 real   5.66 user   175.43 sys
>>   1350.40 real   6.12 user   179.15 sys
>>   1386.86 real   6.32 user   179.12 sys
>>   1418.60 real   5.74 user   181.64 sys
>>    508.67 real   2.67 user   119.66 sys
>>
>>   1361.19 real   5.97 user   176.94 sys
>>   1327.63 real   5.72 user   179.60 sys
>>   1376.16 real   6.33 user   179.72 sys
>>   1356.47 real   6.07 user   180.24 sys
>>    462.67 real   2.30 user   119.18 sys
>>
>> I'm using the newfs defaults, but dirsize is recalculated as the
>> filesystem fills if the appropriate value is larger than what is
>> calculated from the parameters set by newfs.
>
> I filled up the file system again with the
>   minbfree = max((avgbfree - avgbfree/4), 1)
> version of the code.
>
> Based on the output of df and dumpfs, I calculate:
>   avgfilesize = 18K
>   curdirsize = 83K
>   avgbfree = 864
>   avgifree = 14631
>
> What surprises me is the poor distribution of free space across the
> cylinder groups in the file system.  I now suspect the culprit is
> minifree.  The current code calculates minifree as 75% of avgifree, or
> about 10973.  There are some cylinder groups that are less than half
> full (capacity is 11761 blocks/group) in this filesystem, but their free
> inode counts are near the 10K minifree limit.  It looks like the free
> inode count should be de-emphasized if the filesystem will run out of
> blocks before it runs out of inodes, and vice-versa if inodes are likely
> to be exhausted first.  I now suspect that the other version of the
> minbfree code was more likely to bail out because it could not find any
> cylinder groups that met both selection criteria and used the fallback
> code, which probably selected the cylinder groups that were already full
> but had a large number of free inodes.

Something to ponder ...

I ran another test with minifree set to a small value, which effectively
removed it from the cylinder group selection criteria.  I used
  max(min(max(avgbfree - avgbfree / 4, dirsize / fs->fs_bsize),
      avgbfree), 1)
for minbfree.  The results were similar to the previous
max((avgbfree - avgbfree/4), 1) tests.

  1337.34 real   5.69 user   150.63 sys
  1323.58 real   5.87 user   157.96 sys
  1347.14 real   5.52 user   159.77 sys
  1361.57 real   5.37 user   160.50 sys
   419.49 real   2.52 user   114.75 sys

  1344.53 real   5.47 user   157.03 sys
  1326.97 real   4.77 user   151.57 sys
  1322.67 real   4.69 user   153.00 sys
  1367.49 real   5.91 user   160.45 sys
   409.95 real   2.59 user   114.20 sys

  1330.93 real   5.37 user   156.93 sys
  1374.03 real   5.59 user   159.14 sys
  1367.17 real   5.41 user   160.84 sys
  1318.14 real   5.50 user   159.75 sys
   411.94 real   2.22 user   114.86 sys

I took a snapshot of the cylinder group state at about 75% full as well
as at 100%.  Even at 75%, there are a number of cylinder groups that are
totally full.

I think that one of the problems is that the dirpref allocator lingers
too long on a given cylinder group.  It should probably move to a new
cylinder group before the old one is totally full, somewhere around the
minfree reserve level.  Also, as the file system fills and a large
number of the cylinder groups are totally filled, the average free space
per cylinder group will be quite small, so the dirpref code will
consider cylinder groups with only a small amount of free space as
candidates even though there may be other cylinder groups that are
nearly empty that would be better choices.
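
To make the threshold arithmetic above concrete, here is a minimal sketch of the two minbfree variants and the 75% rule used for minifree. It is not the kernel code: the function names are hypothetical, and the 16384-byte block size used in the usage note is an assumption, since the message only says the newfs defaults were used.

```c
#include <assert.h>
#include <stdint.h>

/* 75% of a per-group average, never below 1 (the rule quoted for minifree). */
int64_t
min_threshold(int64_t avg)
{
	assert(avg >= 0);
	int64_t v = avg - avg / 4;
	return (v < 1 ? 1 : v);
}

/* First variant: max(avgbfree - avgbfree/4, 1). */
int64_t
minbfree_simple(int64_t avgbfree)
{
	return (min_threshold(avgbfree));
}

/*
 * Second variant:
 *   max(min(max(avgbfree - avgbfree/4, dirsize/bsize), avgbfree), 1)
 * dirsize and bsize are in bytes.
 */
int64_t
minbfree_dirsize(int64_t avgbfree, int64_t dirsize, int64_t bsize)
{
	assert(avgbfree >= 0 && dirsize >= 0 && bsize > 0);
	int64_t v = avgbfree - avgbfree / 4;
	int64_t dirblocks = dirsize / bsize;

	if (v < dirblocks)
		v = dirblocks;		/* raise to the expected directory size */
	if (v > avgbfree)
		v = avgbfree;		/* but never above the average */
	return (v < 1 ? 1 : v);
}

/* Percentage of a group in use when it has exactly nbfree free blocks. */
int
pct_used(int64_t nbfree, int64_t blocks_per_group)
{
	assert(blocks_per_group > 0 && nbfree >= 0 && nbfree <= blocks_per_group);
	return ((int)((blocks_per_group - nbfree) * 100 / blocks_per_group));
}
```

With avgbfree = 864, minbfree_simple() gives 648, and pct_used(648, 11761) gives 94: a group that is about 94% full still meets the block threshold. If dirsize/bsize is 4500 blocks, as in the quoted tunefs case, minbfree_dirsize() is clamped to avgbfree, 864.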

75%

dumpfs /dev/da0s2a | grep nbfree
nbfree 191340 ndir 94629 nifree 994237 nffree 1232
cs[].cs_(nbfree,ndir,nifree,nffree):
nbfree 7256 ndir 1976 nifree 14679 nffree 5
nbfree 7592 ndir 1976 nifree 14853 nffree 7
nbfree 35 ndir 663 nifree 20677 nffree 32
nbfree 5992 ndir 35 nifree 23096 nffree 3
nbfree 0 ndir 2965 nifree 10371 nffree 29
nbfree 0 ndir 2465 nifree 12592 nffree 83
nbfree 38 ndir 2463 nifree 12630 nffree 39
nbfree 115 ndir 2461 nifree 12736 nffree 44
nbfree 45 ndir 2462 nifree 12440 nffree 31
nbfree 16 ndir 2461 nifree 12778 nffree 36
nbfree 644 ndir 408 nifree 21729 nffree 56
nbfree 65 ndir 2966 nifree 10759 nffree 58
nbfree 2516 ndir 2462 nifree 12452 nffree 1
nbfree 2859 ndir 2964 nifree 10626 nffree 7
nbfree 723 ndir 2964 nifree 10517 nffree 18
nbfree 2678 ndir 2967 nifree 10184 nffree 24
nbfree 4279 ndir 2983 nifree 10730 nffree 0
nbfree 0 ndir 2982 nifree 10215 nffree 40
nbfree 0 ndir 549 nifree 20947 nffree 44
nbfree 0 ndir 0 nifree 23552 nffree 10
nbfree 0 ndir 724 nifree 20416 nffree 16
nbfree 38 ndir 0 nifree 23552 nffree 67
nbfree 0 ndir 1200 nifree 17872 nffree 12
nbfree 0 ndir 2963 nifree 10769 nffree 7
nbfree 0 ndir 2963 nifree 10506 nffree 17
nbfree 0 ndir 0 nifree 23552 nffree 17
nbfree 0 ndir 2963 nifree 10765 nffree 4
nbfree 2 ndir 2963 nifree 10240 nffree 18
nbfree 4266 ndir 2983 nifree 10137 nffree 1
nbfree 9442 ndir 2982 nifree 10321 nffree 0
nbfree 9415 ndir 2963 nifree 10476 nffree 4
nbfree 10594 ndir 1194 nifree 18382 nffree 4
nbfree 2 ndir 0 nifree 23552 nffree 39
nbfree 8212 ndir 3050 nifree 10268 nffree 1
nbfree 10508 ndir 1288 nifree 17943 nffree 6
nbfree 1 ndir 0 nifree 23552 nffree 4
nbfree 11381 ndir 0 nifree 23552 nffree 0
nbfree 11391 ndir 0 nifree 23552 nffree 0
nbfree 0 ndir 2 nifree 23321 nffree 51
nbfree 0 ndir 0 nifree 23552 nffree 18
nbfree 7902 ndir 40 nifree 22960 nffree 3
nbfree 91 ndir 0 nifree 23552 nffree 46
nbfree 7862 ndir 0 nifree 23552 nffree 0
nbfree 8433 ndir 0 nifree 23552 nffree 0
nbfree 9341 ndir 0 nifree 23552 nffree 0
nbfree 5 ndir 0 nifree 23552 nffree 17
nbfree 8880 ndir 0 nifree 23552 nffree 0
nbfree 11 ndir 1958 nifree 14708 nffree 58
nbfree 12 ndir 1962 nifree 15043 nffree 54
nbfree 2151 ndir 1957 nifree 14900 nffree 20
nbfree 40 ndir 1958 nifree 15136 nffree 29
nbfree 5764 ndir 1957 nifree 14470 nffree 31
nbfree 6517 ndir 1959 nifree 15192 nffree 1
nbfree 8163 ndir 1976 nifree 14941 nffree 6
nbfree 4107 ndir 1956 nifree 15229 nffree 8
nbfree 3 ndir 1975 nifree 14289 nffree 37
nbfree 0 ndir 1974 nifree 15026 nffree 18
nbfree 6475 ndir 1976 nifree 14747 nffree 7
nbfree 0 ndir 1974 nifree 14882 nffree 43
nbfree 5200 ndir 1975 nifree 14912 nffree 1

100%

dumpfs /dev/da0s2a | grep nbfree
nbfree 51875 ndir 120875 nifree 877882 nffree 1443
cs[].cs_(nbfree,ndir,nifree,nffree):
nbfree 3167 ndir 2963 nifree 10330 nffree 6
nbfree 3583 ndir 2982 nifree 10562 nffree 4
nbfree 52 ndir 663 nifree 20677 nffree 39
nbfree 4265 ndir 2982 nifree 10131 nffree 0
nbfree 4185 ndir 2982 nifree 10340 nffree 7
nbfree 9 ndir 2465 nifree 12592 nffree 60
nbfree 2 ndir 2463 nifree 12630 nffree 34
nbfree 1642 ndir 2461 nifree 12736 nffree 19
nbfree 38 ndir 2462 nifree 12440 nffree 31
nbfree 3008 ndir 2461 nifree 12778 nffree 36
nbfree 0 ndir 633 nifree 20564 nffree 42
nbfree 0 ndir 2963 nifree 10778 nffree 22
nbfree 0 ndir 2460 nifree 12459 nffree 12
nbfree 0 ndir 2963 nifree 10667 nffree 7
nbfree 0 ndir 2963 nifree 10491 nffree 3
nbfree 51 ndir 2963 nifree 10626 nffree 35
nbfree 0 ndir 2963 nifree 10547 nffree 18
nbfree 2 ndir 2963 nifree 10673 nffree 38
nbfree 0 ndir 549 nifree 20947 nffree 40
nbfree 0 ndir 0 nifree 23552 nffree 11
nbfree 3 ndir 0 nifree 23552 nffree 0
nbfree 87 ndir 0 nifree 23552 nffree 51
nbfree 0 ndir 1319 nifree 17311 nffree 5
nbfree 30 ndir 2963 nifree 10498 nffree 17
nbfree 4586 ndir 2983 nifree 10062 nffree 2
nbfree 0 ndir 0 nifree 23552 nffree 19
nbfree 9401 ndir 388 nifree 21774 nffree 5
nbfree 2 ndir 3473 nifree 8167 nffree 113
nbfree 103 ndir 3470 nifree 8345 nffree 28
nbfree 395 ndir 3471 nifree 7913 nffree 64
nbfree 1 ndir 3467 nifree 8476 nffree 5
nbfree 1690 ndir 3486 nifree 8049 nffree 7
nbfree 5065 ndir 3486 nifree 8302 nffree 2
nbfree 5762 ndir 3485 nifree 8214 nffree 4
nbfree 5 ndir 3472 nifree 8363 nffree 9
nbfree 0 ndir 2356 nifree 13130 nffree 33
nbfree 0 ndir 0 nifree 23552 nffree 6
nbfree 0 ndir 0 nifree 23552 nffree 11
nbfree 0 ndir 2 nifree 23321 nffree 51
nbfree 0 ndir 0 nifree 23552 nffree 18
nbfree 0 ndir 40 nifree 22960 nffree 6
nbfree 6 ndir 0 nifree 23552 nffree 48
nbfree 0 ndir 0 nifree 23552 nffree 51
nbfree 506 ndir 0 nifree 23552 nffree 22
nbfree 0 ndir 2965 nifree 10371 nffree 52
nbfree 0 ndir 0 nifree 23552 nffree 17
nbfree 139 ndir 2969 nifree 10603 nffree 63
nbfree 0 ndir 1958 nifree 14708 nffree 43
nbfree 37 ndir 1962 nifree 15043 nffree 57
nbfree 237 ndir 1957 nifree 14900 nffree 17
nbfree 0 ndir 1958 nifree 15136 nffree 21
nbfree 0 ndir 2964 nifree 10118 nffree 12
nbfree 805 ndir 3005 nifree 10331 nffree 6
nbfree 561 ndir 2964 nifree 10525 nffree 10
nbfree 5 ndir 2199 nifree 14133 nffree 19
nbfree 0 ndir 1975 nifree 14289 nffree 25
nbfree 2 ndir 1974 nifree 15026 nffree 11
nbfree 2437 ndir 2923 nifree 10441 nffree 5
nbfree 4 ndir 1974 nifree 14882 nffree 36
nbfree 2 ndir 2963 nifree 10451 nffree 8

I think it would work better if dirpref were converted to a two pass
algorithm.  The first pass would only consider those cylinder groups
that had more than minfree space.  If this first pass failed, the second
pass would look at all cylinder groups.

Another change that I suspect would help is rather than comparing
cylinder groups to minbfree and minifree, calculate how many directories
containing avgfilesperdir files of size avgfilesize they could hold, and
then calculate the average and minimum threshold values of that.

It would be an interesting project to write a filesystem allocation
simulator to test different allocation algorithms without having to bang
on physical disks.
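
The two-pass idea and the "how many directories could this group hold" measure can be sketched together as follows. This is an illustration under stated assumptions, not a patch to ffs_dirpref: struct cg_summary, dirs_that_fit() and pick_cg_two_pass() are hypothetical names, the reserve is passed in as a free-block count, each directory is assumed to cost one extra inode, and fragments are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-group summary; fields mirror the dumpfs output above. */
struct cg_summary {
	int64_t nbfree;		/* free blocks */
	int64_t nifree;		/* free inodes */
};

/*
 * Number of "average" directories a group could still hold.  Each one is
 * assumed to need files_per_dir files of avg_file_size bytes plus one
 * inode for the directory itself.
 */
int64_t
dirs_that_fit(const struct cg_summary *cg, int64_t files_per_dir,
    int64_t avg_file_size, int64_t bsize)
{
	assert(cg != NULL && files_per_dir > 0 && avg_file_size > 0 && bsize > 0);
	int64_t blocks_per_dir =
	    (files_per_dir * avg_file_size + bsize - 1) / bsize;
	int64_t by_blocks = cg->nbfree / blocks_per_dir;
	int64_t by_inodes = cg->nifree / (files_per_dir + 1);

	return (by_blocks < by_inodes ? by_blocks : by_inodes);
}

/*
 * Scan from `start`, wrapping around.  Pass 0 skips groups with no more
 * than `reserve` free blocks; pass 1 looks at every group.  A group
 * qualifies when it can hold at least min_dirs average directories.
 * Returns the group index, or -1 if no group qualifies.
 */
int
pick_cg_two_pass(const struct cg_summary *cg, int ncg, int start,
    int64_t reserve, int64_t min_dirs, int64_t files_per_dir,
    int64_t avg_file_size, int64_t bsize)
{
	assert(cg != NULL && ncg > 0 && start >= 0 && start < ncg);
	for (int pass = 0; pass < 2; pass++) {
		for (int i = 0; i < ncg; i++) {
			int c = (start + i) % ncg;

			if (pass == 0 && cg[c].nbfree <= reserve)
				continue;	/* too close to full for pass 0 */
			if (dirs_that_fit(&cg[c], files_per_dir,
			    avg_file_size, bsize) >= min_dirs)
				return (c);
		}
	}
	return (-1);
}
```

Note that a group with no free blocks but 23552 free inodes scores zero under dirs_that_fit(), so a large free inode count alone no longer makes it look attractive.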