Date: Sun, 27 Jan 2002 04:05:21 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: "Gary W. Swearingen" <swear@blarg.net>
Cc: freebsd-chat@FreeBSD.ORG
Subject: Re: Bad disk partitioning policies (was: "Re: FreeBSD Intaller (was "Re: ... RedHat ...")")
Message-ID: <3C53ED01.61407A02@mindspring.com>
References: <20020123124025.A60889@HAL9000.wox.org>
	<3C4F5BEE.294FDCF5@mindspring.com> <20020123223104.SM01952@there>
	<p0510122eb875d9456cf4@[10.0.1.3]>
	<15440.35155.637495.417404@guru.mired.org>
	<p0510123fb876493753e0@[10.0.1.3]>
	<15440.53202.747536.126815@guru.mired.org>
	<p05101242b876db6cd5d7@[10.0.1.3]>
	<15441.17382.77737.291074@guru.mired.org>
	<p05101245b8771d04e19b@[10.0.1.3]>
	<20020125212742.C75216@over-yonder.net>
	<p05101203b8788a930767@[10.0.1.14]>
	<gc1ygc7sfi.ygc@localhost.localdomain>
	<3C534C4A.35673769@mindspring.com>
	<0s3d0s5dos.d0s@localhost.localdomain>
"Gary W. Swearingen" wrote: > That's odd. Your example there shows relative and I interpret the rest > of your comments about hashing to imply that it's relative. I meant "relative to the size". It's an absolute scaling factor (the percentage of the disk that should be used for the free reserve is invariant). > (Maybe my use of absolute and relative wasn't clear. Absolute meant > the reserve space for good defraging (or SA reserve) wasn't (much) > dependant on partition size, while relative meant the reserve space > needs was a set fraction of the partition size.) Yep. See above. > Trust me. It's not easy to understand from this thread so far, and I > don't expect it to be; I can go to the FFS treatise for understanding. > I feel bad even seeing you spend your time trying to explain reasons. Nonsense. If I can't explain reasons, then they are unsupportable (by me, at least ;^)). > But I am asking for statements of how the algorithm behaves which > would be helpful in knowing whether to twist the -m knob or how far. The algorithm operates by hashing for selection of where to write next. If it collides, it has to do collision handling, and that inflates the cost considerably. The free reserve is intended to keep the disk empty enough to prevent hash collisions from occurring. At 85% fill, this is a probability of 1:1.040 of getting an empty area (for a perfect hash). You really need to read the FFS paper and the Knuth book, if you want to understand the math that makes it work, since I am a poor math teacher (IMO 8-)). I do much better in person, on a whiteboard, and waving my hands. The tweaks are typically to reduce the free reserve, and to reduce the threshold below which optimization will be for space filling, rather than for speed (the 5% number below which performance becomes a factor of 3 slower is the space filling optimization threshold). Dropping the free reserve decreases the required free space, and when the disk fills to the point where it is more than 85% full, then every fractional percent more full it gets after that increases the probability of collision on attempt to allocate free space via hashing. When you get a collision, you get two things: (1) the speed decreases, both writing and reading, since you are taking longer to find places to write, and the writing and reading occur in scattered chunks, instead of clusters of the FS block size, and (2) the fragmentation of the disk increases for files created or extended during the low free reserve period. It's because of the hashing that the FFS does not suffer from fragmentation, and therefore, there is no need for a "defragger"; many people don't understand this, and ask were they can get a defragger anyway. > > If you have a friend who is a statistician, you should > > ask them to explain "The Birthday Paradox" to you. > > I've read about it several times, always forgetting the math, but I > remember you need only about 50 people for a 0.5 match probability. 23. For 50 people, the probability is 96.5%. Basically, people's birthdays are hashed using modulus 365, and you are checking for hash collisions after hashing everyone into one of 365 buckets based on their birthday. There's a really nice statistical explanation at: http://www.howstuffworks.com/question261.htm 8-). 365 is a really bad number for a hash, since it not prime at all. It's far from a "perfect hash"/"Fibbonacci hash". > > You'd probably benefit from reading the original FFS paper. > > No doubt. 
> > You'd probably benefit from reading the original FFS paper.
>
> No doubt.  Though I trust you that the performance of the algorithm
> is not a function of the partition size, but of the reserve relative
> to that size (and the space filled relative to that size), I'll need
> to read more to believe that I care as much about poor performance on
> a relatively full big disk as on a small one.  For example, I might
> accept slow performance to get an extra 5 GB when I wouldn't for 50 MB.

Relative to the size of your disk: people complain about even a very
small free reserve percentage on very large disks, mostly because
they grew up in an era when "that was a lot of space!".  The reality
is that the algorithm needs a certain percentage of the space to work
correctly, and if you take that away, then it doesn't work correctly.

If you want to use a different algorithm, fine.  But so far, the only
competing one that seems to be worth anything is to extent- or
log-structure your FS, and then spend CPU cycles, and some percentage
of your disk access cycles, on a "cleaner" process that follows
around and manually defragments the disk behind the users using it.
This process is expensive enough that, even for disks filled to under
where the free reserve would be, you are paying a performance penalty
on any disk holding more data than a single "cleaner" relocation
block.

In other words, there's a trade-off.  If you assume your disks are
full, or you know your limiting factor is going to be I/O, and never
CPU, then you might be better off using a different approach.  For
general purpose use, though, FFS has served us well for a couple of
decades now.  8-).

> > You know, you could worry about something else... like
> > the fact that a formatted disk has less capacity than an
> > unformatted one.
>
> I probably would, if there was a poorly-documented knob for that too.

8-).

> But when I read silly recommendations to set the swap/RAM knob to 2,
> regardless of the size of RAM or applications, I find it easy to
> question other recommendations for which the justification is only deep
> in the source or developer archives or even hairy treatises or seemingly
> wrong (as the above tunefs(8) quote).

It used to be that the swap/RAM knob was 1:1.  It became 2:1 by
default when we started doing memory overcommit in UNIX, and it
really hasn't been reexamined much since then.

Really, it'd probably be a good idea to find a reasonable way to let
swap take up disk space until you ran out, on the theory that the
limiting factor will be the limiting factor: whether it turns out to
be swap space or disk space doesn't matter, and it's preferable to
exceed administrative limits (at least up to the limits of the
available hardware) than to fail to do the job you intended the
hardware to do.  NeXTStep did this, and Windows does this currently
(that's the real reason for the API to get the physical block list
for a file, which we discussed as a way of putting a FreeBSD disk
into an NTFS file, in the "partitioning" thread).

The problem with doing "swap files" is that accessing swap through an
FS adds another level of indirection (that's what the Windows direct
sector list access API takes out, but it can't make swap files as
fast as a raw swap partition, because it can't guarantee physical
adjacency of logically adjacent file blocks... unless the disk isn't
very full).

> Actually, my worry was not really in how something worked or could be
> optimized as much as it was a response to what I find to be a poorly
> documented config setting.  If it just said "leave this to experts" I
> probably wouldn't have brought it up.  But when I read the tunefs quote
> above, I see an implication that I'm quite sure is absolutely wrong: It
> implies that the throughput will always be poor, regardless of how full
> the disk is.  That is misleading and tends to make people twist the knob
> less far than they would if the statement expressed the truth better.
> Maybe it only needs to change "throughput" to "worst-case throughput" or
> "near-full throughput".  It's also quite common-sensical to think that
> the reserve wouldn't be as necessary for big disks as it was for small
> ones.  Better documentation would head off many FAQs on this issue.

This issue has been discussed many times before.  It's in the
literature, and it's in the FreeBSD list archives dozens of times, at
least.  8-).

To address your suggestion: that wording would imply that you could
get non-worst-case performance on a disk that is nearly full, right
up against a very small, administratively selected free reserve.  The
real answer is that the more data you put on the disk above the
optimal free reserve for the block selection algorithm, the worse the
performance will be, and "worst case" is defined as "the last write
before hitting the free reserve limit".  So disk performance degrades
steadily, the fuller it gets over the optimal free reserve (which is
~15%, much higher than the free reserve kept on most disks).
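To put rough numbers on "degrades steadily": here's a toy model (my
own illustration; the real FFS allocator is considerably smarter than
this) in which each allocation probes uniformly at random for a free
block, so the expected number of probes is 1/(1 - fill):

#include <stdio.h>

/*
 * Toy model of hashed free-block allocation: if a fraction "fill"
 * of the blocks is in use and each probe lands uniformly at random,
 * a probe finds a free block with probability (1 - fill), so the
 * expected number of probes per allocation is 1 / (1 - fill).
 */
int
main(void)
{
	double fill;

	for (fill = 0.85; fill < 0.995; fill += 0.02)
		printf("%3.0f%% full: %6.1f probes expected\n",
		    fill * 100.0, 1.0 / (1.0 - fill));
	return (0);
}

At 85% full you expect about 7 probes per allocation, at 95% about
20, and at 99% about 100: there is no single cliff, so every
additional fractional percent over the reserve costs more than the
last one did.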
The other misleading thing is that, once written fragged, a file will
remain fragged, even if you drop the system back down below the
"space optimization" limit, or even below the "free reserve" limit.
If that happens to an important file, then you are screwed, since
there's no defragger, since the system was never designed to be run
with the disk filled over the free reserve.  Thus it's a good idea to
keep a large free reserve, so that a runaway user process can't screw
up the on-disk access speed of important files belonging to another,
more important process.

BTW: "root" is immune to the free reserve limit; that's why you can
sometimes see disks that are "110% full": the percentage is a
calculation based on the amount used, divided by the space available
_under the reserve_.  (For example, with an 8% reserve, a disk that
root has filled completely shows 100/(100 - 8), or about 109%, in
use.)

BTWBTW: If you screw up an important file this way, you can fix it by
backing it up, deleting it, and restoring it once the disk has
dropped back down to the optimal free reserve.  This is known as "the
poor man's defragger".

-- 
Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message