Date: Mon, 10 Mar 1997 15:10:38 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: steve@visint.co.uk (Stephen Roome)
Cc: freebsd-hackers@freebsd.org
Subject: Re: Hard Link Count too small!
Message-ID: <199703102210.PAA23655@phaeton.artisoft.com>
In-Reply-To: <Pine.BSF.3.91.970310131830.23316B-100000@bagpuss.visint.co.uk> from "Stephen Roome" at Mar 10, 97 01:23:39 pm
> Asynchronously mounted filesystems might not be POSIX compliant, but
> synchronous is unbelievably slow at times. I'm not sure how much it
> is dependent on the actual disk/driver/buffering, but for me rm -rf
> seems to be at least an order of magnitude faster when the filesystem
> is mounted asynchronously.
>
> Is there not some way that perhaps mount could be more sensible and
> perhaps do some of its I/O async, or is it really that unsafe?

It is dangerous not to guarantee order of operation for metadata, since
it is impossible to disambiguate what source state led to the current FS
state if you can't know that there was only a single metadata operation
outstanding before a crash.

You can run async if you want, with this risk. This is the same risk
that Linux users of Ext2fs face (Ext2fs runs async by default).

The most correct way to guarantee order of operation that has been found
so far is soft updates, from the UKY paper by Ganger and Patt. The
implementation in Appendix A of the online version of their paper is
SVR4 specific. It also has a number of omissions and typos (obvious when
you try to compile it), though these are easy to fix. Keith Bostic is
known to be working on integrating soft updates into BSD 4.4-Lite2's
FFS.

The problem with the Ganger/Patt approach, which is echoed in Bostic's
work from the reports I have heard, is that it is implementation
specific. A directed acyclic dependency graph is defined for a single
FS, transitive closure is computed over the graph once, and the code is
then generated from the closure computation. It looks more like a
simplified routing graph, because they aren't trying to solve the
problem generally.

The first obvious consequence of this approach is that it works for only
a single file system: all of the other FSs which currently share code
with FFS would have to be broken out as separate code, and therefore
separate kernel bloat: LFS, MFS, the BSD EXT2FS, etc.

A less obvious consequence is that the directed graph in the FreeBSD
case will be edge-opaque as a result of the VM/buffer cache unification
(I have discussed this extensively with a number of professional kernel
engineers, including several USL people, several Network Appliance
people, and one of the engineers responsible for the Dell UNIX product).
The net effect is that the graph in the FreeBSD case deteriorates to the
"worst-case" soft update behaviour in all cases, which makes it
(effectively) equivalent to DOW (Delayed Ordered Writes, used in
UnixWare 2.x SMP). This will still perform better than an un-unified
cache without even the "worst-case" behaviour (for instance, BSDI),
since the effect of cache unification is marginally greater than the
effect of moving from DOW to soft updates.

> I know ppl seem to want POSIX compliancy, but for some things (like this)
> is it really worth it to be losing so much speed..

The POSIX compliance had to do with whether or not the times were marked
for update vs. actually being updated. If you made soft update or DOW
ordering guarantees for the metadata writes (which the async case CAN
NOT make), then you would maintain POSIX compliance.
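To make the ordering idea concrete, here is a toy sketch in the
abstract. This is NOT the Ganger/Patt code and not anything from the FFS
sources; every name in it is made up for illustration. Each dirty
metadata buffer remembers which other buffer has to reach the disk
first, and the flush path walks that dependency before writing, so the
on-disk image never names metadata that has not yet been written:

/*
 * Toy sketch of dependency-ordered metadata writes.  All identifiers
 * here are invented for illustration only.
 */
#include <stdio.h>

struct metabuf {
	const char	*name;		/* e.g. "inode block" */
	int		 written;	/* has this buffer reached disk? */
	struct metabuf	*depends_on;	/* must be on disk before us */
};

/* Pretend to issue a synchronous write of one buffer. */
static void
disk_write(struct metabuf *bp)
{
	printf("writing %s\n", bp->name);
	bp->written = 1;
}

/*
 * Flush a buffer, first flushing anything it depends on.  The
 * dependencies form a DAG, so the recursion terminates.
 */
static void
flush_ordered(struct metabuf *bp)
{
	if (bp->written)
		return;
	if (bp->depends_on != NULL)
		flush_ordered(bp->depends_on);
	disk_write(bp);
}

int
main(void)
{
	/*
	 * A file create: the new inode must be on disk before the
	 * directory entry that names it.
	 */
	struct metabuf inode  = { "inode block", 0, NULL };
	struct metabuf dirent = { "directory block", 0, &inode };

	flush_ordered(&dirent);	/* inode block, then directory block */
	return (0);
}

Running async gives up exactly this: the two writes can land in either
order, and a crash between them can leave a directory entry pointing at
an inode that was never written.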
As it is, the order of change/mark and change/update is ambiguous (and
Bruce is wrong, even though I think the particular place he is wrong was
snuck into POSIX by some overzealous VMS weenie, so it isn't Bruce's
fault -- it has to do with access time updates in the getdents case and
modification time updates in the inode metadata case; in particular,
size and other metadata updates that occur as a result of extending the
file, a very special case, but a frequent one for news servers).

> Also, exactly how 'dangerous' is async, as far as I know linux is async
> and I don't hear that many people complaining that their filesystems
> have all been destroyed. (I'm probably wrong about this, as I don't keep
> up with linux much anymore)

The insidious part of this is that, after a crash, the FS is returned to
a consistent state. The problem is that you can't know if it's the
*right* consistent state.

With the number of operations that can be outstanding, you can only
guarantee one order of unroll:

	2^(n) -> 2^(n-1)	for n >= 1

For any n != 1, there are 2^(n-1) - 1 ambiguous base states that could
have led to the ambiguous state that you want to "repair" from. When you
mount async, you let n get larger than 1 (2^(1-1) - 1 = 0 for the
non-async case means that there is no ambiguity if you only allow one
metadata operation at a time to occur, in order).

For instance, say you async write 9 metadata operations, and the system
crashes after doing 4 of them, in any order (async == "in any order").
This leaves you with 2^(5-1) - 1, or 15 possible incorrect states out of
16 possible source states to which you might "repair" your FS. You have
a 1 in 16 chance of guessing the right one. No matter which one you
guess, all 16 are "consistent". But only one is "right". So the FS
"looks" repaired after the crash. (A trivial calculator for these
figures is appended after my signature.)

If you are using an index file of some kind, then you have "implied
state" across two files: you imply a guarantee that if you update one
file, you will update both files or neither in the event of a crash.
This may mean regenerating the index each time you reboot following a
crash.

The answer to "exactly how 'dangerous' is async" depends on a lot of
factors. You can calculate it for yourself; start with the MTBF of your
system for all components in the write path (motherboard, CPU, RAM,
cache, disk controller, disk, power supply, etc.). Then you will need
the probability scaling factor for the propagation latency of a write
through the machine (how long it stays in the queue before it is
forcibly flushed, etc.). This is, basically, the amount of time between
the call and the actual write taking place.

The easy answer is "it depends". The harder answer is "if you don't know
the real answer, are you comfortable taking chances with the data?".

If you are doing a news server, is this a posting host, or only a
reading host? If a reading host, then you can regenerate all of your
news data following a crash, so the probability bears on latency to
restart: is that unacceptable for your probability? If a posting host,
then the news posted by your users can be lost, and will be
unrecoverable. Is that type of denial of service unacceptable for your
probability? Etc.

For applications other than news servers, the risk is proportional to
the calculated probability of a failure, the ability to recover the
data, and the service expectations of the people using the service.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
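Appended, as promised: a toy calculator for the ambiguity figures above.
It simply evaluates the 2^(n-1) expression from the message for small n;
n = 5 reproduces the "15 of 16" example. Nothing in it is real
filesystem code, and the formula is taken as given in the text, not
derived independently.

/*
 * Evaluate the ambiguity formula quoted above: with n metadata
 * operations outstanding at the crash, 2^(n-1) "consistent" source
 * states, of which 2^(n-1) - 1 are wrong.
 */
#include <stdio.h>

int
main(void)
{
	unsigned n;

	for (n = 1; n <= 9; n++) {
		unsigned long candidates = 1UL << (n - 1);	/* 2^(n-1) */
		unsigned long wrong = candidates - 1;		/* 2^(n-1) - 1 */

		printf("n = %u: %lu candidate states, %lu wrong, "
		    "1 in %lu chance of the right one\n",
		    n, candidates, wrong, candidates);
	}
	/* n = 5 gives 16 candidates, 15 wrong, as in the example above. */
	return (0);
}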