Date: Thu, 14 Mar 2002 12:47:19 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Parity Error <bootup@mail.ru>
Cc: freebsd-fs@FreeBSD.org
Subject: Re: metadata update durability ordering/soft updates
Message-ID: <3C910C57.71C2D823@mindspring.com>
References: <E16lReK-000C3T-00@f10.mail.ru>

Parity Error wrote:
> i am referring not to file data, but filesystem metadata, which
> is now _delayed_ write.

I understand this.  Do you understand that delaying the metadata
writes in soft updates does not affect the dependency ordering,
but may affect the time ordering?

If I have two dependent lists of operations, A-B-C and D-B-E,
then I am only guaranteed that A and D will occur before B, and
C and E will occur after B, but there is no guarantee on the
order of [A,D] vs. [D,A] or [C,E] vs. [E,C].

If I have two OTHER dependent lists of operations, Q-R and S-T,
then I am only guaranteed that Q will occur before R, and S will
occur before T, but there is no guarantee on the order of
[ [Q,S], [Q,T], [R,S], [R,T] ] vs. [ [S,Q], [T,Q], [S,R], [T,R] ];
Q-R-S-T is a valid order, as is S-T-Q-R, as is [Q-S-T-R], as is
[Q-S-R-T], etc..

> When we did synch write to sequence multiple metadata updates
> belonging to one operation for ensuring recoverability of that
> one operation, we also got inter-operation ordering for free

Yes.

> (and apps/users could have started depending on it) .

No.  Only misinformed users.  The system *never* made *any*
guarantees with regard to implied metadata.

Your statement "multiple metadata updates belonging to one
operation" is bogus.  There is no such thing as "one operation"
in this context.  Multiple metadata updates are multiple
operations, and the filesystem guarantees are only that the
operations will not return to the user until they have completed
in the guaranteed order, not that they have completed in any
time relative order compared to each other.

> Unix provides no guarantess reg the order in which file data
> will become stable, and apps should use fsync/O_SYNC or logging
> or whatever to ensure the consistency of their data stores.

That's nice, but it's irrelevant to this discussion, since file
data was never guaranteed for write anyway.

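To make the A-B-C/D-B-E and Q-R/S-T point from earlier in this
reply concrete, here is a minimal sketch in Python that lists
every completion order a set of dependency edges permits.  The
function name valid_orders and the encoding of each list as
(before, after) pairs are assumptions made for this illustration;
this is not how soft updates tracks dependencies internally.

```python
from itertools import permutations

def valid_orders(ops, edges):
    """Yield every total order of ops that respects each (before, after) edge."""
    for perm in permutations(ops):
        pos = {op: i for i, op in enumerate(perm)}
        # Keep the permutation only if no dependency edge is violated.
        if all(pos[a] < pos[b] for a, b in edges):
            yield "-".join(perm)

# Two dependent lists sharing B: A-B-C and D-B-E.
shared = list(valid_orders(
    "ABCDE", [("A", "B"), ("B", "C"), ("D", "B"), ("B", "E")]))

# Two independent dependent lists: Q-R and S-T.
independent = list(valid_orders("QRST", [("Q", "R"), ("S", "T")]))

for order in shared + independent:
    print(order)
```

With the shared node, B is pinned in the middle and only the
[A,D] and [C,E] pairs are free to swap.  With the two independent
lists, any interleaving that keeps Q before R and S before T is
permitted, which includes Q-R-S-T, S-T-Q-R, Q-S-T-R and Q-S-R-T.
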
The reason fsync/O_SYNC work to serialize the metadata operations
is that the operations are guaranteed to occur using synchronous
I/O, before they return.  In other words, they are stall barriers
instituted by the application programmer in order to get the
behaviour the users ..."could have started depending on"... on
purpose, rather than getting it as a result of an accident of the
implementation of the underlying primitives.

> But, the ordering in which different metadata operations becomes
> stables, if not enforced could result in the following scenario.

[ ... demonstration of failure of bogus assumptions ... ]

Yes.  Bogus assumptions are bogus.  That's a circular argument.
One must not make bogus assumptions, if one wants one's code to
operate reliably.

Your example is poor, as well, unless you intended the "touch"
operations to occur concurrently.

> These kind of things would not occur when we did synch write of
> metadata (disk scheduling would not affect this). unlink could
> possibly produce even more dramatic effects. Now the question is
> whether this kind of behaviour from the filesystem is acceptable
> and whether some applications can actually fail badly due to this.

A1:	The behaviour is acceptable, since the behaviour
	guarantees for metadata stability are mandated by
	operational guarantees.

	To boil this down to layman's language: the OS provides
	a set of services upon which reliable services can be
	built, if they are correctly engineered.  It is up to
	the people building the layers of services on top of the
	OS services to provide those facilities that do not
	exist within the OS proper, such that they are reliable.

	In other words, the purpose of the OS is to provide an
	unconstrained foundation.  So long as you don't mount
	the FS in such a way that the metadata updates are not
	carried out in the correct order (e.g. async), then you
	can create a system in which the ordering guarantees are
	maintained from end-to-end, and you can reliably know
	the state that you would have been in had you not
	crashed, following a crash, and can recover by rolling
	the operation forward, if all necessary data is
	available, or backward, if it is not.

A2:	Applications which expect behaviour other than that
	guaranteed by the API definitions can be expected to
	fail badly when their assumptions are proven to be
	unfounded in reality.


STANDARDS COMPLIANCE AND METADATA UPDATES, WITH A SURVEY OF OS/FS's

Certain metadata updates, such as those to ctime, mtime, and
atime, are guaranteed by the POSIX standard.  These, in turn,
imply that the containers for these objects are similarly
guaranteed, to the root operation, such that the guaranteed
operations are always reliable.

Any OS which fails to make these guarantees is, by its
definition, non-compliant with POSIX.

You can intentionally choose to operate certain filesystems in
a POSIX-non-compliant mode; for example, you can use an MFS, or
you can mount a filesystem async, such that metadata update
guarantees required for conformance to the standard are not
observed.  But you knowingly give up standards compliance when
you do this.

For example, Linux running EXT2FS mounted asynchronously fails
to comply with the POSIX standard with regard to ctime, atime,
and mtime updates, both because of the direct failure of such
updates to be committed to stable storage, and because of the
indirect failure of the updates to be committed, since the
containers are not committed, thus making the containers in
which the commits are taking place fail to comply with the
definition of "stable storage".

Another example would be FreeBSD running FFS, if you went out
of the way to mount it async, rather than sync (or with more
recent installations, with soft updates).  Similarly, mounting
it noatime also fails this test.

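The stall-barrier point made earlier (fsync does not return until
the synchronous I/O has completed) can be sketched from the
application side as follows.  This is only an illustration under
stated assumptions: the helper names create_durably and
create_in_order are hypothetical, and it assumes the platform
accepts fsync on a read-only directory descriptor, which is the
case on the BSDs and Linux but is not something to rely on
everywhere.

```python
import os
import tempfile

def create_durably(dirpath, name):
    """Create an empty file; return only after the fsync calls have returned."""
    path = os.path.join(dirpath, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.fsync(fd)   # stall barrier on the new file itself
    finally:
        os.close(fd)
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)  # stall barrier on the directory (assumed supported)
    finally:
        os.close(dfd)
    return path

def create_in_order(dirpath, names):
    """Create files one at a time; the next create starts only after the
    previous one's barriers have returned."""
    return [create_durably(dirpath, n) for n in names]

with tempfile.TemporaryDirectory() as d:
    created = create_in_order(d, ["first", "second"])
    names_on_disk = sorted(os.listdir(d))
    print(names_on_disk)
```

Here the ordering between the two otherwise independent creates
comes from the barriers the application inserted on purpose, not
from any ordering the filesystem happens to impose between
independent metadata updates.
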
If you were to mount a System V UFS in SVR4.2 by default,
without specifying "sync" or "async", then you get a behaviour
called DOW (Delayed Ordered Writes), in which an intentional
stall point is inserted between dependency convergences.  This
is similar to soft updates, in that the stall point requires
synchronization of the stable storage at the point where the
intersection would occur, but it provides only non-commutability
on non-commutable operations in a given edge, and does not
permit reordering of associativity, even though operations are
associative, and efficiency might be gained, thereby.

Thus the original A-B-C, D-B-E operation actually *must* occur
in A-B B-E ordering, with a stall between the "B" and the "B".

This only coincidentally makes a *partial* ordering guarantee on
the order of independent metadata updates -- so even here, you
can not rely on the system ordering independent updates, only
on it being standards compliant in the API guarantees.

If you want this behaviour on Linux, ReiserFS uses the USL
patented DOW technology without a license.  If you are outside
the US, and don't plan on selling into the US until at least
2018, you could use ReiserFS to get metadata update ordering
within standards guaranteed operations, and it will only stall
out as often as the SVR4.2 UFS with DOW.

But you will have the same problem with your software that
assumes -- incorrectly -- that serially requested independent
metadata updates will take place serially... when, in fact,
there is no such guarantee.

PS: FWIW, it's *possible* to generalize the soft updates
mechanism to export a transactioning interface -- actually, a
dependency edge that can be used to implement transactioning --
to user space.  The effect of doing this would be to also export
an edge of the dependency graph upward.  For two independent
graphs, implying an edge between the top nodes establishes a
precedence order on completion, and therefore guarantees
ordering of operations within a transaction.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message