Date:      Fri, 15 Mar 2002 06:56:51 -0800
From:      Josh MacDonald <jmacd@CS.Berkeley.EDU>
To:        Terry Lambert <tlambert2@mindspring.com>, Parity Error <bootup@mail.ru>
Cc:        freebsd-fs@FreeBSD.ORG, reiserfs-dev@namesys.com
Subject:   Re: metadata update durability ordering/soft updates
Message-ID:  <20020315065651.02637@helen.CS.Berkeley.EDU>
In-Reply-To: <3C910C57.71C2D823@mindspring.com>; from Terry Lambert on Thu, Mar 14, 2002 at 12:47:19PM -0800
References:  <E16lReK-000C3T-00@f10.mail.ru> <3C910C57.71C2D823@mindspring.com>

Quoting Terry Lambert (tlambert2@mindspring.com):
> Parity Error wrote:
> > i am referring not to file data, but filesystem metadata, which
> > is now _delayed_ write.
> 
> I understand this.  Do you understand that delaying the metadata
> writes in soft updates does not affect the dependency ordering, but
> may affect the time ordering?
> 
> If I have two dependent lists of operations, A-B-C and D-B-E,
> then I am only guaranteed that A and D will occur before B,
> and C and E will occur after B, but there is no guarantee on
> the order of [A,D] vs. [D,A] or [C,E] vs. [E,C].
> 
> If I have two OTHER dependent lists of operations, Q-R and S-T,
> then I am only guaranteed that Q will occur before R, and S
> will occur before T, but there is no guarantee on the order of
> [ [Q,S], [Q,T], [R,S], [R,T] ] vs. [ [S,Q], [T,Q], [S,R], [T,R] ];
> Q-R-S-T is a valid order, as is S-T-Q-R, as is [Q-S-T-R], as is
> [Q-S-R-T], etc..
> 
> > When we did synch write to sequence multiple metadata updates
> > belonging to one operation for ensuring recoverability of that
> > one operation, we also got inter-operation ordering for free
> 
> Yes.
> 
> > (and apps/users could have started depending on it) .
> 
> No.  Only misinformed users.  The system *never* made *any*
> guarantees with regard to implied metadata.  Your statement
> "multiple metadata updates belonging to one operation" is
> bogus.  There is no such thing as "one operation" in this
> context.  Multiple metadata updates are multiple operations,
> and the filesystem guarantees are only that the operations
> will not return to the user until they have completed in
> the guaranteed order, not that they have completed in any
> time relative order compared to each other.
> 
> 
> > Unix provides no guarantees regarding the order in which file data
> > will become stable, and apps should use fsync/O_SYNC or logging
> > or whatever to ensure the consistency of their data stores.
> 
> That's nice, but it's irrelevant to this discussion, since
> file data was never guaranteed for write anyway.
> 
> The reason fsync/O_SYNC work to serialize the metadata
> operations is that the operations are guaranteed to occur
> using synchronous I/O, before they return.
> 
> In other words, they are stall barriers instituted by the
> application programmer in order to get the behaviour the
> users ..."could have started depending on"... on purpose,
> rather than getting it as a result of an accident of the
> implementation of the underlying primitives.
> 
> > But the ordering in which different metadata operations become
> > stable, if not enforced, could result in the following scenario.
> 
> [ ... demonstration of failure of bogus assumptions ... ]
> 
> Yes.  Bogus assumptions are bogus.  That's a circular argument.
> One must not make bogus assumptions, if one wants one's code
> to operate reliably.
> 
> Your example is poor, as well, unless you intended the "touch"
> operations to occur concurrently.
> 
> 
> > These kinds of things would not occur when we did synch write of
> > metadata (disk scheduling would not affect this). unlink could
> > possibly produce even more dramatic effects.  Now the question is
> > whether this kind of behaviour from the filesystem is acceptable
> > and whether some applications can actually fail badly due to this.
> 
> A1: The behaviour is acceptable, since the behaviour guarantees
> for metadata stability are mandated by operational guarantees.
> 
> To boil this down to layman's language: the OS provides a set of
> services upon which reliable services can be built, if they are
> correctly engineered.  It is up to the people building the layers
> of services on top of the OS services to provide those facilities
> that do not exist within the OS proper, such that they are reliable.
> 
> In other words, the purpose of the OS is to provide an unconstrained
> foundation.  So long as you don't mount the FS in such a way that
> the metadata updates are not carried out in the correct order, (e.g.
> async), then you can create a system in which the ordering guarantees
> are maintained from end-to-end, and you can reliably know the state
> that you would have been in had you not crashed, following a crash,
> and can recover by rolling the operation forward, if all necessary
> data is available, or backward, if it is not.
> 
> 
> A2: Applications which expect behaviour other than that guaranteed
> by the API definitions can be expected to fail badly when their
> assumptions are proven to be unfounded in reality.
> 
> 
> STANDARDS COMPLIANCE AND METADATA UPDATES, WITH A SURVEY OF OS/FS's
> 
> Certain metadata updates, such as those to ctime, mtime, and
> atime, are guaranteed by the POSIX standard.  These, in turn, imply
> that the containers for these objects are similarly guaranteed, to
> the root operation, such that the guaranteed operations are always
> reliable.  Any OS which fails to make these guarantees is, by its
> definition, non-compliant with POSIX.
> 
> You can intentionally choose to operate certain filesystems in a
> POSIX-non-compliant mode; for example, you can use an MFS, or you
> can mount a filesystem async, such that metadata update guarantees
> required for conformance to the standard are not observed.  But you
> knowingly give up standards compliance when you do this.
> 
> For example, Linux running EXT2FS mounted asynchronously fails
> to comply with the POSIX standard with regard to ctime,
> atime, and mtime updates, both because of the direct failure of
> such updates to be committed to stable storage, and because of the
> indirect failure of the updates to be committed, since the containers
> are not committed, thus making the containers in which the commits
> are taking place fail to comply with the definition of "stable
> storage".
> 
> Another example would be FreeBSD running FFS, if you went out of the
> way to mount it async, rather than sync (or with more recent
> installations, with soft updates).  Similarly, mounting it noatime
> also fails this test.
> 
> If you were to mount a System V UFS in SVR4.2 by default, without
> specifying "sync" or "async", then you get a behaviour called DOW
> (Delayed Ordered Writes), in which an intentional stall point is
> inserted between dependency convergences.  This is similar to soft
> updates, in that the stall point requires synchronization of the
> stable storage at the point where the intersection would occur, but
> it provides only non-commutability on non-commutable operations in
> a given edge, and does not permit reordering of associativity, even
> though operations are associative, and efficiency might be gained,
> thereby.  Thus the original A-B-C, D-B-E operation actually *must*
> occur in A-B B-E ordering, with a stall between the "B" and the "B".
> This only coincidentally makes a *partial* ordering guarantee on the
> order of independent metadata updates -- so even here, you cannot
> rely on the system ordering independent updates, only on it being
> standards compliant in the API guarantees.
> 
> If you want this behaviour on Linux, ReiserFS uses the USL patented
> DOW technology without a license.  If you are outside the US, and
> don't plan on selling into the US until at least 2018, you could
> use ReiserFS to get metadata update ordering within standards
> guaranteed operations, and it will only stall out as often as the
> SVR4.2 UFS with DOW.  But you will have the same problem with your
> software that assumes -- incorrectly -- that serially requested
> independent metadata updates will take place serially... when, in
> fact, there is no such guarantee.

Terry,

I'm not sure what you're talking about with regard to DOW and
ReiserFS.  It doesn't sound right, and I'm pretty sure we're not using
anything like the patented DOW technique as you've described it.

We are developing a transaction facility for many of the reasons
hinted at by the original post in this thread.

To summarize:

- The file system has never made any guarantees.

- You can use fsync() to stabilize a single file and its metadata
dependencies.

- You can use two-phase commit above and beyond that.

- If you're not doing the right thing, "then by definition, your
application can't have its correctness affected... since it has no
correctness to lose."

- And, "the OS provides a set of services upon which reliable services
can be built, if they are correctly engineered."

All of these statements are true.  Your attitude seems to be that this
is a fine state of affairs, that anyone who writes an application
should be fully informed of all these "transactional" issues, and that
anyone who is not fully informed of all these issues is a complete
moron if they expect to write reliable applications.

The problem is that you're asking way too much of the average
programmer, who doesn't understand transactions and isn't aware of how
little the operating system actually guarantees in this regard.
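
To make that concrete, here is a minimal sketch (in Python, not taken
from FreeBSD or ReiserFS) that enumerates the completion orders
permitted by the dependency lists in the quoted examples.  The
function name valid_orders and the encoding of each dependency as a
(first, second) pair are assumptions made purely for illustration.

```python
from itertools import permutations


def valid_orders(ops, before):
    """Return every total order of `ops` that respects `before`.

    `before` is a list of (first, second) pairs meaning that `first`
    must complete before `second`.  Brute force over all permutations,
    which is fine for a handful of operations.
    """
    result = []
    for perm in permutations(ops):
        pos = {op: i for i, op in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in before):
            result.append("".join(perm))
    return result


# Two independent chains: Q before R, and S before T.
chains_qrst = valid_orders("QRST", [("Q", "R"), ("S", "T")])

# Two chains sharing B: A-B-C and D-B-E.
chains_abcde = valid_orders(
    "ABCDE", [("A", "B"), ("D", "B"), ("B", "C"), ("B", "E")]
)

print(len(chains_qrst), chains_qrst)
print(len(chains_abcde), chains_abcde)
```

Even these tiny examples admit several legal orders, and an
application that assumes one particular order is relying on something
nobody promised.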

The other problem is that fsync() and two-phase-commit can seriously
limit application performance, unless you use highly sophisticated
techniques, which again rules out the average programmer.

The fact is, it is very difficult to write "reliable services" on top
of the standard primitives, and it is not good enough to call people
morons if they don't understand this.

There is a document describing our transactions design for ReiserFS
version 4, which is currently under development:

   http://namesys.com/txn-doc.html

And somewhat off topic, I have demonstrated that using fsync() and
rename() as a means for reliable, atomic file updates can seriously
limit application performance and that having file system transactions
solves the problem.  My point is that applications will perform
better, not worse, if the operating system helps construct reliable
services instead of this do-it-yourself approach.
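
For readers who have not seen it, the fsync()-and-rename() idiom looks
roughly like the following.  This is a minimal Python sketch for a
POSIX system, not code from the thesis; atomic_write is a hypothetical
helper name, and whether the result is actually durable depends on the
file system and mount options discussed earlier in this thread.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of `path` with `data` (bytes).

    Sketch of the usual idiom: write a temporary file in the same
    directory, fsync it, rename it over the target, then fsync the
    directory so the rename itself reaches stable storage.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with mode 0600; adjust if that matters.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # stabilize the new file's data
        os.replace(tmp_path, path)   # rename() over the old name
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Assumption: opening a directory read-only and calling fsync on it
    # works on this platform (true on typical POSIX systems).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)             # stabilize the directory entry
    finally:
        os.close(dir_fd)
```

Every update made this way pays for two fsync() calls, one on the file
and one on its directory, and the application stalls on both.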

Master's thesis:

   http://prdownloads.sourceforge.net/xdelta/xdfs.pdf

and the graph that shows it all:

   http://www.cs.berkeley.edu/~jmacd/xdfs-vs-rcs.eps

Regards,

-josh

-- 
PRCS version control system    http://sourceforge.net/projects/prcs
Xdelta storage & transport     http://sourceforge.net/projects/xdelta
Need a concurrent skip list?   http://sourceforge.net/projects/skiplist
