Date: Thu, 14 Mar 2002 12:35:52 +0300
From: "Parity Error" <bootup@mail.ru>
To: "Terry Lambert" <tlambert2@mindspring.com>
Cc: freebsd-fs@FreeBSD.org
Subject: Re[2]: metadata update durability ordering/soft updates
Message-ID: <E16lReK-000C3T-00@f10.mail.ru>
In-Reply-To: <3C8FA1E4.A89F52FF@mindspring.com>

I am referring not to file data but to filesystem metadata, which is
now a _delayed_ write. When we did synchronous writes to sequence the
multiple metadata updates belonging to one operation, to ensure
recoverability of that one operation, we also got inter-operation
ordering for free (and apps/users could have started depending on it).

Unix provides no guarantees regarding the order in which file data
will become stable, and apps should use fsync/O_SYNC or logging or
whatever to ensure the consistency of their data stores. But the
order in which different metadata operations become stable, if not
enforced, could result in the following scenario.

    md a
    touch a/file{0,1}{0,1}{0,1}{0,1}
    md a/b
    touch a/b/file{0,1}{0,1}{0,1}{0,1}
    < a crash happens sometime later >

After recovery, it could turn out that all of a/b/file* are there,
but only a few of a/file* are there (possibly those in the first dir
block). This kind of thing would not occur when we did synchronous
writes of metadata (disk scheduling would not affect this). unlink
could possibly produce even more dramatic effects.

Now the question is whether this kind of behaviour from the
filesystem is acceptable, and whether some applications can actually
fail badly because of it.

-----Original Message-----
From: Terry Lambert <tlambert2@mindspring.com>
To: Parity Error <bootup@mail.ru>
Date: Wed, 13 Mar 2002 11:00:52 -0800
Subject: Re: metadata update durability ordering/soft updates

Parity Error wrote:
> with soft-updates metadata updates are delayed write. I am
> wondering if, say there are two independent structural changes,
> one after another, and then a crash happens.
>
> Is there a possibility that the latter structural change got
> written to disk before the former due to some memory replacement
> policy ?

Independent writes are independent, by definition. They are
permitted to occur in either order. Metadata updates are only
ordered by soft updates insofar as necessary to satisfy dependencies.
Thus independent writes can occur in any order, but will *usually*
occur in order, because a scheduled write cannot be reordered once
it is given to the disk controller. This is due to a locking issue
on the disk operations queue in the driver, and is arguably a bug.
It's likely that some work currently in progress will proceed to the
point that the "likely ordering" of independent operations will go
away in the future, so you can't even safely depend on it being
likely.

This is normally an issue only for updates that do things like
update both an index and a record file, and imply a dependency order
in the operation. In other words, there is implied metadata between
the two files, and therefore an implied dependency.

It's the application's responsibility to signal the dependency to
the OS, so that the updates are ordered. The normal way to do this
is to use a two stage commit operation (per standard database
theory, circa IBM, 1965). In UNIX this is done by requesting that
the first operation be committed before making the request to begin
the second operation (e.g. a software barrier instruction).

To find out more about this, you should use "man fsync" and
"man open" (in the "open" page, look for "O_FSYNC").

As to misordering of dependent writes, even if you use synchronous
I/O properly...

Yes, this can happen due to the memory replacement policy on many
IDE hard drives, which lie about data having been committed to
stable storage when in fact it has only been written to the disk
write cache. That cache is far from stable storage: it is not
battery backed, and it is not guaranteed to be written to the disk
after a power failure, except on some IBM and Quantum drives which
are no longer manufactured.

You can ensure this doesn't happen to you by using only disks which
correctly support cache flush primitives and tagged command queues,
or by disabling write caching on the device. SCSI devices don't
have this problem.

Another potential problem is that some IDE disks will acknowledge
disabling write caching, but will in fact not disable it, no matter
what commands you spit at them. For some of these disks, there are
firmware updates available, but if you are unlucky enough to own one
of the others, there is usually no option but to buy a good disk
instead. May I recommend SCSI?

> could this affect the correctness of some applications ?

The disk caching issue could. The implied metadata could not. If
you have an application that uses implied metadata, but does not
take the necessary steps to signal the implied ordering dependency
to the OS, then by definition your application can't have its
correctness affected... since it has no correctness to lose. 8-).

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message
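
The commit-before-continue pattern described in the reply (commit
the first operation, then begin the second) can be sketched roughly
as follows for the index-plus-record-file case. This is only an
illustration under stated assumptions: durable_append, add_record,
the file paths, and the "key offset length" index format are all
hypothetical names invented for the example, and the code relies on
the drive actually honouring the flush, which is exactly the caveat
raised about write caches.

```python
import os


def durable_append(path, payload):
    """Append payload (bytes) to path; return the offset it was written at.

    Hypothetical helper: it returns only after fsync() has been asked to
    push the data to stable storage.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # Single-writer assumption: the current end of file is where
        # the appended payload will start.
        offset = os.lseek(fd, 0, os.SEEK_END)
        remaining = payload
        while remaining:
            written = os.write(fd, remaining)
            remaining = remaining[written:]
        # The "software barrier": do not proceed until this is committed.
        os.fsync(fd)
        return offset
    finally:
        os.close(fd)


def add_record(record_path, index_path, key, value):
    """Two-stage update: commit the record first, then the index entry."""
    # Stage 1: the record itself is written and committed.
    offset = durable_append(record_path, value)
    # Stage 2: only now write the index entry that refers to the record.
    entry = "{} {} {}\n".format(key, offset, len(value)).encode("ascii")
    durable_append(index_path, entry)
    return offset
```

If a crash lands between the two stages, the record file may hold a
record that the index does not mention, but the index should never
name bytes that were not yet committed. The opposite order (index
first) would give no such assurance.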
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?E16lReK-000C3T-00>