From: Terry Lambert
Subject: Crash recovery: SU vs. LFS vs. JFS
To: mbendiks@eunet.no (Marius Bendiksen)
Cc: tuinstra@clarkson.edu (Dwight Tuinstra), freebsd-fs@FreeBSD.ORG (freebsd-fs)
Date: Thu, 21 Sep 2000 21:08:52 +0000 (GMT)

> > As a long-term graduate research project, I've been looking into
> > the code for LFS (Log-structured File System) on NetBSD.  Such a
> > system is optimized for many small writes, and given the amounts
> > of RAM available for read caches nowadays, should deliver read
> > performance comparable to (or not much worse than) FFS.  Additionally,
> > LFS should provide better and faster crash recovery than either FFS
> > or journaling file systems.
>
> Research (IIRC, Seltzer and Matthews) has shown that FFS outperforms
> LFS when the FFS clustering code has been activated.  The crash
> recovery times can supposedly be alleviated by soft updates; I've not
> looked at that yet.  Journalling crash recovery vs. LFS crash recovery
> is more complex than a mere comparison of speed, as these can be tuned
> in both cases.

The crash recovery of soft updates can be sped up considerably, in
theory.  In practice, you can't tell whether the reason for the crash
was an FS fault or some other fault.

Further, for most drives, if you had a DC failure at the drive in the
middle of an actual write, you can get single-sector format corruption
(my personal opinion is that if this is possible, so is multiple-sector
corruption, if timed just right).  This means that you need NVRAM for
at least sector logging, and at most track logging, in order to ensure
replay-based de-corruption of in-progress writes.  Further, consider
that the drive may have a track cache, which can result in data out of
range (for the OS) being written, and the corruption occurring there.

The _only_ software approach for soft updates is really soft read-only,
and that only works if the system is quiescent at the time of the crash
(soft read-only in force, and the FS marked clean).  This ignores the
possibility of a software failure (i.e., the failure is assumed to be
hardware or power).  In the software failure case, there is no telling
what data was corrupted, or whether some of it was written to disk
before the corruption hit something the system noticed sufficiently to
actually fail.

Thus, soft updates are not a good strategy for fast failure recovery.

---

LFS is probably fastest, but does not support implied relationships
between data.

For example, if I have a data file and an index file, and I two-stage
commit an update by writing new data and then a new index, and the
failure occurs between these two operations, I potentially lose my
transaction.  If this were a bank transaction, I would be really
hurting in the wallet.  If it were a different kind of transaction, it
might carry less financial risk, but the correctness risk is the same.

NB: the database above obviously writes the new record to a different
record slot, so that the old one is still a valid record in case of
failure; the operation is thus non-atomic, but idempotent.
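To make the window concrete, here is a minimal user-level sketch of
that two-stage commit.  The file names, record layout, and helper
functions are hypothetical (this is not LFS or database code, just an
illustration): the new record is appended to the data file and synced,
and only then is the index slot rewritten to point at it.  A crash
between the two stages loses the update, but the old record the index
still points at remains valid, and re-running the update does no harm.

/*
 * Hypothetical two-stage commit: records.dat holds fixed-size records,
 * records.idx holds one offset per key.  Non-atomic, but idempotent.
 */
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define RECSIZE 512

/* Stage one: append the new record image and force it to disk. */
static off_t
write_new_record(int datafd, const char *rec)
{
	char buf[RECSIZE];
	off_t off;

	memset(buf, 0, sizeof(buf));
	strncpy(buf, rec, sizeof(buf) - 1);
	if ((off = lseek(datafd, 0, SEEK_END)) == -1)
		return (-1);
	if (write(datafd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		return (-1);
	if (fsync(datafd) == -1)
		return (-1);
	return (off);
}

/* Stage two: point index slot 'key' at the new record and force it out. */
static int
update_index(int idxfd, int key, off_t off)
{
	if (pwrite(idxfd, &off, sizeof(off),
	    (off_t)key * sizeof(off)) != (ssize_t)sizeof(off))
		return (-1);
	return (fsync(idxfd));
}

int
main(void)
{
	int datafd, idxfd;
	off_t off;

	if ((datafd = open("records.dat", O_RDWR | O_CREAT, 0644)) == -1 ||
	    (idxfd = open("records.idx", O_RDWR | O_CREAT, 0644)) == -1)
		err(1, "open");

	if ((off = write_new_record(datafd, "new balance: $17")) == -1)
		err(1, "write_new_record");

	/*
	 * Crash window: if the system dies here, the index still points
	 * at the old record; the transaction is simply lost.
	 */

	if (update_index(idxfd, 0, off) == -1)
		err(1, "update_index");

	return (0);
}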
The way LFS normally recovers is to go back to the oldest timestamped
and valid-marked log as "the correct state of the FS", and to discard
any partial logs (and, with them, the implied metadata).

---

JFS is slower.  A JFS recovers nearly the same way as LFS, but must
look at outstanding transactions, and actually back them out if they
are incomplete, or roll them forward, if it can.  The difference
between these is whether the journal contains only a journal of events
that have transpired, or a journal of both events that have transpired
and events intended to transpire but not yet committed to the disk.

This means that a JFS will keep implied relationships between data
intact, so long as it is signalled before and after a transaction
involving an implied relationship takes place.  So long as the
transaction completion is not signalled to the client application
until after the transaction has been committed (this requires another
hook to user space), we have no problem with bank transactions.  (A
minimal sketch of such an intent journal follows at the end of this
message.)

---

Someone said that LFS logs data and metadata, but a JFS logs only
metadata, and that this is the difference.  Obviously, this is untrue;
a JFS logs transactions.  Whether those transactions include only
metadata, or both metadata and data, is really a JFS implementation
detail, not an attribute of JFSes themselves.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
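Here is the minimal sketch referred to above of an "intended to
transpire" journal, with hypothetical file names and record layout and
no claim of matching any particular JFS implementation.  The point is
the ordering: the intent record is synced to the journal before the
data is touched, the completion record is synced afterwards, and only
then is the client told the transaction committed; recovery backs out
or rolls forward any transaction that has an intent record but no
completion record.

/*
 * Hypothetical write-ahead intent journal: log intent, sync, do the
 * work, log completion, sync, and only then acknowledge the client.
 */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct jrec {
	uint64_t txid;		/* transaction id */
	uint32_t state;		/* 0 = intent, 1 = committed */
	char	 what[52];	/* description of the operation */
};

static void
journal_append(int jfd, uint64_t txid, uint32_t state, const char *what)
{
	struct jrec r;

	memset(&r, 0, sizeof(r));
	r.txid = txid;
	r.state = state;
	strncpy(r.what, what, sizeof(r.what) - 1);
	if (write(jfd, &r, sizeof(r)) != (ssize_t)sizeof(r) ||
	    fsync(jfd) == -1)
		err(1, "journal_append");
}

int
main(void)
{
	int jfd;

	if ((jfd = open("intent.log",
	    O_WRONLY | O_CREAT | O_APPEND, 0644)) == -1)
		err(1, "open");

	journal_append(jfd, 1, 0, "move $17 from A to B");	/* intent */

	/*
	 * ... perform the data and index writes here; a crash now leaves
	 * an intent record with no completion record, so recovery knows
	 * to back the transaction out or roll it forward.
	 */

	journal_append(jfd, 1, 1, "move $17 from A to B");	/* committed */

	/* Only now is the client application told the transaction is done. */
	return (0);
}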