Message-ID: <3E27DA7F.D5DBEFB@mindspring.com>
Date: Fri, 17 Jan 2003 02:27:11 -0800
From: Terry Lambert
To: David Schultz
Cc: Jason Schoonover, freebsd-fs@FreeBSD.ORG
Subject: JFS vs. Soft Updates (again) (was: Re: large filesystem, journaling filesystem support)

This posting is in favor of a JFS. It gives detailed technical
arguments about why some of the soft updates claims some people
are making are actually incorrect.

For the record, Kirk McKusick has stated on FreeBSD -arch that
background fsck has the problems I note, in passing, below.

> FreeBSD uses softupdates, which achieves similar efficiency and
> reliability goals to journaling.
> With softupdates, you don't need
> to fsck at all at boot time following a power failure or crash
> because the worst case scenario (hardware failure aside) is that
> some disk space that is really free is marked as allocated.

No, the worst case following a power failure is a screwed disk
track.

Modern disk drives read and write a track at a time; this avoids
the rotational latency that would occur if you waited for a hard
"sector start" marker to come around, and it avoids the need for
"low level formatting".

For a very small window of time in the late 1990s, two
manufacturers, IBM and Quantum, created disk drives which were
capable of using rotational energy as a power source (regenerative
braking) to complete a write in progress following a DC failure
(this provided a small post-failure hold-up time). Modern disk
drives no longer do this, because disk manufacturers are morons
(or one was a moron, and the others had to compete on price, which
amounts to the same thing).

The net result is that a DC failure can result in an entire track
getting trashed, if it happens at the right time.

So why is this important?

Soft updates optimizes for sector writing, not track writing, while
journalling can journal on the basis of track-sized extents, if it
is written correctly. There are a number of technical challenges to
writing this correctly, and SGI, IBM, and Linux haven't done it, but
it's theoretically possible. It is very hard on IDE, and much easier
on SCSI, because the physical geometry can be read from the drive's
mode pages (the format device page, 3, and the rigid disk geometry
page, 4).

The upshot of this is that a journalled FS can recover any damage
from a power failure, if needs be, whereas if this were to happen
on a disk protected by soft updates, you are screwed.

Journalling and soft updates are orthogonal technologies; they do
not solve the same problem space, although there is some minor
overlap.

> In FreeBSD 5.0, you can actually run fsck in the background at any
> time to reclaim this space.

In fact, this is not true.
You can only run a fsck in the background if you know that the
failure mode was a power failure, and that no data was corrupted.
This is not something you can know for certain without CMOS.

A panic can result in corrupted disk buffers being flushed to the
disk prior to the panic. A hardware failure can do the same. And a
power failure can result in a corrupt track.

In all three of these cases, a background fsck is unable to recover
the system appropriately.

Neither is it possible to mount the FS read-only and then make it
read/write one cylinder group at a time as the fsck proceeds: until
all of the areas of the disk have been checked, it's possible to
load and run corrupt code that then corrupts a previously OK area
of the disk.

The only reasonable fix is a CMOS area that contains a failure
condition code. Unfortunately, one of FreeBSD's failure modes is a
spontaneous reboot; this is the normal failure mode for PC hardware
on a triple fault, which may occur as a result of a condition that
should result in a panic (corruption of kernel memory), if the
memory so corrupted is the GDT, or if certain other types of
failures occur.

Thus the only safe way of dealing with this in the soft updates
case is a DC holdup circuit whose sole job is to write a "power
fail" code into NVRAM, which can be read out by the OS.

This means that the first thing an OS should do following
successful recovery, after reading that value, is to write a
non-"power fail" code into the CMOS, so that it can differentiate
a power failure from a soft failure.

PC hardware has no such assistance for OSes, despite Microsoft and
Intel attempting to accelerate the recovery process (maybe they
simply didn't think about the problem in sufficient detail to
realize hardware help is needed).
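
To make the NVRAM handshake concrete, here is a minimal sketch in C.
Everything in it is an assumption for illustration: the cause codes,
the function names, and the single static byte standing in for
battery-backed NVRAM are hypothetical, and nothing here is an
existing FreeBSD interface. It also assumes the hold-up circuit lets
the write in progress complete, so that a recorded power failure is
the one case where a background fsck is acceptable.

```c
#include <stdint.h>

/* Hypothetical cause codes; names and values are illustrative only. */
enum boot_cause {
    CAUSE_RUNNING    = 0, /* stored after recovery: "not a power fail" */
    CAUSE_CLEAN      = 1, /* stored by an orderly shutdown */
    CAUSE_POWER_FAIL = 2  /* stored by the DC hold-up circuit */
};

enum recovery {
    RECOVER_NONE,            /* clean shutdown, nothing to do */
    RECOVER_BACKGROUND_FSCK, /* known power failure, write completed */
    RECOVER_FULL_FSCK        /* panic, triple fault, or unknown cause */
};

/* One byte standing in for battery-backed NVRAM (assumption). */
static uint8_t nvram_byte = CAUSE_CLEAN;

static uint8_t nvram_read(void)       { return nvram_byte; }
static void    nvram_write(uint8_t v) { nvram_byte = v; }

/* Map the stored cause code to a recovery strategy. */
enum recovery decide_recovery(uint8_t cause)
{
    switch (cause) {
    case CAUSE_CLEAN:      return RECOVER_NONE;
    case CAUSE_POWER_FAIL: return RECOVER_BACKGROUND_FSCK;
    default:               return RECOVER_FULL_FSCK;
    }
}

/* At boot: read the cause left behind by the previous run. */
enum recovery boot_check(void)
{
    return decide_recovery(nvram_read());
}

/* After successful recovery: overwrite the cause with a
 * non-"power fail" code, so a later spontaneous reboot is not
 * mistaken for a power failure. */
void recovery_done(void)
{
    nvram_write(CAUSE_RUNNING);
}
```

The point of the default case is the spontaneous reboot: if the
machine resets without the hold-up circuit firing, the byte still
holds the "running" value, and the only safe choice is a full fsck.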

A journaling FS has the same vulnerability to corrupt kernel
buffers that were written out, but not the same vulnerability on
recovery, as it does not need to distinguish reboots due to power
failure from reboots due to other causes (it can be insensitive to
the difference, by being insensitive to single track failures from
a write in progress).

The upshot of this is that a journalling FS can recover using an
abbreviated process, with only software CMOS cause notification,
without needing special hardware additions for "power fail"
differentiation.

> That said, there is some limited
> interest in porting a journaling filesystem to FreeBSD. Several
> people have started, but I don't know if anyone has finished.

Part of the disincentive here is that people keep saying that Soft
Updates is "just as good as journalling" or "solves the same
problem space journalling solves", etc., when it doesn't, and the
technologies are actually complementary.

People should stop claiming this, when it isn't true. If you want
to talk about the overlap, fine; but don't claim that soft updates
or background fsck adequately solves the loss of power problem,
unless you happen to have an IBM drive from 1997.

Personally, I would welcome a journalling FS on FreeBSD. It would
have saved us the cost of a custom power supply that provided DC
holdup and AC fail notification. While it was significantly
cheaper than a UPS, and we were able to make the change because of
soft updates, it would have been even cheaper if we could have
avoided the problem entirely.
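
As a rough illustration of the earlier claim that a journal written
in track-sized extents can be insensitive to a track trashed by a
write in progress, here is a toy replay loop in C. It is a sketch
under stated assumptions, not how any shipping JFS works: the record
layout, the checksum, the tiny TRACK_BYTES and NTRACKS values, and
all function names are hypothetical, and the "disk" is an in-memory
array.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACK_BYTES 16 /* toy size; a real track is far larger */
#define NTRACKS     4

/* One journal record: the full new contents of one home track. */
struct journal_rec {
    uint32_t track;             /* home track this record targets */
    uint32_t sum;               /* checksum over data[] */
    uint8_t  data[TRACK_BYTES]; /* complete new track image */
};

/* Simple multiplicative checksum; a stand-in for a real CRC. */
static uint32_t checksum(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = s * 31u + p[i];
    return s;
}

/* Build the record that is logged before the home track is written. */
void journal_fill(struct journal_rec *r, uint32_t track,
                  const uint8_t *data)
{
    r->track = track;
    memcpy(r->data, data, TRACK_BYTES);
    r->sum = checksum(r->data, TRACK_BYTES);
}

/* Replay the log in order. A valid record rewrites its whole home
 * track, so a home track torn by power loss is simply overwritten.
 * The first invalid record marks a torn journal write; it and
 * everything after it were never committed, so replay stops there.
 * Returns the number of tracks rewritten. */
int journal_replay(const struct journal_rec *log, size_t nrec,
                   uint8_t disk[NTRACKS][TRACK_BYTES])
{
    int replayed = 0;
    for (size_t i = 0; i < nrec; i++) {
        const struct journal_rec *r = &log[i];
        if (r->track >= NTRACKS ||
            checksum(r->data, TRACK_BYTES) != r->sum)
            break;
        memcpy(disk[r->track], r->data, TRACK_BYTES);
        replayed++;
    }
    return replayed;
}
```

Note that replay never needs to know why the machine went down: a
trashed home track and an untouched one are handled the same way,
which is why the cause of the reboot does not matter here.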

The biggest problem, to my mind, that adoption of a journalling
filesystem by FreeBSD keeps hitting its head on is that people
keep wanting to port GPL'ed JFS code to FreeBSD. They do not
understand that it's impossible for a GPL'ed FS to ever be the
default for FreeBSD, because the GPL specifically prohibits use of
other licenses in statically linked code, and the boot file system
must be statically linked into the kernel in order to mount root
and to load kernel modules.

If you want to write a JFS for FreeBSD: fine; but if you are going
to start with third party code, be sure that code is under the BSD
license, so that your FS can ship in a binary and usable form on
the CDROM.

-- Terry