From: Jeremy Chadwick
To: Derek Kuliński
Cc: freebsd-stable@FreeBSD.org, Clint Olsen
Date: Fri, 26 Sep 2008 23:44:17 -0700
Subject: Re: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY
Message-ID: <20080927064417.GA43638@icarus.home.lan>
In-Reply-To: <765067435.20080926223557@takeda.tk>

On Fri, Sep 26, 2008 at 10:35:57PM -0700, Derek Kuliński wrote:
> Hello Jeremy,
>
> Friday, September 26, 2008, 10:14:13 PM, you wrote:
>
> >> Actually what's the advantage of having fsck run in background if it
> >> isn't capable of fixing things?
> >> Isn't it more dangerous to be it like that? i.e. administrator might
> >> not notice the problem; also filesystem could break even further...
>
> > This question should really be directed at a set of different folks,
> > e.g. actual developers of said stuff (UFS2 and soft updates in
> > specific), because it's opening up a can of worms.
>
> > I believe it has to do with the fact that there is much faith given to
> > UFS2 soft updates -- the ability to background fsck allows the user to
> > boot their system and have it up and working (able to log in, etc.) in a
> > much shorter amount of time[1].  It makes the assumption that "everything
> > will work just fine", which is faulty.
>
> As far as I know (at least ideally, when write caching is disabled)

Re: write caching: wheelies and burn-outs in empty parking lots
detected.

Let's be realistic.  We're talking about ATA and SATA hard disks, hooked
up to on-board controllers -- these are the majority of users.  Those
with ATA/SATA RAID controllers (not on-board RAID either; most/all of
those do not let you disable drive write caching) *might* have a RAID
BIOS menu item for disabling said feature.

FreeBSD atacontrol does not let you toggle such features (although "cap"
will show you whether the feature is available and whether it's enabled).

Users using SCSI will most definitely have the ability to disable said
feature (either via SCSI BIOS or via camcontrol).
But the majority of users are not using SCSI disks, because the majority
of users are not going to spend hundreds of dollars on a controller
followed by hundreds of dollars for a small (~74GB) disk.

Regardless of all of this, end-users should, in no way, shape, or form,
be expected to go to great lengths to disable their disk's write cache.
They will not, I can assure you.

Thus, we must assume: write caching on a disk will be enabled, period.
If a filesystem is engineered with that fact ignored, then the
filesystem is either 1) worthless, or 2) serves a very niche purpose and
should not be the default filesystem.

Do we agree?

> the data should always be consistent, and all fsck supposed to be
> doing is to free unreferenced blocks that were allocated.

fsck does a heck of a lot more than that, and there's no guarantee
that's all fsck is going to do on a UFS2+SU filesystem.  I'm under the
impression it does a lot more than just look for unref'd blocks.

> Wouldn't be possible for background fsck to do that while the
> filesystem is mounted, and if there's some unrepairable error, that
> somehow happen (while in theory it should be impossible) just
> periodically scream on the emergency log level?

The system is already up and the filesystems mounted.  If the error in
question is of such severity that it would impact a user's ability to
reliably use the filesystem, how do you expect constant screaming on
the console to help?  A user won't know what it means; there is already
evidence of this happening (re: mysterious ATA DMA errors which still
cannot be figured out[6]).

IMHO, a dirty filesystem should not be mounted until it's been fully
analysed/scanned by fsck.  So again, people are putting faith into
UFS2+SU despite actual evidence proving that it doesn't handle all
scenarios.

> > It also gives the impression of a journalled filesystem, which UFS2 soft
> > updates are not.  gjournal(8) on the other hand, is, and doesn't require
> > fsck at all[2].
> >
> > I also think this further adds fuel to the "so why are we enabling soft
> > updates by default and using UFS2 as a filesystem again?" fire.  I'm
> > sure someone will respond to this with "So use ZFS and shut up".  *sigh*
>
> I think the reason for using Soft Updates by default is that it was
> a pretty hard thing to implement, and (at least in theory it supposed
> by as reliable as journaling.

The problem here is that when it was created, it was sort of an
"experiment".  Now, when someone installs FreeBSD, UFS2 is the default
filesystem used, and SU are enabled on every filesystem except the root
fs.  Thus, we have now put ourselves into a situation where said feature
***must*** be reliable in all cases.

You're also forgetting a huge focus of SU -- snapshots[1].  However,
there are more than enough facts on the table at this point concluding
that snapshots are causing more problems[7] than previously expected.
And there's further evidence filesystem snapshots shouldn't even be used
in this way[8].

> Also, if I remember correctly, PJD said that gjournal is performing
> much better with small files, while softupdates is faster with big
> ones.

Okay, so now we want to talk about benchmarks.  The benchmarks you're
talking about are in two places[2][3].

The benchmarks pjd@ provided were very basic/simple, which I feel is
good, because the tests were realistic (common tasks people will do).
The benchmarks mckusick@ provided for UFS2+SU were based on SCSI
disks, which is... interesting to say the least.

Bruce Evans responded with some more data[4].  I particularly enjoy
this quote in his benchmark: "I never found the exact cause of the
slower readback ...", followed by (plausible) speculations as to why
that is.

I'm sorry that I sound like such a hard-ass on this matter, but there is
a glaring fact that people seem to be overlooking intentionally:

Filesystems have to be reliable; data integrity is focus #1, and cannot
be sacrificed.
Users and administrators *expect* a filesystem to be reliable.  No one
is going to keep using a filesystem if it has disadvantages which can
result in data loss or "waste of administrative time" (which I believe
is what's occurring here).

Users *will* switch to another operating system that has filesystems
which were not engineered/invented with these features in mind.  Or,
they can switch to another filesystem, assuming the OS offers one which
performs equally well and is guaranteed to be reliable -- and that's
assuming the user wants to spend the time to reformat and reinstall just
to get that.

In the case of "bit rot" (e.g. drive cache going bad silently, bad
cables, or other forms of low-level data corruption), a filesystem is
likely not to be able to cope with this (but see below).

A common rebuttal here would be: "so use UFS2 without soft updates".
Excellent advice!  I might consider it myself!  But the problem is that
we cannot expect users to do that.  Why?  Because the defaults chosen
during sysinstall are to use SU for all filesystems except root.  If SU
is not reliable (or is "reliable in most cases" -- same thing if you ask
me), then it should not be enabled by default.  I think we (FreeBSD)
might have been a bit hasty in deciding to choose that as a default.

Next: a system locking up (or a kernel panic) should result in a dirty
filesystem.  That filesystem should be *fully recoverable* from that
kind of error, with no risk of data loss (but see below).

(There is the obvious case where a file is written to the disk, and the
disk has not completed writing the data from its internal cache to the
disk itself (re: write caching); if power is lost, the disk may not have
finished writing the cache to disk.  In this case, the file is going to
be sparse -- there is absolutely nothing that can be done about this
with any filesystem, including ZFS (to my knowledge).  This situation is
acceptable; nature of the beast.)

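On the background fsck question that started this thread: a minimal
sketch of forcing the check to happen in the foreground at boot.  This
assumes a stock rc.conf(5) on a FreeBSD release of this era; verify the
knob names against /etc/defaults/rc.conf on your own system before
relying on it.

```sh
# /etc/rc.conf (sketch; check rc.conf(5) on your release)

# Do not defer fsck to the background after boot; dirty filesystems
# are checked in the foreground during startup instead.
background_fsck="NO"

# Only relevant while background fsck is still enabled: the number of
# seconds to wait after boot before the background check starts.
# The stock default is 60.
#background_fsck_delay="60"
```

The cost is a longer boot after an unclean shutdown, which is exactly
the trade-off being argued about here.
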

The filesystem should be fully analysed and any errors repaired (either
with user interaction or automatically -- I'm sure it depends on the
kind of error) **before** the filesystem is mounted.

This is where SU gets in the way.  The filesystem is mounted and the
system is brought up + online 60 seconds before the fsck starts.  The
assumption made is that the errors in question will be fully recoverable
by an automatic fsck, which, as this thread proves, is not always the
case.

ZFS is the first filesystem, to my knowledge, which provides 1) a
reliable filesystem, 2) detection of filesystem problems in real-time or
during scrubbing, 3) repair of problems in real-time (assuming raidz1 or
raidz2 are used), and 4) does not need fsck.  This makes ZFS powerful.

"So use ZFS!"  A good piece of advice -- however, I've already had
reports from users that they will not consider ZFS for FreeBSD at this
time.  Why?  Because ZFS on FreeBSD can panic the system easily due to
kmem exhaustion.  Proper tuning can alleviate this problem, but users do
not want to have to "tune" their system to get stability (and I feel
this is a very legitimate argument).

Additionally, FreeBSD doesn't offer ZFS as a filesystem during
installation.  PC-BSD does, AFAIK.  So on FreeBSD, you have to go
through a bunch of rigmarole[5] to get it to work (and doing this
after-the-fact is a real pain in the rear -- believe me, I did it this
weekend.)

So until both of these ZFS-oriented issues can be dealt with, some users
aren't considering it.

This is the reality of the situation.  I don't think what users and
administrators want is unreasonable; they may be rough demands, but
that's how things are in this day and age.

Have I provided enough evidence?  :-)

[1]: http://www.usenix.org/publications/library/proceedings/bsdcon02/mckusick/mckusick_html/index.html
[2]: http://lists.freebsd.org/pipermail/freebsd-current/2006-June/064043.html
[3]: http://www.usenix.org/publications/library/proceedings/usenix2000/general/full_papers/seltzer/seltzer_html/index.html
[4]: http://lists.freebsd.org/pipermail/freebsd-current/2006-June/064166.html
[5]: http://wiki.freebsd.org/JeremyChadwick/FreeBSD_7.x_on_a_ZFS_pool
[6]: http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
[7]: http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
[8]: http://lists.freebsd.org/pipermail/freebsd-stable/2007-January/032070.html

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |