From owner-freebsd-fs@FreeBSD.ORG Wed Mar  6 06:43:20 2013
Message-Id: <201303060643.r266hBWU015053@gw.catspoiler.org>
Date: Tue, 5 Mar 2013 22:43:11 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: Panic in ffs_valloc (Was: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!)
To: lev@FreeBSD.org
Cc: mckusick@mckusick.com, freebsd-fs@FreeBSD.org
In-Reply-To: <1352492388.20130302002244@serebryakov.spb.ru>

On 2 Mar, Lev Serebryakov wrote:
> Hello, Kirk.
> You wrote on March 1, 2013, at 22:00:51:
>
>>> As far as I understand, if this theory is right (file system
>>> corruption which was left unnoticed by the "standard" fsck), it is a
>>> bug in FFS SU+J too, as it should not be corrupted by reordered
>>> writes (if writes are properly reported as completed even if they
>>> were reordered).
>
> KM> If the bitmaps are left corrupted (in particular if blocks are marked
> KM> free that are actually in use), then that panic can occur. Such a state
> KM> should never be possible when running with SU even if you have crashed
> KM> multiple times and restarted without running fsck.
>
> I run fsck every time (ok, every half-year) the server crashes due to
> my awkward experiments on the live system, but I run it as it runs by
> default: using the journal, after the upgrade to 9-STABLE, not as a
> full old-fashioned run.
>
> KM> To reduce the number of possible points of failure, I suggest that
> KM> you try running with just SU (i.e., turn off the SU+J journalling).
> KM> You can do this with `tunefs -j disable /dev/fsdisk'. This will
> KM> turn off journalling, but not soft updates. You can verify this
> KM> by then running `tunefs -p /dev/fsdisk' to ensure that soft updates
> KM> are still enabled.
>
> And wait another half a year :)
>
> I'm trying to reproduce this situation in a VM (VirtualBox with
> virtual HDDs), but no luck (yet?).
>
> KM> I will MFC 246876 and 246877 once they have been in head long enough
> KM> to have confidence that they will not cause trouble. That means at
> KM> least a month (well more than the two weeks they have presently been
> KM> there).
>
> KM> Note these changes only pass the barrier request down to the GEOM
> KM> layer. I don't know whether it actually makes it to the drive layer
> KM> and if it does whether the drive layer actually implements it. My
> KM> goal was to get the ball rolling.
>
> I have mixed feelings about these barriers. IMHO, all writes to UFS
> (FFS) could and should be divided into two classes: data writes and
> metadata writes (including the journal, as FFS doesn't have data
> journaling). IMHO (it is the last time I type these 4 letters, but
> please add them mentally before and after each of my sentences, as I'm
> not an FS expert of any grade), data writes could be done as best
> effort until fsync() is called (or the file is opened with the
> appropriate flag, which is equivalent to an automatic fsync() after
> each write). They could be delayed, reordered, etc. But metadata
> should have some strong guarantees (and fsync()'ed data too, of
> course). Such a division could allow the best possible performance and
> consistent FS metadata (maybe not consistent user data -- but every
> application which needs strong guarantees, like an RDBMS, uses
> fsync() anyway).

When growing a file, the data *must* be written before writing the
block pointer that points to it.  If this ordering isn't obeyed, then a
system crash that occurs between the block pointer write and the data
write would result in the file containing whatever garbage was in the
data block.  That garbage could be the confidential contents of some
other user's previously deleted file.
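The same ordering rule shows up at the application level, too: a
program that wants a durable on-disk structure has to fsync() the data
before it publishes a pointer to it.  A minimal userland sketch of that
idea (the file names and the trivial "index" layout are invented purely
for illustration):

    /*
     * Write-the-data-first sketch: flush the payload to stable storage
     * before publishing the "pointer" (here, an offset recorded in an
     * index file) that refers to it.  Names and layout are made up for
     * this example only.
     */
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char data[] = "new record payload";
            off_t where;
            int dfd, ifd;

            dfd = open("table.dat", O_WRONLY | O_APPEND | O_CREAT, 0644);
            ifd = open("table.idx", O_WRONLY | O_CREAT, 0644);
            if (dfd < 0 || ifd < 0)
                    err(1, "open");

            /* 1. Append the data block and remember where it landed. */
            where = lseek(dfd, 0, SEEK_END);
            if (write(dfd, data, sizeof(data)) != (ssize_t)sizeof(data))
                    err(1, "write data");

            /* 2. Force the data to stable storage... */
            if (fsync(dfd) != 0)
                    err(1, "fsync data");

            /* 3. ...and only then record the pointer to it. */
            if (write(ifd, &where, sizeof(where)) != (ssize_t)sizeof(where) ||
                fsync(ifd) != 0)
                    err(1, "update index");

            return (0);
    }

Swap steps 2 and 3 and a crash in between leaves the index pointing at
a block that may contain stale garbage -- the userland analogue of the
stale block pointer problem above.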
> Now you add a "BARRIER" write. It looks too strong to use often. It
> will force writing of ALL data from the caches, even if your intention
> is to write only 2 or 3 blocks of metadata. It could solve the
> problems with FS metadata, but it will degrade performance, especially
> under multithreaded load. An update of the inode map for the creation
> of a 0-byte file by one process (protected with a barrier) will flush
> the whole data cache (maybe hundreds of megabytes) of another one.

I'm not a fan of barriers for that reason.  My understanding is that
older versions of the Linux kernel were pretty lax about write ordering
in the ext3 filesystem and that recent kernels enable barriers for ext3
and ext4.  I've heard a lot of complaints about MySQL performance
tanking with these newer kernels, and the commonly suggested workaround
is to disable barriers and make sure the system is on a UPS (though
I've had plenty of UPS failures over the years).

Fortunately, in the case of 246877, the barrier operation should happen
infrequently, especially after the filesystem has been in use for a
while.

> It is better than nothing, but it is not the best solution. Every
> write should be marked as "critical" or "loose", and critical-marked
> buffers (BIOs) must be written ASAP and before all other _critical_
> BIOs (not before all BIOs issued after them, with or without the
> flag). So, a barrier should affect only other barriers (ordered
> writes). The default "loose" semantics (for data) would be exactly
> what we have now.
>
> It is very hard to implement the contract "it only ensures that
> buffers written before that buffer will get to the media before any
> buffers written after that buffer" in any way other than a full flush,
> which, as I stated above, will hurt performance in such cases as
> efficient RAID5-like implementations, which gain a lot from combining
> writes by spatial (not time) locality.
>
> And for a full flush (which is needed sometimes, of course) we already
> have the BIO_FLUSH command.
>
> Anyway, I'll support the new semantics in geom_raid5 ASAP. But,
> unfortunately, for now it can only be supported as a simple write
> followed by BIO_FLUSH -- not very efficient :(
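For what it's worth, that fallback is only a few lines in a GEOM class.
A rough kernel-side sketch, assuming a consumer `cp' that is already
open with write access (the helper name emulate_barrier_write() is made
up for illustration, and real code needs more error handling and class
glue than this):

    /*
     * "Plain write followed by BIO_FLUSH" fallback for a GEOM class.
     * Assumes cp is a g_consumer opened with write access and that we
     * are in a context that may sleep; the helper name is illustrative.
     */
    #include <sys/param.h>
    #include <sys/bio.h>
    #include <geom/geom.h>

    static int
    emulate_barrier_write(struct g_consumer *cp, off_t offset, void *buf,
        off_t length)
    {
            int error;

            /* Ordinary synchronous write of the metadata block(s). */
            error = g_write_data(cp, offset, buf, length);
            if (error != 0)
                    return (error);

            /*
             * Flush the whole device cache.  This gives the ordering
             * guarantee, but only by flushing everything -- exactly the
             * inefficiency complained about above.
             */
            return (g_io_flush(cp));
    }

Once the barrier request that Kirk mentions is honored all the way down
the stack, the separate full flush should no longer be needed.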