From: Lev Serebryakov <lev@FreeBSD.org>
To: Kirk McKusick
Cc: freebsd-fs@freebsd.org, Don Lewis
Date: Sat, 2 Mar 2013 00:22:44 +0400
Subject: Re: Panic in ffs_valloc (Was: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!)

Hello, Kirk.
You wrote on 1 March 2013 at 22:00:51:

>> As far as I understand, if this theory is right (file system
>> corruption which was left unnoticed by the "standard" fsck), it is a
>> bug in FFS SU+J too, as it should not be corrupted by reordered
>> writes (if writes are properly reported as completed even when they
>> were reordered).

KM> If the bitmaps are left corrupted (in particular if blocks are marked
KM> free that are actually in use), then that panic can occur. Such a state
KM> should never be possible when running with SU even if you have crashed
KM> multiple times and restarted without running fsck.

 I run fsck every time (OK, every half-year) the server crashes due to
my awkward experiments on the live system, but I run it as it runs by
default since the upgrade to 9-STABLE: with the journal, not as a full
old-fashioned check.

KM> To reduce the number of possible points of failure, I suggest that
KM> you try running with just SU (i.e., turn off the SU+J journalling).
KM> You can do this with `tunefs -j disable /dev/fsdisk'. This will
KM> turn off journalling, but not soft updates. You can verify this
KM> by then running `tunefs -p /dev/fsdisk' to ensure that soft updates
KM> are still enabled.

 And wait another half a year :) I'm trying to reproduce this situation
in a VM (VirtualBox with virtual HDDs), but no luck (yet?).

KM> I will MFC 246876 and 246877 once they have been in head long enough
KM> to have confidence that they will not cause trouble. That means at
KM> least a month (well more than the two weeks they have presently been
KM> there).

KM> Note these changes only pass the barrier request down to the GEOM
KM> layer. I don't know whether it actually makes it to the drive layer
KM> and if it does whether the drive layer actually implements it. My
KM> goal was to get the ball rolling.

 I have mixed feelings about these barriers.
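 To make concrete what "passing the barrier request down to the GEOM
layer" would mean for a class: a minimal sketch only, assuming the
barrier arrives as the BIO_ORDERED flag on an otherwise ordinary write;
the g5_start() name and the pass-through structure are hypothetical,
not code from any real class:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bio.h>
    #include <geom/geom.h>

    /*
     * Hypothetical start routine of a transforming GEOM class.  The only
     * new obligation of a barrier-aware class is to notice the flag and
     * keep it on the request it sends further down (or to emulate it, as
     * discussed below).
     */
    static void
    g5_start(struct bio *bp)
    {
            struct g_consumer *cp = LIST_FIRST(&bp->bio_to->geom->consumer);
            struct bio *cbp;

            cbp = g_clone_bio(bp);
            if (cbp == NULL) {
                    g_io_deliver(bp, ENOMEM);
                    return;
            }
            cbp->bio_done = g_std_done;
            if ((bp->bio_flags & BIO_ORDERED) != 0)
                    cbp->bio_flags |= BIO_ORDERED;  /* keep the barrier visible below */
            g_io_request(cbp, cp);
    }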
 IMHO, all writes to UFS (FFS) could and should be divided into two
classes: data writes and metadata writes (including the journal, as FFS
doesn't have data journaling). IMHO (it is the last time I type these 4
letters, but please add them before and after each of my sentences as
you read this, as I'm not an FS expert by any measure), data writes
could be done as best effort until fsync() is called (or the file is
opened with the appropriate flag, which is equivalent to an automatic
fsync() after each write). They could be delayed, reordered, etc. But
metadata should have some strong guarantees (and fsync()'ed data too,
of course). Such a division could allow the best possible performance
and consistent FS metadata (maybe not consistent user data -- but every
application which needs strong guarantees, like an RDBMS, uses fsync()
anyway).

 Now you add a "BARRIER" write. It looks too strong to use often. It
will force writing of ALL data from the caches, even if your intention
is to write only 2 or 3 blocks of metadata. It could solve the problems
with FS metadata, but it will degrade performance, especially under
multithreaded load. An inode-map update for creating a zero-byte file
by one process (protected with a barrier) will flush the whole data
cache (maybe hundreds of megabytes) of another one. It is better than
nothing, but it is not the best solution.

 Every write should be marked as "critical" or "loose", and
critical-marked buffers (BIOs) must be written ASAP and ordered before
all later _critical_ BIOs (not before every later BIO regardless of
flag). So a barrier should affect only other barriers (ordered writes).
The default, "loose" semantics (for data) would be exactly what we have
now. It is very hard to implement the contract "it only ensures that
buffers written before that buffer will get to the media before any
buffers written after that buffer" in any way other than a full flush,
which, as I stated above, will hurt performance in such cases as
efficient RAID5-like implementations, which gain a lot from combining
writes together by their spatial (not temporal) locality. And for a
full flush (which is needed sometimes, of course) we already have the
BIO_FLUSH command.

 Anyway, I'll support the new semantics in geom_raid5 ASAP. But,
unfortunately, right now it can only be supported as a simple write
followed by BIO_FLUSH -- not very effective :( (A rough sketch of that
emulation follows below.)

-- 
// Black Lion AKA Lev Serebryakov
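P.S. Here is roughly what the write-followed-by-BIO_FLUSH emulation
would look like. This is only a sketch under the assumption that the
barrier is delivered as the BIO_ORDERED flag; the g5_* names are
hypothetical, not actual geom_raid5 code:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bio.h>
    #include <geom/geom.h>

    /* Throw the flush bio away once the disk has acknowledged it. */
    static void
    g5_flush_done(struct bio *bp)
    {
            g_destroy_bio(bp);
    }

    /*
     * Emulate an ordered (barrier) write when the layers below give no
     * real ordering guarantee: pass the write down, then issue BIO_FLUSH.
     * A real implementation would wait for the write to complete before
     * sending the flush, and for the flush before completing the original
     * request; the flush also drains the whole write cache, which is
     * exactly why this is not very effective.
     */
    static void
    g5_write_ordered(struct g_consumer *cp, struct bio *bp)
    {
            struct bio *cbp, *fbp;

            cbp = g_clone_bio(bp);          /* the write itself */
            if (cbp == NULL) {
                    g_io_deliver(bp, ENOMEM);
                    return;
            }
            cbp->bio_done = g_std_done;
            g_io_request(cbp, cp);

            if ((bp->bio_flags & BIO_ORDERED) != 0) {
                    fbp = g_new_bio();      /* full cache flush behind it */
                    if (fbp != NULL) {
                            fbp->bio_cmd = BIO_FLUSH;
                            fbp->bio_offset = 0;
                            fbp->bio_length = 0;
                            fbp->bio_data = NULL;
                            fbp->bio_done = g5_flush_done;
                            g_io_request(fbp, cp);
                    }
            }
    }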