From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 08:15:28 2013
Message-Id: <201303060815.r268FIl5015220@gw.catspoiler.org>
Date: Wed, 6 Mar 2013 00:15:18 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
In-Reply-To: <1644513757.20130306113250@serebryakov.spb.ru>
List-Id: GEOM-specific discussions and implementations

On 6 Mar, Lev Serebryakov wrote:
> Hello, Don.
> You wrote on 6 March 2013, 11:15:13:
>
> DL> Did you have any power failures that took down the system sometime
> DL> before this panic occurred?  By default FreeBSD enables write caching on
>
>   I had another panic caused by my own careless hands...  But I don't
> remember any power failures, as I have a UPS and it works (I check it
> every month).
>
> DL> That means that the drive will immediately acknowledge writes and is
> DL> free to reorder them as it pleases.
>
> DL> When UFS+SU allocates a new inode, it first clears the available bit in
> DL> the bitmap and writes the bitmap block to disk before it writes the new
> DL> inode contents to disk.  When a file is deleted, the inode is zeroed on
> DL> disk before the available bit is set in the bitmap and the bitmap block
> DL> is written.  That means that if an inode is marked as available in the
> DL> bitmap, then it should be zero.  The system panic that you experienced
> DL> happened when the system was attempting to allocate an inode for a new
> DL> file and when it peeked at an inode that was marked as available, it
> DL> found that the inode was non-zero.
>
> DL> What might have happened is that sometime in the past, the system was in
>>[SKIPPED]
> DL> tried to allocate the inode in question.
>
>   This scenario looks plausible, but it raises another question: would
> barriers protect against it?  It doesn't look like it, as a barrier
> write is currently issued only when a new inode BLOCK is allocated.
> And that leads us to my other question: why not mark such vital writes
> with a flag that forces the driver to treat them as "uncacheable" (and
> the same for fsync()-induced writes)?  Again, not BIO_FLUSH, which
> flushes the whole cache, but a per-BIO flag.  I was told by mav@ (the
> ahci driver author) that ATA has such a capability, and I'm sure that
> SCSI/SAS drives have one too.

In the existing implementation, barriers wouldn't help since they aren't
used in nearly enough places.  UFS+SU currently expects the drive to
tell it when the data actually hits the platter so that it can control
the write ordering.
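To make that ordering concrete, here is a rough pseudo-C sketch of the
inode allocation sequence described above.  The helper names are
invented, and the real ffs soft updates code records a dependency
between the buffers instead of blocking synchronously, but the ordering
constraint is the same:

    /*
     * Illustrative sketch only -- invented helpers, not the ffs code.
     * Invariant to preserve: "available in the bitmap" implies
     * "zeroed on disk".
     */
    clear_available_bit(bitmap_bp, ino);    /* mark inode in use */
    start_write(bitmap_bp);
    wait_for_done(bitmap_bp);       /* meaningful only if the drive
                                       reports completion when the data
                                       is on the platter, i.e. write
                                       caching disabled */
    fill_in_inode(inode_bp, ino);
    start_write(inode_bp);          /* inode contents go out only now */

Deletion is the mirror image: zero the inode on disk first, then set
the available bit.  If the drive acknowledges writes while they are
still sitting in its cache, a power failure can leave the two writes
applied in the opposite order, which is exactly the non-zero
"available" inode that triggered the panic.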
In theory, barriers could be used instead, but performance would be
terrible if they got turned into cache flushes.  With NCQ or TCQ, the
drive can have a sizeable number of writes internally queued, and it is
free to reorder them as it pleases even with write caching disabled,
but if write caching is disabled it has to delay the notification of
their completion until the data is on the platters so that UFS+SU can
enforce the proper dependency ordering.

Many years ago, when UFS+SU was fairly new, I experimented with
enabling and disabling write caching on a SCSI drive with TCQ.
Performance was about the same either way.  I always disabled write
caching on my SCSI drives after that because that is what UFS+SU
expects so that it can avoid inconsistencies in the case of power
failure.

I don't know enough about ATA to say whether it supports marking
individual writes as uncacheable.  To support consistency on a drive
with write caching enabled, UFS+SU would have to mark many of its
writes as uncacheable.  Even if this worked, calls to fsync() would
have to be turned into cache flushes to force the file data (assuming
that it was written with a cacheable write) onto the platters,
returning to the userland program only after the data is written.  If
drive write caching is off, then UFS+SU keeps track of the outstanding
writes, and an fsync() call won't return until the drive notifies
UFS+SU that the data blocks for that file are actually written.  In
this case, the fsync() call doesn't need to get propagated down to the
drive.
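To put the two fsync() cases side by side, a minimal sketch (every name
here is invented for illustration; this is not the actual kernel path):

    /* Illustrative sketch of the two fsync() strategies above. */
    static int
    fsync_sketch(struct file_state *fp, int drive_write_cache_on)
    {
            /*
             * Block until the drive has reported completion for every
             * outstanding write belonging to this file.
             */
            wait_for_outstanding_writes(fp);

            if (drive_write_cache_on) {
                    /*
                     * Those completions only meant "accepted into the
                     * drive's cache", so a full cache flush (BIO_FLUSH)
                     * is the only way to guarantee the data is on the
                     * platters before returning to userland.
                     */
                    return (flush_drive_cache(fp));
            }

            /*
             * Write caching off: completion already meant "on the
             * platters", so nothing needs to be sent to the drive.
             */
            return (0);
    }

The per-write flag Lev is asking about would presumably map onto the
FUA (force unit access) bit that both ATA and SCSI define for
individual write commands; with something like that plumbed through as
a per-BIO flag, these writes could bypass the cache without flushing
everything else.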