From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 08:15:28 2013
Message-Id: <201303060815.r268FIl5015220@gw.catspoiler.org>
Date: Wed, 6 Mar 2013 00:15:18 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
In-Reply-To: <1644513757.20130306113250@serebryakov.spb.ru>
List-Id: GEOM-specific discussions and implementations

On 6 Mar, Lev Serebryakov wrote:
> Hello, Don.
> You wrote on 6 March 2013, 11:15:13:
>
> DL> Did you have any power failures that took down the system sometime
> DL> before this panic occurred?  By default FreeBSD enables write caching on
>
>   I had another panic caused by my own careless hands...  But I don't
> remember any power failures, as I have a UPS and it works (I check it
> every month).
>
> DL> That means that the drive will immediately acknowledge writes and is
> DL> free to reorder them as it pleases.
>
> DL> When UFS+SU allocates a new inode, it first clears the available bit in
> DL> the bitmap and writes the bitmap block to disk before it writes the new
> DL> inode contents to disk.  When a file is deleted, the inode is zeroed on
> DL> disk before the available bit is set in the bitmap and the bitmap block
> DL> is written.  That means that if an inode is marked as available in the
> DL> bitmap, then it should be zero.  The system panic that you experienced
> DL> happened when the system was attempting to allocate an inode for a new
> DL> file and when it peeked at an inode that was marked as available, it
> DL> found that the inode was non-zero.
>
> DL> What might have happened is that sometime in the past, the system was in
>>[SKIPPED]
> DL> tried to allocate the inode in question.
>
>   This scenario looks plausible, but it raises another question: would
> barriers protect against it?  It doesn't look like it, as a barrier
> write is currently issued only when a new inode BLOCK is allocated.
> And that leads us to my other question: why not mark such vital writes
> with a flag that forces the driver to treat them as "uncacheable" (and
> the same for fsync()-induced writes)?  Again, not BIO_FLUSH, which
> flushes the whole cache, but a per-BIO flag.  I was told by mav@ (the
> ahci driver author) that ATA has such a capability, and I'm sure that
> SCSI/SAS drives have one too.

In the existing implementation, barriers wouldn't help since they aren't
used in nearly enough places.  UFS+SU currently expects the drive to
tell it when the data actually hits the platter so that it can control
the write ordering.
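To make that ordering concrete, here is a rough pseudo-C sketch of the
inode allocation sequence described above.  The helper names are
invented, and the real ffs soft updates code records a dependency
between the buffers instead of blocking synchronously, but the ordering
constraint is the same:

    /*
     * Illustrative sketch only -- invented helpers, not the ffs code.
     * Invariant to preserve: "available in the bitmap" implies
     * "zeroed on disk".
     */
    clear_available_bit(bitmap_bp, ino);    /* mark inode in use */
    start_write(bitmap_bp);
    wait_for_done(bitmap_bp);       /* meaningful only if the drive
                                       reports completion when the data
                                       is on the platter, i.e. write
                                       caching disabled */
    fill_in_inode(inode_bp, ino);
    start_write(inode_bp);          /* inode contents go out only now */

Deletion is the mirror image: zero the inode on disk first, then set
the available bit.  If the drive acknowledges writes while they are
still sitting in its cache, a power failure can leave the two writes
applied in the opposite order, which is exactly the non-zero
"available" inode that triggered the panic.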
In theory, barriers could be used instead, but performance would be
terrible if they got turned into cache flushes.  With NCQ or TCQ, the
drive can have a sizeable number of writes internally queued, and it is
free to reorder them as it pleases even with write caching disabled,
but if write caching is disabled it has to delay the notification of
their completion until the data is on the platters so that UFS+SU can
enforce the proper dependency ordering.

Many years ago, when UFS+SU was fairly new, I experimented with
enabling and disabling write caching on a SCSI drive with TCQ.
Performance was about the same either way.  I always disabled write
caching on my SCSI drives after that because that is what UFS+SU
expects so that it can avoid inconsistencies in the case of power
failure.

I don't know enough about ATA to say whether it supports marking
individual writes as uncacheable.  To support consistency on a drive
with write caching enabled, UFS+SU would have to mark many of its
writes as uncacheable.  Even if this worked, calls to fsync() would
have to be turned into cache flushes to force the file data (assuming
that it was written with a cacheable write) onto the platters,
returning to the userland program only after the data is written.  If
drive write caching is off, then UFS+SU keeps track of the outstanding
writes, and an fsync() call won't return until the drive notifies
UFS+SU that the data blocks for that file are actually written.  In
this case, the fsync() call doesn't need to get propagated down to the
drive.
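To put the two fsync() cases side by side, a minimal sketch (every name
here is invented for illustration; this is not the actual kernel path):

    /* Illustrative sketch of the two fsync() strategies above. */
    static int
    fsync_sketch(struct file_state *fp, int drive_write_cache_on)
    {
            /*
             * Block until the drive has reported completion for every
             * outstanding write belonging to this file.
             */
            wait_for_outstanding_writes(fp);

            if (drive_write_cache_on) {
                    /*
                     * Those completions only meant "accepted into the
                     * drive's cache", so a full cache flush (BIO_FLUSH)
                     * is the only way to guarantee the data is on the
                     * platters before returning to userland.
                     */
                    return (flush_drive_cache(fp));
            }

            /*
             * Write caching off: completion already meant "on the
             * platters", so nothing needs to be sent to the drive.
             */
            return (0);
    }

The per-write flag Lev is asking about would presumably map onto the
FUA (force unit access) bit that both ATA and SCSI define for
individual write commands; with something like that plumbed through as
a per-BIO flag, these writes could bypass the cache without flushing
everything else.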