From owner-freebsd-fs@FreeBSD.ORG  Sun Dec  9 01:17:01 2007
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D29D416A417
	for <freebsd-fs@FreeBSD.org>; Sun,  9 Dec 2007 01:17:01 +0000 (UTC)
	(envelope-from truckman@FreeBSD.org)
Received: from gw.catspoiler.org (adsl-75-1-14-242.dsl.scrm01.sbcglobal.net
	[75.1.14.242]) by mx1.freebsd.org (Postfix) with ESMTP id 92FCF13C448
	for <freebsd-fs@FreeBSD.org>; Sun,  9 Dec 2007 01:17:01 +0000 (UTC)
	(envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id lB91Go19070069;
	Sat, 8 Dec 2007 17:16:54 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <200712090116.lB91Go19070069@gw.catspoiler.org>
Date: Sat, 8 Dec 2007 17:16:50 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: bg@sics.se
In-Reply-To: <20071207143348.17470be3@ibook.sics.se>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8BIT
Cc: freebsd-fs@FreeBSD.org, des@des.no
Subject: Re: FSCK doesn't correct file size when INCORRECT BLOCK COUNT is
 found
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 09 Dec 2007 01:17:01 -0000

On  7 Dec, Bjorn Gronvall wrote:
> On Fri, 07 Dec 2007 13:48:12 +0100
> Dag-Erling Smørgrav <des@des.no> wrote:
> 
> Hi Dag-Erling,
> 
>> Bjorn Gronvall <bg@sics.se> writes:
>> > Filesystems in general and UFS with soft updates in particular rely on
>> > disks providing accurate response to writes. When write caching is
>> > enabled the disk will lie and tell the operating system that the write
>> > has completed successfully, in reality the data is only cached in disk
>> > RAM. When the power disappears the data will be gone forever.
>> 
>> No.  This used to be the case with some cheaper disks which ignored the
>> ATA "flush cache" command to score higher on benchmarks, but I doubt
>> you'll find any disks on the market that still do that (at least from
>> reputable manufacturers).
> 
> Agreed, but the software must also be written to actually make use of
> the more recent "flush cache" feature. I know that the GEOM journal
> can make use of this feature but does UFS with soft updates use it?

UFS with soft updates does not use the "flush cache" feature.  It
assumes that once the drive says that the data has been written, the
data is actually on the platter.  If the drive does write caching,
this is an invalid assumption because the drive will indicate that data
has been written as soon as it gets transferred to the drive's cache.
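
To make the failure mode concrete, here is a toy model of a drive with a
volatile write cache.  It is only a sketch: the CachingDrive class and
its method names are made up for illustration and do not correspond to
any FreeBSD or ATA interface.

```python
class CachingDrive:
    """Toy model of a drive with a volatile write cache (hypothetical)."""

    def __init__(self, write_cache=True):
        self.write_cache = write_cache
        self.cache = {}    # sector -> data held only in drive RAM
        self.platter = {}  # sector -> data that survives power loss

    def write(self, sector, data):
        """Return True when the drive reports the write as complete."""
        if self.write_cache:
            self.cache[sector] = data    # acknowledged before it is durable
        else:
            self.platter[sector] = data  # acknowledged only once durable
        return True

    def flush_cache(self):
        """Model of a 'flush cache' command: push cached data to the platter."""
        self.platter.update(self.cache)
        self.cache.clear()

    def power_fail(self):
        """Everything still in drive RAM is lost."""
        self.cache.clear()


def survives_power_loss(write_cache, flush):
    """Write one sector, optionally flush, cut power, return what is on disk."""
    drive = CachingDrive(write_cache)
    acknowledged = drive.write(100, "inode")
    assert acknowledged  # the drive always claims success
    if flush:
        drive.flush_cache()
    drive.power_fail()
    return drive.platter
```

With write caching on and no flush, the acknowledged write is gone after
the power failure; with caching off, or with an explicit flush, it is on
the platter.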

Disabling write caching fixes this problem, but badly hurts the
performance of ATA drives, because it forces each I/O operation to be
done sequentially.  This is much less of an issue with SCSI drives,
because they have tagged command queuing (which is supported by
FreeBSD).  It allows multiple simultaneous I/O requests to be queued
to the drive, which is free to re-order them more optimally and to
report their status in whatever order the operations are completed.
Modern SATA drives have something similar, Native Command Queuing (NCQ),
but it is not yet supported by FreeBSD.
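
As a rough illustration of why queuing helps, the sketch below compares
total head movement when requests are served strictly in arrival order
against a simple sweep ("elevator") reordering.  The track numbers and
the reordering policy are assumptions chosen for the example; real drive
firmware also takes rotational position into account and may use a
different policy entirely.

```python
def head_travel(start, requests):
    """Total head movement (in tracks) when serving requests in the given order."""
    travel, pos = 0, start
    for track in requests:
        travel += abs(track - pos)
        pos = track
    return travel


def reorder_elevator(start, requests):
    """One simple policy: sweep upward from the head position, then back down."""
    up = sorted(t for t in requests if t >= start)
    down = sorted((t for t in requests if t < start), reverse=True)
    return up + down


queue = [900, 10, 850, 40, 800]   # hypothetical target tracks, in arrival order
in_order = head_travel(500, queue)
reordered = head_travel(500, reorder_elevator(500, queue))
print(in_order, reordered)        # 3700 1290
```

Without queuing the drive only ever sees one request at a time, so it
has no opportunity to do this kind of reordering.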

I'm also under the impression that modern ATA drives boost their
capacity by always rewriting a full track so that they can eliminate the
overhead of sector headers and trailers.  This hurts performance when
write caching is disabled, because even a single sector write requires
the full track to be rewritten, which could require multiple revolutions
of the spindle (a full track read if the track has not been cached, a
full track write, and possibly a partial revolution to get to the
correct location to start the write), and multiple writes to the same
track cannot be combined.
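
A back-of-the-envelope calculation shows the scale of the penalty.  The
7200 RPM spindle speed and the half-revolution positioning delay below
are assumptions for the sake of the example, not measurements of any
particular drive.

```python
def revolution_ms(rpm):
    """Time for one spindle revolution, in milliseconds."""
    return 60_000.0 / rpm


def track_rewrite_ms(rpm, track_cached, positioning_fraction=0.5):
    """Estimated time to rewrite a full track for a single-sector write.

    positioning_fraction is the assumed partial revolution spent waiting
    for the correct location to start the write.
    """
    revolutions = 1.0                # the full track write itself
    if not track_cached:
        revolutions += 1.0           # full track read beforehand
    revolutions += positioning_fraction
    return revolutions * revolution_ms(rpm)


uncached = track_rewrite_ms(7200, track_cached=False)  # about 20.8 ms
cached = track_rewrite_ms(7200, track_cached=True)     # about 12.5 ms
```

Under these assumptions a single-sector write costs roughly 1.5 to 2.5
revolutions, before any seek time is counted.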

Also, unless the drive can complete the entire track rewrite after it
detects power starting to fail, a power failure could corrupt data on
the same track as a sector being rewritten.  This data might be totally
unrelated to the sector(s) being modified and would be expected by the
file system to be stable.  The checksumming done by ZFS in combination
with RAID would help with this, but a power failure could still
potentially wipe out all the redundant copies of the data.
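
A minimal sketch of the checksum-plus-redundancy idea follows.  It is
not how ZFS is implemented: CRC-32 stands in for the real checksum, the
checksum is kept next to the data for simplicity, and the store and
read_mirrored helpers are hypothetical names.

```python
import zlib


def store(data):
    """Keep a block together with a checksum of its contents."""
    return {"data": data, "sum": zlib.crc32(data)}


def read_mirrored(copies):
    """Return the first copy whose checksum still matches, or None."""
    for copy in copies:
        if zlib.crc32(copy["data"]) == copy["sum"]:
            return copy["data"]
    return None


mirror = [store(b"stable data"), store(b"stable data")]

mirror[0]["data"] = b"stable dat\x00"   # one copy damaged by a torn track write
one_bad = read_mirrored(mirror)         # the intact copy is still found

mirror[1]["data"] = b"Xtable data"      # the same power failure hits the other copy
both_bad = read_mirrored(mirror)        # nothing left to recover from
```

The checksum lets the damage be detected and the good copy be chosen,
but once every redundant copy is damaged there is nothing to repair
from.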