From owner-freebsd-fs@FreeBSD.ORG Mon Jan 24 14:42:39 2011
Date: Mon, 24 Jan 2011 06:42:36 -0800
From: Jeremy Chadwick
To: Olivier Smedts
Cc: freebsd-fs@freebsd.org
Subject: Re: Write cache, is write cache, is write cache?
Message-ID: <20110124144236.GA19500@icarus.home.lan>
References: <1ABA88EDF84B6472579216FE@Octa64>
 <20110122111045.GA59117@icarus.home.lan>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Mon, Jan 24, 2011 at 02:55:51PM +0100, Olivier Smedts wrote:
> 2011/1/22 Jeremy Chadwick:
> > On Sat, Jan 22, 2011 at 10:39:13AM +0000, Karl Pielorz wrote:
> >> I've a small HP server I've been using recently (an NL36). I've got
> >> ZFS set up on it, and it runs quite nicely.
> >>
> >> I was using the server for zeroing some drives the other day, and
> >> noticed that:
> >>
> >>   dd if=/dev/zero of=/dev/ada0 bs=2m
> >>
> >> gives around 12Mbyte/sec throughput when that's all that's running
> >> on the machine.
> >>
> >> Looking in the BIOS, there is an "Enabled drive write cache" option,
> >> which was set to 'No'. Changing it to 'Yes', I now get around
> >> 90-120Mbyte/sec doing the same thing.
> >>
> >> Knowing all the issues with IDE drives and write caches: is there
> >> any way of telling whether this is safe to enable with ZFS? (i.e.
> >> is the option likely to make the drive completely ignore flush
> >> requests, or does it still honour the various 'write through'
> >> options if set on data to be written?)
> >>
> >> I'm presuming dd won't by default be writing the data with the
> >> 'flush' bit set, as it probably doesn't know about it.
> >>
> >> Is there any way of testing this? (say, using some tool that writes
> >> the data with lots of 'cache flush' or 'write through' requests, and
> >> seeing if the performance drops back to nearer the 12Mbyte/sec?)
> >>
> >> I've not enabled the option with the ZFS drives in the machine; I
> >> suppose I could test it.
> >>
> >> Write performance on the unit isn't that bad [it's not stunning],
> >> though with 4 drives in a mirrored set, that probably helps hide
> >> some of the impact this option might have.
> >
> > I'm stating the below with the assumption that you have SATA disks on
> > some form of AHCI-based controller (possibly Intel ICHxx or ESBx
> > on-board), and *not* a hardware RAID controller with cache/RAM of its
> > own:
> >
> > Keep write caching *enabled* in the system BIOS.  ZFS will take care
> > of any underlying "issues" in the case the system abruptly loses
> > power (hard disk cache contents lost), since you're using ZFS
> > mirroring.  The same would apply if you were using raidz{1,2}, but
> > not if you were using ZFS on a single device (no mirroring/raidz).
> > In that scenario, expect data loss; but the same could be said of
> > any non-journalling filesystem.
>
> Could you explain this behavior? I don't see why ZFS would not ask a
> single disk to flush its caches just as it does in a mirror/raidz.
> That's necessary for the ZIL, and to avoid FS corruption.

What seems to be missing from this discussion is that in the case of an
abrupt power failure, the in-kernel caching mechanisms (regardless of
filesystem: ZFS, UFS, etc.) are all lost, in addition to any data
sitting in a hard disk's on-PCB memory cache.  AFAIK, ZFS flushes its
caches to disk at set intervals (when a transaction group is
committed), not continuously.

The term "flush" means many different things.  fsync(2), for example,
behaves differently on UFS than it does on ZFS.  People assume "flush"
means "guarantee the data was written to disk", but ensuring an actual
ATA/SCSI flush command completes **and** has had its data written to
the platters is an entirely different beast (IMO) from "flush kernel
buffers to disk and hope for the best".

In the case of ZFS, why would all data be written to disk every single
time there's a write(2) operation?  Performance-wise that would make
no sense.  So there is always going to be a "window of failure", and
mirroring/raidz can recover from it thanks to the checksumming.

With single disks, all I've seen are read/write errors which can't be
repaired.  "zpool status" will actually show which files were affected
by the problem, though sometimes "zpool scrub" needs to be run before
anything is detected.
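For the single-disk case, this is roughly what finding the damage looks
like (the pool name "tank" is purely an example):

  zpool scrub tank       # read every block in the pool, verify checksums
  zpool status -v tank   # -v lists files affected by unrecoverable errors

On a mirror or raidz pool the same scrub will also *repair* blocks whose
checksums don't match, using the redundant copies; that's exactly the
recovery I described above.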
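As for Karl's original question about how to actually test this: you
don't need the BIOS option at all; you can toggle the drive's write
cache from FreeBSD and re-run the same dd.  A rough sketch, assuming
ada0 on a CAM/ada(4) kernel recent enough to have the
kern.cam.ada.write_cache knob (the exact knob, and whether it applies
without re-attaching the device, may vary; hw.ata.wc is the old ata(4)
equivalent):

  # Show whether the drive supports, and has enabled, write caching
  # (if your camcontrol prints the ATA feature table):
  camcontrol identify ada0 | grep -i 'write cache'

  # Ask ada(4) to disable the drive's write cache.  This may only take
  # effect when the device is (re)attached, so a reboot could be needed:
  sysctl kern.cam.ada.write_cache=0

  # Re-run the zeroing test (on a disk you're wiping anyway!):
  dd if=/dev/zero of=/dev/ada0 bs=2m

If throughput falls back to ~12Mbyte/sec with the cache off, then the
BIOS option is almost certainly just flipping the same WCE bit on the
drive, and the drive should still honour cache-flush commands from ZFS
either way.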
-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP 4BD6C0CB  |