From owner-freebsd-fs@FreeBSD.ORG Mon Jan 24 14:42:39 2011
Date: Mon, 24 Jan 2011 06:42:36 -0800
From: Jeremy Chadwick
To: Olivier Smedts
Cc: freebsd-fs@freebsd.org
Subject: Re: Write cache, is write cache, is write cache?
Message-ID: <20110124144236.GA19500@icarus.home.lan>
References: <1ABA88EDF84B6472579216FE@Octa64>
 <20110122111045.GA59117@icarus.home.lan>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Mon, Jan 24, 2011 at 02:55:51PM +0100, Olivier Smedts wrote:
> 2011/1/22 Jeremy Chadwick:
> > On Sat, Jan 22, 2011 at 10:39:13AM +0000, Karl Pielorz wrote:
> >> I've a small HP server I've been using recently (an NL36). I've got
> >> ZFS set up on it, and it runs quite nicely.
> >>
> >> I was using the server for zeroing some drives the other day, and
> >> noticed that:
> >>
> >>   dd if=/dev/zero of=/dev/ada0 bs=2m
> >>
> >> gives around 12Mbyte/sec throughput when that's all that's running
> >> on the machine.
> >>
> >> Looking in the BIOS, there is an "Enabled drive write cache" option,
> >> which was set to 'No'. Changing it to 'Yes', I now get around
> >> 90-120Mbyte/sec doing the same thing.
> >>
> >> Knowing all the issues with IDE drives and write caches: is there
> >> any way of telling whether this is safe to enable with ZFS? (i.e.
> >> is the option likely to make the drive completely ignore flush
> >> requests, or does it still honour the various 'write through'
> >> options if set on data to be written?)
> >>
> >> I'm presuming dd won't by default be writing the data with the
> >> 'flush' bit set, as it probably doesn't know about it.
> >>
> >> Is there any way of testing this? (say, using some tool that writes
> >> the data with lots of 'cache flush' or 'write through' requests, and
> >> seeing if the performance drops back to nearer the 12Mbyte/sec?)
> >>
> >> I've not enabled the option with the ZFS drives in the machine; I
> >> suppose I could test it.
> >>
> >> Write performance on the unit isn't that bad [it's not stunning],
> >> though with 4 drives in a mirrored set, that probably helps hide
> >> some of the impact this option might have.
> >
> > I'm stating the below with the assumption that you have SATA disks on
> > some form of AHCI-based controller (possibly Intel ICHxx or ESBx
> > on-board), and *not* a hardware RAID controller with cache/RAM of its
> > own:
> >
> > Keep write caching *enabled* in the system BIOS.  ZFS will take care
> > of any underlying "issues" in the case the system abruptly loses
> > power (hard disk cache contents lost), since you're using ZFS
> > mirroring.  The same would apply if you were using raidz{1,2}, but
> > not if you were using ZFS on a single device (no mirroring/raidz).
> > In that scenario, expect data loss; but the same could be said of
> > any non-journalling filesystem.
>
> Could you explain this behavior? I don't see why ZFS would not ask a
> single disk to flush its caches just as it does in a mirror/raidz.
> That's necessary for the ZIL, and to avoid FS corruption.

What seems to be missing from this discussion is that in the case of an
abrupt power failure, the in-kernel caching mechanisms (regardless of
filesystem: ZFS, UFS, etc.) are all lost, in addition to any data
sitting in a hard disk's on-PCB memory cache.  AFAIK, ZFS flushes its
caches to disk at set intervals (when a transaction group is
committed), not continuously.

The term "flush" means many different things.  fsync(2), for example,
behaves differently on UFS than it does on ZFS.  People assume "flush"
means "guarantee the data was written to disk", but ensuring an actual
ATA/SCSI flush command completes **and** has had its data written to
the platters is an entirely different beast (IMO) from "flush kernel
buffers to disk and hope for the best".

In the case of ZFS, why would all data be written to disk every single
time there's a write(2) operation?  Performance-wise that would make
no sense.  So there is always going to be a "window of failure", and
mirroring/raidz can recover from it thanks to the checksumming.

With single disks, all I've seen are read/write errors which can't be
repaired.  "zpool status" will actually show which files were affected
by the problem, though sometimes "zpool scrub" needs to be run before
anything is detected.
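For the single-disk case, this is roughly what finding the damage looks
like (the pool name "tank" is purely an example):

  zpool scrub tank       # read every block in the pool, verify checksums
  zpool status -v tank   # -v lists files affected by unrecoverable errors

On a mirror or raidz pool the same scrub will also *repair* blocks whose
checksums don't match, using the redundant copies; that's exactly the
recovery I described above.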
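As for Karl's original question about how to actually test this: you
don't need the BIOS option at all; you can toggle the drive's write
cache from FreeBSD and re-run the same dd.  A rough sketch, assuming
ada0 on a CAM/ada(4) kernel recent enough to have the
kern.cam.ada.write_cache knob (the exact knob, and whether it applies
without re-attaching the device, may vary; hw.ata.wc is the old ata(4)
equivalent):

  # Show whether the drive supports, and has enabled, write caching
  # (if your camcontrol prints the ATA feature table):
  camcontrol identify ada0 | grep -i 'write cache'

  # Ask ada(4) to disable the drive's write cache.  This may only take
  # effect when the device is (re)attached, so a reboot could be needed:
  sysctl kern.cam.ada.write_cache=0

  # Re-run the zeroing test (on a disk you're wiping anyway!):
  dd if=/dev/zero of=/dev/ada0 bs=2m

If throughput falls back to ~12Mbyte/sec with the cache off, then the
BIOS option is almost certainly just flipping the same WCE bit on the
drive, and the drive should still honour cache-flush commands from ZFS
either way.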
-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP 4BD6C0CB  |