From owner-freebsd-fs@FreeBSD.ORG  Thu May 22 14:26:18 2014
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id C5C48F83
 for <freebsd-fs@freebsd.org>; Thu, 22 May 2014 14:26:18 +0000 (UTC)
Received: from smtp102-5.vfemail.net (eight.vfemail.net [108.76.175.8])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 545172764
 for <freebsd-fs@freebsd.org>; Thu, 22 May 2014 14:26:18 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=simple; d=vfemail.net; h=date
 :message-id:from:to:subject:references:in-reply-to:content-type
 :mime-version:content-transfer-encoding; s=default; bh=c+0AWxkdR
 H/uCFiC2SfBD//pqwsJef4AXmnO3uckdfM=; b=X3oxgTuZRUMqsP3d+06xMcQrE
 19IeLU3z6DPuP+Dbk70UyzMCbaNKiNBOiMid30ylkD4bee+8kdNqcI77a+ymSvzC
 5EreyMtfWLqrc9A8r8c8JlYLWllu/i5s07YjJM6/43P5+y5VmQz3cxbGgn2ideKw
 ic1QylGVyZpnU2bIJk=
Received: (qmail 23387 invoked by uid 89); 22 May 2014 14:19:33 -0000
Received: by simscan 1.4.0 ppid: 23380, pid: 23383, t: 0.0893s scanners:none
Received: from unknown (HELO www111)
 (cmlja0BoYXZva21vbi5jb20=@MTcyLjE2LjEwMC45Mw==)
 by 172.16.100.62 with ESMTPA; 22 May 2014 14:19:33 -0000
Received: from rrcs-98-103-53-237.central.biz.rr.com
 (rrcs-98-103-53-237.central.biz.rr.com [98.103.53.237]) by www.vfemail.net
 (Horde Framework) with HTTP; Thu, 22 May 2014 09:19:32 -0500
Date: Thu, 22 May 2014 09:19:32 -0500
Message-ID: <20140522091932.Horde.hsT5LUjnShIYq2YrtCVdnA1@www.vfemail.net>
From: Rick Romero <rick@havokmon.com>
To: freebsd-fs@freebsd.org
Subject: Re: Turn off RAID read and write caching with ZFS? [SB QUAR: Thu
 May 22 08:33:59 2014]
References: <719056985.20140522033824@supranet.net>
 <537DF2F3.10604@denninger.net>
 <alpine.GSO.2.01.1405220825290.1735@freddy.simplesystems.org>
 <537E0301.4010509@denninger.net>
In-Reply-To: <537E0301.4010509@denninger.net>
User-Agent: Internet Messaging Program (IMP) H5 (6.1.7)
X-VFEmail-Originating-IP: OTguMTAzLjUzLjIzNw==
X-VFEmail-AntiSpam: Notify admin@vfemail.net of any spam, and include
 VFEmail headers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed; DelSp=Yes
Content-Transfer-Encoding: 8bit
Content-Disposition: inline
Content-Description: Plaintext Message
X-Content-Filtered-By: Mailman/MimeDel 2.1.18
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 22 May 2014 14:26:18 -0000

  Quoting Karl Denninger <karl@denninger.net>:

> On 5/22/2014 8:33 AM, Bob Friesenhahn wrote:
>> On Thu, 22 May 2014, Karl Denninger wrote:
>>> Write-caching is very evil in a ZFS world, because ZFS checksums each
>>> block. If the filesystem gets back an "OK" for a block not actually on
>>> the disk ZFS will presume the checksum is ok.  If that assumption
>>> proves to be false down the road you're going to have a very bad day.
>>
>> I don't agree with the above statement.  Non-volatile write caching is
>> very beneficial for zfs since it allows transactions (particularly
>> synchronous zil writes) to complete much quicker. This is important for
>> NFS servers and for databases.  What is important is that the cache
>> either be non-volatile (e.g. battery-backed RAM) or absolutely observe
>> zfs's cache flush requests.  Volatile caches which don't obey cache
>> flush requests can result in a corrupted pool on power loss, system
>> panic, or controller failure.
>>
>> Some plug-in RAID cards have poorly performing firmware which causes
>> problems.  Only testing or experience from other users can help
>> identify such cards so that they can be avoided or set to their least
>> harmful configuration.
>
> Let's think this one through.
>
> You have said disk on said controller.
>
> It has a battery-backed RAM cache and JBOD drives on it.
>
> Your database says "Write/Commit" and the controller does, to cache, and
> says "ok, done."  The data is now in the battery-backed cache. Let's
> further assume the cache is ECC-corrected and we'll accept the risk of
> an undetected ECC failure (very, very long odds on that one so that
> seems reasonable.)
>
> Some time passes and other I/O takes place without incident.
>
> Now the *DRIVE* returns an unrecoverable data error during the actual
> write to spinning rust when the controller (eventually) flushes its
> cache.

Technically, you have the same problem with the drive's own on-board
cache. But disabling the write cache on every device just to satisfy ZFS
makes it ungodly slow, IMHO.

Also, IMHO, your scenario is a bit overstated. In this case, the drive
should mark the sector as bad and write its cached data to a new sector,
instead of going down the path of having the controller disable the entire
disk as you described.
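
As a toy illustration of the two outcomes (a sketch only, not actual ZFS or
controller firmware behavior; the Disk and WriteBackCache classes, the
sector numbers, and CRC32 standing in for the ZFS block checksum are all
made up for the example), compare a drive that remaps a bad sector with one
that simply fails the write after the cache has already said "ok, done":

```python
import zlib


class Disk:
    """Toy disk: sector number -> bytes, with some sectors marked bad."""

    def __init__(self, bad_sectors=(), remap=True):
        self.sectors = {}
        self.bad = set(bad_sectors)
        self.remap = remap          # hypothetical "drive remaps bad sectors" switch
        self.remapped = {}          # bad sector -> spare sector
        self.next_spare = 10_000    # arbitrary start of the spare area

    def write(self, lba, data):
        if lba in self.bad:
            if not self.remap:
                return False        # unrecoverable write error
            # Reallocate the bad sector to a spare one.
            self.remapped[lba] = self.next_spare
            self.next_spare += 1
            self.bad.discard(lba)
        self.sectors[self.remapped.get(lba, lba)] = data
        return True

    def read(self, lba):
        return self.sectors.get(self.remapped.get(lba, lba))


class WriteBackCache:
    """Toy controller cache that acknowledges writes before they hit disk."""

    def __init__(self, disk):
        self.disk = disk
        self.pending = {}

    def write(self, lba, data):
        self.pending[lba] = data
        return True                 # "ok, done" while data is only in cache

    def flush(self):
        failed = [lba for lba, data in self.pending.items()
                  if not self.disk.write(lba, data)]
        # Assumption: the cache drops the data even if the write failed.
        self.pending.clear()
        return failed


def run_scenario(remap):
    disk = Disk(bad_sectors={7}, remap=remap)
    cache = WriteBackCache(disk)
    payload = b"committed row"
    checksum = zlib.crc32(payload)  # stand-in for the filesystem's checksum
    acked = cache.write(7, payload)
    failed = cache.flush()
    on_disk = disk.read(7)
    intact = on_disk is not None and zlib.crc32(on_disk) == checksum
    return acked, failed, intact


if __name__ == "__main__":
    print("drive remaps sector:", run_scenario(remap=True))
    print("drive fails write:  ", run_scenario(remap=False))
```

In both runs the write is acknowledged up front. Only in the second run
does the flush report a failed sector, and the later read no longer matches
the checksum that was computed when the write was acknowledged.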

And if the controller does disable the entire drive, that is actually
safer under a controller-based RAID scenario, because the controller can
write its cached data to a different drive if that entire drive fails. When
run as cached JBOD, then sure, you could be hosed if the entire drive fails
and it's not caught before a write.

So basically, IMHO again, if you run write cache on the controller and have
BBC + UPS, then use controller-based RAID.  Don't disable the drive cache
in either case, unless you want complete ZFS protection at the cost of
performance.

That said, I have had ZFS detect a power supply issue by repeatedly
disabling drives, so I personally don't recommend controller-based RAID +
write cache; just take the performance hit.

Rick