From: Rick Romero
To: freebsd-fs@freebsd.org
Date: Thu, 22 May 2014 09:19:32 -0500
Message-ID: <20140522091932.Horde.hsT5LUjnShIYq2YrtCVdnA1@www.vfemail.net>
Subject: Re: Turn off RAID read and write caching with ZFS?
References: <719056985.20140522033824@supranet.net> <537DF2F3.10604@denninger.net> <537E0301.4010509@denninger.net>
In-Reply-To: <537E0301.4010509@denninger.net>

Quoting Karl Denninger:

> On 5/22/2014 8:33 AM, Bob Friesenhahn wrote:
>> On Thu, 22 May 2014, Karl Denninger wrote:
>>> Write-caching is very evil in a ZFS world, because ZFS checksums each
>>> block. If the filesystem gets back an "OK" for a block not actually on
>>> the disk ZFS will presume the checksum is ok.  If that assumption
>>> proves to be false down the road you're going to have a very bad day.
>>
>> I don't agree with the above statement.  Non-volatile write caching is
>> very beneficial for zfs since it allows transactions (particularly
>> synchronous zil writes) to complete much quicker. This is important for
>> NFS servers and for databases.  What is important is that the cache
>> either be non-volatile (e.g. battery-backed RAM) or absolutely observe
>> zfs's cache flush requests.  Volatile caches which don't obey cache
>> flush requests can result in a corrupted pool on power loss, system
>> panic, or controller failure.
>>
>> Some plug-in RAID cards have poorly performing firmware which causes
>> problems.
>> Only testing or experience from other users can help
>> identify such cards so that they can be avoided or set to their least
>> harmful configuration.
>
> Let's think this one through.
>
> You have said disk on said controller.
>
> It has a battery-backed RAM cache and JBOD drives on it.
>
> Your database says "Write/Commit" and the controller does, to cache, and
> says "ok, done."  The data is now in the battery-backed cache. Let's
> further assume the cache is ECC-corrected and we'll accept the risk of
> an undetected ECC failure (very, very long odds on that one so that
> seems reasonable.)
>
> Some time passes and other I/O takes place without incident.
>
> Now the *DRIVE* returns an unrecoverable data error during the actual
> write to spinning rust when the controller (eventually) flushes its
> cache.

Technically, you have the same problem with the drive's own local cache.
But disabling write cache on every device just to satisfy ZFS makes it
ungodly slow, IMHO.

Also IMHO, your scenario is a bit overstated. In this case the drive
should mark the sector as bad and write its cached data to a new sector,
instead of the controller disabling the entire disk as you described.
And if the controller does disable the entire drive, that is safer under
controller-based RAID, because the controller can write its cached data
to a different drive when that one fails. When run as cached JBOD, then
sure, you could be hosed if the entire drive fails and it's not caught
before a write.

So basically, IMHO again: if you run write cache on the controller and
have a battery-backed cache (BBC) + UPS, then use controller-based RAID.
Don't disable the drive cache in either case, unless you want complete
ZFS protection at the cost of performance.

I have had ZFS detect a power supply issue by repeatedly disabling
drives, so I don't recommend controller-based RAID + write cache; just
take the performance hit.

Rick
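
The quoted scenario (the controller says "ok, done" from cache, and the
drive error only shows up at flush time) can be sketched as a toy Python
model. This is purely illustrative and assumes nothing about any real
controller or ZFS API: the names Disk and WriteBackCache, the sector
numbers, and the bad-sector set are all hypothetical.

```python
class Disk:
    """Toy disk: maps sector -> data; some sectors are assumed to be bad."""

    def __init__(self, bad_sectors=()):
        self.sectors = {}
        self.bad_sectors = set(bad_sectors)

    def write(self, sector, data):
        # A bad sector fails the write instead of storing the data.
        if sector in self.bad_sectors:
            raise IOError(f"unrecoverable write error on sector {sector}")
        self.sectors[sector] = data


class WriteBackCache:
    """Toy controller cache: acknowledges writes before they reach the disk."""

    def __init__(self, disk):
        self.disk = disk
        self.pending = {}

    def write(self, sector, data):
        # The caller is told "ok" here, while the data is only in the cache.
        self.pending[sector] = data
        return "ok"

    def flush(self):
        # Push pending blocks to the disk; return the sectors that failed.
        failed = []
        for sector, data in list(self.pending.items()):
            try:
                self.disk.write(sector, data)
                del self.pending[sector]
            except IOError:
                failed.append(sector)
        return failed


disk = Disk(bad_sectors={7})
cache = WriteBackCache(disk)

# Both writes are acknowledged immediately, including the one to sector 7.
acks = [cache.write(s, b"block") for s in (5, 7)]

# Only at flush time does the failure on sector 7 become visible.
failed = cache.flush()
print(acks, failed, sorted(disk.sectors))
```

In this sketch the block for sector 7 stays in cache.pending after the
failed flush. What a real drive or controller does next (remap the
sector, retry, or fail the whole disk) is exactly the point being argued
in this thread.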