From: Karl Denninger
Date: Thu, 22 May 2014 09:00:33 -0500
To: freebsd-fs@freebsd.org
Subject: Re: Turn off RAID read and write caching with ZFS?
Message-ID: <537E0301.4010509@denninger.net>
References: <719056985.20140522033824@supranet.net> <537DF2F3.10604@denninger.net>
On 5/22/2014 8:33 AM, Bob Friesenhahn wrote:
> On Thu, 22 May 2014, Karl Denninger wrote:
>>
>> Write-caching is very evil in a ZFS world, because ZFS checksums each
>> block. If the filesystem gets back an "OK" for a block not actually
>> on the disk ZFS will presume the checksum is ok. If that assumption
>> proves to be false down the road you're going to have a very bad day.
>
> I don't agree with the above statement. Non-volatile write caching is
> very beneficial for zfs since it allows transactions (particularly
> synchronous zil writes) to complete much quicker. This is important
> for NFS servers and for databases. What is important is that the
> cache either be non-volatile (e.g. battery-backed RAM) or absolutely
> observe zfs's cache flush requests. Volatile caches which don't obey
> cache flush requests can result in a corrupted pool on power loss,
> system panic, or controller failure.
>
> Some plug-in RAID cards have poorly performing firmware which causes
> problems. Only testing or experience from other users can help
> identify such cards so that they can be avoided or set to their least
> harmful configuration.
>

Let's think this one through.

You have said disk on said controller. It has a battery-backed RAM
cache and JBOD drives on it.

Your database says "Write/Commit" and the controller does, to cache, and
says "ok, done." The data is now in the battery-backed cache. Let's
further assume the cache is ECC-corrected and we'll accept the risk of
an undetected ECC failure (very, very long odds on that one so that
seems reasonable.)

Some time passes and other I/O takes place without incident.
Now the *DRIVE* returns an unrecoverable data error during the actual
write to spinning rust when the controller (eventually) flushes its
cache. Note that the controller can't rebuild the drive as it doesn't
have a second copy; it's JBOD. When does the operating system find out
about the fault and what locality of the fault does it learn about?

Be very careful with your assumptions here. If there is more than one
filesystem on that drive the I/O that actually returns a fault (because
of when it is detected) may in fact be to a *different filesystem* than
the one that actually faulted!

The only safe thing for the adapter to do if it detects a failure on a
deferred (battery-backed) write is to declare the entire *disk* dead and
return an error for all subsequent I/O attempts to it, effectively
forcing all data on that pack to be declared "gone" at the OS level. You
had better hope the adapter does that (are you sure yours does?) or
you're going to get a surprise of a most-unpleasant sort because there
is no way for the adapter to go back and declare a
formerly-committed-and-confirmed I/O invalid.

At a minimum by doing this you have multiplied a single-block failure
into a failure of *all* blocks on the media as soon as the first one
fails. In practice that may not be all that far off the mark (drives
have a distressing habit of failing far more than one block at a time)
but to force that behavior is something you should be aware of.

There is a very good argument for what amounts to a battery-backed RAM
"disk" for ZIL for the reasons you noted.
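
To make the deferred-write sequence described above concrete, here is a
toy Python model of it. This is only a sketch under stated assumptions:
the Drive and WriteBackController classes, the block numbers, and the
"declare the whole disk dead" policy are all made up for illustration
and do not describe any real controller's firmware.

```python
from dataclasses import dataclass, field


class DriveError(Exception):
    """Raised when the simulated drive or disk cannot service an I/O."""


@dataclass
class Drive:
    # Hypothetical JBOD drive: blocks listed in bad_blocks fail on write.
    bad_blocks: set = field(default_factory=set)
    blocks: dict = field(default_factory=dict)

    def write(self, lba, data):
        if lba in self.bad_blocks:
            raise DriveError(f"unrecoverable write error at LBA {lba}")
        self.blocks[lba] = data


@dataclass
class WriteBackController:
    # Hypothetical battery-backed write-back cache in front of one drive.
    drive: Drive
    cache: dict = field(default_factory=dict)
    disk_dead: bool = False

    def write(self, lba, data):
        # The OS sees "ok" as soon as the data is in the cache,
        # long before the drive has been asked to store it.
        if self.disk_dead:
            raise DriveError("disk offline after deferred write failure")
        self.cache[lba] = data
        return "ok"

    def flush(self):
        # Destage cached blocks later. A failure here cannot be reported
        # against the write that was already acknowledged, so this model
        # marks the whole disk dead and discards what is left in cache.
        for lba, data in sorted(self.cache.items()):
            try:
                self.drive.write(lba, data)
            except DriveError:
                self.disk_dead = True
                break
        self.cache.clear()


drive = Drive(bad_blocks={100})
ctl = WriteBackController(drive)

ack = ctl.write(100, b"filesystem A")  # acknowledged, but never lands
ctl.flush()                            # the drive fault surfaces here

try:
    ctl.write(5000, b"filesystem B")   # unrelated I/O sees the error
    later = "ok"
except DriveError as exc:
    later = str(exc)

print(ack, "|", later)
```

In this model the write to block 100 is confirmed to its caller, and the
first I/O to actually see an error is an unrelated write to block 5000,
which could just as well belong to a different filesystem on the same
drive.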
And I do agree there are significant performance improvements to be had
from battery-backed RAM adapters in a ZFS environment (by the way, set
the zfs logbias to "throughput" rather than "latency" if you're using a
controller cache since ZFS is incapable of deterministically predicting
latency and that can lead to some really odd behavior) but in terms of
operational integrity you are taking a risk by doing this.

Then again we lived with that risk in the world before ZFS and
hardware-backed RAID in that an *undetected* sector fault was
potentially ruinous, and since individual blocks were not checksummed it
did occasionally happen.

All configurations carry risk and you have to evaluate which ones you're
willing to live with and which ones you simply cannot accept.

-- 
-- Karl
karl@denninger.net