From owner-freebsd-hackers@freebsd.org Tue Jul 5 02:46:47 2016
From: Karl Denninger
To: freebsd-hackers@freebsd.org
Subject: Re: ZFS ARC and mmap/page cache coherency question
Date: Mon, 4 Jul 2016 21:46:29 -0500
List-Id: Technical Discussions relating to FreeBSD

On 7/4/2016 21:36, Allan Jude wrote:
> On 2016-07-04 22:32, Karl Denninger wrote:
>> On 7/4/2016 21:28, Allan Jude wrote:
>>> On 2016-07-04 22:26, Karl Denninger wrote:
>>>>
>>>> On 7/4/2016 18:45, Matthew Macy wrote:
>>>>>
>>>>> ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger
>>>>> wrote ----
>>>>> >
>>>>> > On 7/3/2016 02:45, Matthew Macy wrote:
>>>>> > >
>>>>> > > Cedric greatly overstates the intractability of
>>>>> > > resolving it. Nonetheless, since the initial import very little
>>>>> > > has been done to improve integration, and I don't know of anyone
>>>>> > > who is up to the task taking an interest in it. Consequently,
>>>>> > > mmap() performance is likely "doomed" for the foreseeable
>>>>> > > future.-M----
>>>>> >
>>>>> > Wellllll....
>>>>> >
>>>>> > I've done a fair bit of work here (see
>>>>> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
>>>>> > political issues are at least as bad as the coding ones.
>>>>> >
>>>>>
>>>>>
>>>>> Strictly speaking, the root of the problem is the ARC. Not ZFS per
>>>>> se. Have you ever tried disabling MFU caching to see how much
>>>>> worse LRU only is? I'm not really convinced the ARC's benefits
>>>>> justify its cost.
>>>>>
>>>>> -M
>>>>>
>>>> The ARC is very useful when it gets a hit as it avoid an I/O that
>>>> would otherwise take place.
>>>>
>>>> Where it sucks is when the system evicts working set to preserve ARC.
>>>> That's always wrong in that you're trading a speculative I/O (if the
>>>> cache is hit later) for a *guaranteed* one (to page out) and maybe
>>>> *two* (to page back in.)
>>>>
>>> ZFS is better behaved in 11.x, there is a sysctl
>>> vfs.zfs.arc_free_target
>>> that makes sure the ARC is reined in when there is memory pressure, by
>>> ensuring a minimum amount of actually free pages.
>>>
>> Oh, but.....
>>
>> Again, go read the PR I linked (and the current version of the patch
>> against 10-STABLE.) The issues are far more intertwined than that.
>> Specifically, the dmu_tx cache decision (size of the write-back cache)
>> is flat-out broken and inappropriate in essentially all cases, and the
>> interaction of UMA and ARC is very destructive under a wide variety of
>> workloads. The patch has hack-around for the dmu_tx problem and a
>> reasonably-effective fix for the UMA issues. Actually fixing dmu_tx,
>> however, is nowhere near that easy since it really needs to be computed
>> per-zvol on an actual bytes moved per-unit-of-time basis.
>>
>> Note that one of the patches in the set I developed is indeed
>> arc_free_target (indeed it was the first approach I took) -- but without
>> addressing the other two issues it doesn't solve the problem.
>>
>
> You keep saying per zvol. Do you mean per vdev? I am under the
> impression that no zvol's are involved in the use case this thread is
> about.

Sorry, per-vdev. The problem with dmu_tx is that it's system-wide.
This is wildly inappropriate for several reasons -- first, it is
computed on size-of-RAM with a hard cap (which is stupid on its face)
and it is entirely insensitive to the performance of the vdevs in
question. Specifically, it is very common for a system to have very
fast (e.g. SSD) disks, perhaps in a mirror configuration, and then
spinning rust in a RaidZ2 config for bulk storage. Those are very, very
different performance-wise and they should have wildly different
write-back cache sizes. At present there is exactly one such write-back
cache; it is system-wide and pays exactly zero attention to the
throughput of the underlying vdevs it is talking to.
This is why you can provoke minute-long stalls on a system with moderate
(e.g. 32GB) amounts of RAM if there are spinning rust devices in the
configuration.

>
> Improving the way ZFS frees memory, specifically UMA and the 'kmem
> caches' will help a lot as well.
>
Well, yeah. But that means you have to police up the size of the UMA
vs. how much is actually in use in the UMA. What the PR does is get
pretty aggressive with that whenever RAM is tight, and before the pager
can start playing hell with system performance.

> In addition, another patch just went in to allow you to change the
> arc_max and arc_min on a running system.
>
Yes, the PR I did a long time ago made that "active" on a running
system.... so I've had that for quite some time.

Not that you really ought to need to play with that (if you feel a need
to, then you're still at step 1 or 2 of what I went through with
analyzing and working on this in the 10.x code.....)

-- 
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
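
To put rough numbers on the write-back sizing argument above, here is a
minimal Python sketch. Everything in it is an assumption made for
illustration: the 10% of RAM figure, the 4 GiB cap, the throughput
numbers, the 5-second drain target, and the function names. None of it
is taken from the ZFS source or from the patch in the PR. It compares a
single system-wide limit derived from RAM size against a hypothetical
per-vdev limit derived from bytes moved per unit of time.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def system_wide_dirty_limit(ram_bytes, percent=10, hard_cap=4 * GIB):
    """One limit for the whole system: a share of RAM, clamped to a cap.

    The percent and hard_cap defaults are illustrative assumptions.
    """
    return min(ram_bytes * percent // 100, hard_cap)


def drain_seconds(dirty_bytes, vdev_bytes_per_sec):
    """Time to flush a full write-back cache at a given sustained rate."""
    return dirty_bytes / vdev_bytes_per_sec


def per_vdev_dirty_limit(vdev_bytes_per_sec, target_seconds=5):
    """Hypothetical alternative: size the cache from observed throughput
    so that a full cache always drains in about target_seconds."""
    return int(vdev_bytes_per_sec * target_seconds)


if __name__ == "__main__":
    ram = 32 * GIB
    # Assumed sustained write rates, chosen only to show the contrast.
    pools = {"ssd-mirror": 500 * MIB, "raidz2-rust": 50 * MIB}

    shared = system_wide_dirty_limit(ram)
    for name, rate in pools.items():
        own = per_vdev_dirty_limit(rate)
        print(
            f"{name}: shared limit {shared / MIB:.0f} MiB drains in "
            f"{drain_seconds(shared, rate):.1f} s; "
            f"per-vdev limit {own / MIB:.0f} MiB drains in "
            f"{drain_seconds(own, rate):.1f} s"
        )
```

With these made-up figures the RAM-derived limit takes on the order of a
minute to drain to the slow pool, which is the kind of stall described
above, while a throughput-derived limit keeps the drain time the same on
both pools.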