Subject: Re: ZFS ARC and mmap/page cache coherency question
From: Karl Denninger <karl@denninger.net>
To: freebsd-hackers@freebsd.org
Date: Tue, 5 Jul 2016 14:08:55 -0500

You'd get most of the way to what Oracle did, I suspect, if the system:

1. Dynamically resized the write cache on a per-vdev basis so as to
   prevent a flush from stalling all write I/O for a material amount of
   time (which can and *does* happen now).

2. Made VM aware of UMA "committed-but-free" memory on an ongoing basis
   and policed it on a sliding scale -- that is, as RAM pressure rises,
   VM considers it increasingly important to reap UMA so that
   marked-used-but-in-fact-free RAM does not accumulate while RAM is
   under pressure.

3. Bi-directionally hooked VM so that it initiates and cooperates with
   ZFS on ARC size management.  Specifically, if ZFS decides the ARC is
   to be reaped, it must notify VM so that (1) UMA can be reaped first
   if necessary, and then, if the ARC *still* needs to be reaped, that
   happens *before* VM pages anything out.  If and only if the ARC is
   at its minimum should the VM system evict working set to the
   pagefile.

#1 is entirely within ZFS but is fairly hard to do well, and neither
the Illumos nor the FreeBSD team has taken a serious crack at it.

#2 I've taken a fairly decent look at, but I have not implemented code
on the VM side to do it.
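To make the ordering in point 3 concrete, here is a minimal,
self-contained sketch in plain userland C.  Every name and number in it
is made up for illustration -- none of these are actual FreeBSD, UMA or
ZFS symbols -- but it shows the intended sequence: reap
committed-but-free UMA first, then shrink the ARC toward its floor, and
only then let the pager consider evicting working set.

/*
 * Toy model of the reclaim ordering in point 3.  All identifiers and
 * values are illustrative only; they are not real kernel symbols.
 */
#include <stdbool.h>
#include <stdio.h>

static long free_mem   = 1000;   /* truly free RAM (arbitrary units)   */
static long uma_cached = 4000;   /* UMA marked-used-but-in-fact-free   */
static long arc_size   = 16000;  /* current ARC size                   */
static long arc_min    = 2800;   /* ARC hard floor                     */
static const long free_target = 6000;  /* free-memory goal under pressure */

static bool under_pressure(void) { return free_mem < free_target; }

static void reclaim_pass(void)
{
    /* Step 1: reap committed-but-free UMA before touching anything else. */
    while (under_pressure() && uma_cached > 0) {
        uma_cached -= 100;
        free_mem   += 100;
    }
    /* Step 2: shrink the ARC, but never below its minimum. */
    while (under_pressure() && arc_size > arc_min) {
        arc_size -= 100;
        free_mem += 100;
    }
    /* Step 3: only with the ARC at its floor may working set be evicted. */
    if (under_pressure())
        printf("ARC at minimum; pager may now evict working set\n");
}

int main(void)
{
    reclaim_pass();
    printf("free=%ld uma=%ld arc=%ld\n", free_mem, uma_cached, arc_size);
    return 0;
}

The point of the ordering is simply that steps 1 and 2 trade cache for
free memory at no I/O cost, while step 3 always costs at least one
write and possibly a later read.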
What I *have* done for #2 is implement code on the ZFS side, within the
ZFS paradigm.  That is technically the wrong place, but it works pretty
well -- so long as the UMA fragmentation is coming from ZFS.

#3 is a bear, especially if you don't move that code into VM (which
intimately "marries" the ZFS and VM code; that's very bad from a
maintainability perspective).  What I've implemented is somewhat of a
hack in that regard: ZFS triggers before VM does, gets aggressive about
reaping its own UMA areas and the write-back cache when there is RAM
pressure, and thus *most* of the time avoids the paging pathology while
still allowing the ARC to use the truly-free RAM.  It ought to be in
the VM code, however, because the pressure does not always come from
ZFS.

This is what one of my production machines looks like right now with
the patch in.  This system runs a quite-active Postgres database along
with a material number of other things at the same time, and it doesn't
look bad at all in terms of efficiency.

[karl@NewFS ~]$ zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Jul  5 14:05:06 2016
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                29.11m
        Recycle Misses:                         0
        Mutex Misses:                           67.14k
        Evict Skips:                            72.84m

ARC Size:                               72.10%  16.10   GiB
        Target Size: (Adaptive)         83.00%  18.53   GiB
        Min Size (Hard Limit):          12.50%  2.79    GiB
        Max Size (High Water):          8:1     22.33   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       81.84%  15.17   GiB
        Frequently Used Cache Size:     18.16%  3.37    GiB

ARC Hash Breakdown:
        Elements Max:                           1.84m
        Elements Current:               33.47%  614.39k
        Collisions:                             41.78m
        Chain Max:                              6
        Chains:                                 39.45k

------------------------------------------------------------------------

ARC Efficiency:                                 1.88b
        Cache Hit Ratio:                78.45%  1.48b
        Cache Miss Ratio:               21.55%  405.88m
        Actual Hit Ratio:               77.46%  1.46b

        Data Demand Efficiency:         77.97%  1.45b
        Data Prefetch Efficiency:       24.82%  9.07m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.52%   7.62m
          Most Recently Used:           8.38%   123.87m
          Most Frequently Used:         90.36%  1.34b
          Most Recently Used Ghost:     0.18%   2.65m
          Most Frequently Used Ghost:   0.56%   8.30m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  76.71%  1.13b
          Prefetch Data:                0.15%   2.25m
          Demand Metadata:              21.82%  322.33m
          Prefetch Metadata:            1.33%   19.58m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  78.91%  320.29m
          Prefetch Data:                1.68%   6.82m
          Demand Metadata:              16.70%  67.79m
          Prefetch Metadata:            2.70%   10.97m

------------------------------------------------------------------------

The system currently has 20 GB wired, ~3 GB free and ~1 GB inactive,
with a tiny amount (~46 MB) in the cache bucket.
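One note on reading the size section of that report: the percentages
zfs-stats prints are fractions of the ARC maximum ("high water"), not
of installed RAM.  A trivial check, using only the GiB figures copied
from the output above (the small differences from the printed
percentages are just rounding in the GiB values):

/*
 * Re-derives the percentages in the zfs-stats -A size section from the
 * GiB figures it printed; nothing here is measured independently.
 */
#include <stdio.h>

int main(void)
{
    const double arc_max    = 22.33;  /* GiB, Max Size (High Water)  */
    const double arc_size   = 16.10;  /* GiB, current ARC size       */
    const double arc_target = 18.53;  /* GiB, adaptive target size   */
    const double arc_min    =  2.79;  /* GiB, Min Size (Hard Limit)  */

    printf("size   = %5.2f%% of max\n", 100.0 * arc_size   / arc_max);
    printf("target = %5.2f%% of max\n", 100.0 * arc_target / arc_max);
    printf("min    = %5.2f%% of max\n", 100.0 * arc_min    / arc_max);
    return 0;
}

This prints roughly 72.10%, 82.98% and 12.49%, matching the 72.10%,
83.00% and 12.50% in the report.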
On 7/5/2016 13:40, Lionel Cons wrote:
> So what Oracle did (based on work done by Sun for OpenSolaris) was to:
> 1. Modify ZFS to prevent *ANY* double/multi caching [this is
> considered a design defect]
> 2. Introduce a new VM subsystem which scales a lot better and provides
> hooks for [1] so there are never two or more copies of the same data
> in the system
>
> Given that this was a huge, paid, multi-year effort, it's not likely
> that the design defects in open-source ZFS will ever go away.
>
> Lionel
>
> On 5 July 2016 at 19:50, Karl Denninger wrote:
>> On 7/5/2016 12:19, Matthew Macy wrote:
>>> ---- On Mon, 04 Jul 2016 19:26:06 -0700 Karl Denninger wrote ----
>>> > On 7/4/2016 18:45, Matthew Macy wrote:
>>> > > ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger wrote ----
>>> > > > On 7/3/2016 02:45, Matthew Macy wrote:
>>> > > > > Cedric greatly overstates the intractability of resolving it.
>>> > > > > Nonetheless, since the initial import very little has been
>>> > > > > done to improve integration, and I don't know of anyone who is
>>> > > > > up to the task taking an interest in it.  Consequently, mmap()
>>> > > > > performance is likely "doomed" for the foreseeable future.  -M
>>> > > >
>>> > > > Wellllll....
>>> > > >
>>> > > > I've done a fair bit of work here (see
>>> > > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
>>> > > > political issues are at least as bad as the coding ones.
>>> > >
>>> > > Strictly speaking, the root of the problem is the ARC, not ZFS per
>>> > > se.  Have you ever tried disabling MFU caching to see how much
>>> > > worse LRU-only is?  I'm not really convinced the ARC's benefits
>>> > > justify its cost.
>>> > >
>>> > > -M
>>> >
>>> > The ARC is very useful when it gets a hit, as it avoids an I/O that
>>> > would otherwise take place.
>>> >
>>> > Where it sucks is when the system evicts working set to preserve
>>> > ARC.  That's always wrong, in that you're trading a speculative I/O
>>> > (if the cache is hit later) for a *guaranteed* one (to page out) and
>>> > maybe *two* (to page back in).
>>>
>>> The question wasn't ARC vs. no caching.  It was LRU-only vs. LRU +
>>> MFU.  There are a lot of issues stemming from the fact that ZFS is a
>>> transactional object store with a POSIX FS on top.  One is that it
>>> caches disk blocks as opposed to file blocks.  However, if one could
>>> resolve that and have the page cache manage these blocks, life would
>>> be much, much better.  However, you'd lose MFU.  Hence my question.
>>>
>>> -M
>>>
>> I suspect there's an argument to be made there, but the present
>> problems make determining the impact of that difficult or impossible,
>> as those effects are swamped by the other issues.
>>
>> I can fairly easily create workloads on the base code where simply
>> typing "vi ", making a change and hitting ":w" will result in a stall
>> of tens of seconds or more while the cache flush that gets requested
>> is run down.  I've resolved a good part (but not all instances) of
>> this through my work.
>>
>> My understanding is that 11- has had additional work done to the base
>> code, but three underlying issues are, from what I can see in the
>> commit logs and discussions, not addressed: the VM system will page
>> out working set while leaving the ARC alone; UMA
>> reserved-but-not-in-use space is not policed adequately when memory
>> pressure exists, *before* the pager starts considering evicting
>> working set; and the write-back cache is for many machine
>> configurations grossly inappropriate and cannot be tuned adequately by
>> hand (particularly on a system with vdevs that have materially-varying
>> performance levels).
>>
>> I have more-or-less stopped work on the tree on a forward basis, since
>> I got to a place with 10.2 that (1) works for my production
>> requirements, resolving the problems, and (2) ran into what I deemed
>> to be intractable political issues within core on progress toward
>> eradicating the root of the problem.
>>
>> I will probably revisit the situation with 11- at some point, as I'll
>> want to roll my production systems forward.  However, I don't know
>> when that will be -- right now 11- is stable enough for some of my
>> embedded work (e.g. on the Raspberry Pi2) but is not on my server and
>> client-class machines.  Indeed just yesterday I got a lock-order
>> reversal panic while doing a shutdown after a kernel update on one of
>> my lab boxes running a just-updated 11- codebase.
>>
>> --
>> Karl Denninger
>> karl@denninger.net
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>

-- 
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/