From: Karl Denninger
To: freebsd-hackers@freebsd.org
Subject: Re: ZFS ARC and mmap/page cache coherency question
Date: Sun, 3 Jul 2016 10:43:19 -0500
Message-ID: <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
In-Reply-To: <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org>
References: <20160630140625.3b4aece3@splash.akips.com>
 <20160703123004.74a7385a@splash.akips.com>
 <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org>

On 7/3/2016 02:45, Matthew Macy wrote:
> Cedric greatly overstates the intractability of resolving it.
> Nonetheless, since the initial import very little has been done to
> improve integration, and I don't know of anyone who is up to the task
> taking an interest in it. Consequently, mmap() performance is likely
> "doomed" for the foreseeable future.
>
> -M

Wellllll....

I've done a fair bit of work here (see
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
political issues are at least as bad as the coding ones.

In short what Cedric says about the root of the issue is real. VM is
really well implemented for what it handles, but the root of the issue
is that while the UFS data cache is part of VM and thus it "knows" about
it, ZFS is not because it is a "bolt-on." UMA leads to further (severe)
complications for certain workloads.

Finally the underlying ZFS dmu_tx sizing code is just plain wrong and in
fact this is one of the biggest issues, as when the system runs into
trouble it can take a bad situation and make it a *lot* worse. There is
only one write-back cache maintained instead of one per zvol, and that's
flat-out broken.

Being able to re-order async writes to disk (where fsync() has not been
called) and minimizing seek latency is excellent. Sadly rotating media
these days sabotages much of this due to opacity introduced at the drive
level (e.g. varying sector counts per track, etc) but it can still help.

But where things go dramatically wrong is on a system where a large
write-back cache is allocated relative to the underlying zvol I/O
performance (this occurs on moderately-large and bigger RAM systems)
with moderate numbers of modest-performance rotating media; in this case
it is entirely possible for a flush of the write buffers to require
upwards of a *minute* to complete, during which all other writes block.
If this happens during periods of high RAM demand and you manage to
trigger a page-out at the same time, system performance will go straight
into the toilet. I have seen instances where simply trying to edit a
text file with vi (or a "select" against a database table) will hang for
upwards of a minute, leading you to believe the system has crashed when
in fact it has not.

The interaction of VM with the above can lead to severe pathological
behavior because the VM system has no way to tell the ZFS subsystem to
pare back ARC (and at least as important, perhaps more so -- unused but
allocated UMA) when memory pressure exists *before* it pages. ZFS tries
to detect memory pressure and do this itself but it winds up competing
with the VM system.

This leads to demonstrably wrong behavior because you never want to hold
disk cache in preference to RSS; if you have a block of data from the
disk the best case is you avoid one I/O (to re-read it); if you page you
are *guaranteed* to take one I/O (to write the paged-out RSS to disk)
and *might* take two (if you then must read it back in.) In short,
trading the avoidance of one *possible* I/O for a *guaranteed* I/O and a
second possible one is *always* a net loss.

To "fix" all of this "correctly" (for all cases, instead of certain
cases) VM would have to "know" about ARC and its use of UMA, along with
being able to police both. ZFS also must have the dmu_tx writeback cache
sized per-zvol with its size chosen by the actual I/O performance
characteristics of the disks in the zvol itself.

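To put rough numbers on the two arguments above, here is a minimal
back-of-the-envelope sketch in Python. It is only a model: the function
names, the 4 GiB cache, the 60 MiB/s write rate and the 5-second target
are all assumptions picked for illustration, and none of this reflects
how the ZFS or VM code actually computes anything.

```python
# Toy model of the write-back flush time and the paging trade-off.
# Every name and number here is a hypothetical, not a ZFS/VM internal.

MiB = 1024 * 1024
GiB = 1024 * MiB


def flush_seconds(dirty_bytes, write_bytes_per_s):
    """Time to drain a full write-back cache at a sustained write rate."""
    return dirty_bytes / write_bytes_per_s


def writeback_limit(write_bytes_per_s, target_flush_s):
    """Largest write-back cache that still drains within target_flush_s."""
    return int(write_bytes_per_s * target_flush_s)


def expected_ios_evict_cache(p_reread):
    """Dropping a cached disk block costs at most one later re-read."""
    return p_reread


def expected_ios_page_out(p_page_in):
    """Paging out RSS costs one guaranteed write plus a possible read."""
    return 1.0 + p_page_in


# Assumed box: 4 GiB of dirty data, disks sustaining 60 MiB/s of writes.
print(flush_seconds(4 * GiB, 60 * MiB))      # a bit over 68 seconds

# Sizing the cache from the measured rate for a 5-second flush instead:
print(writeback_limit(60 * MiB, 5) // MiB)   # 300 (MiB)

# Even in the worst case for eviction (the block is always re-read) and
# the best case for paging (the page is never read back), paging is not
# cheaper.
print(expected_ios_evict_cache(1.0) <= expected_ios_page_out(0.0))  # True
```

The point of sizing from the measured write rate is that the limit
scales with what the disks can actually absorb rather than with how much
RAM the machine happens to have.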
I've looked into doing both and it's fairly complex, and what's worse is
that it would effectively "marry" VM and ZFS, removing the "bolt-on"
aspect of things. This then leads to a lot of maintenance work over time
because any time ZFS code changes (and it does, quite a bit) you then
have to go back through that process in order to become coherent with
Illumos.

The PR above resolved (completely) the issues I was having along with a
number of other people on 10.x and before (I've not yet rolled it
forward to 11.) but it's quite clearly a hack of sorts, in that it
detects and treats symptoms (e.g. dynamic TX cache size modification,
etc) rather than integrating VM and ZFS cache management.

-- 
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/