Date: Sun, 3 Jul 2016 10:43:19 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-hackers@freebsd.org
Subject: Re: ZFS ARC and mmap/page cache coherency question
Message-ID: <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
In-Reply-To: <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org>
References: <20160630140625.3b4aece3@splash.akips.com>
 <CALXu0UfxRMnaamh%2Bpo5zp=iXdNUNuyj%2B7e_N1z8j46MtJmvyVA@mail.gmail.com>
 <20160703123004.74a7385a@splash.akips.com>
 <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org>
On 7/3/2016 02:45, Matthew Macy wrote:

> Cedric greatly overstates the intractability of resolving it.
> Nonetheless, since the initial import very little has been done to
> improve integration, and I don't know of anyone who is up to the
> task taking an interest in it. Consequently, mmap() performance is
> likely "doomed" for the foreseeable future.
>
> -M----

Wellllll.... I've done a fair bit of work here (see
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
political issues are at least as bad as the coding ones.

In short, what Cedric says about the root of the issue is real. The VM
system is very well implemented for what it handles, but while the UFS
data cache is part of VM, and VM therefore "knows" about it, the ZFS
ARC is not, because ZFS is a "bolt-on." UMA leads to further (severe)
complications for certain workloads. Finally, the underlying ZFS dmu_tx
sizing code is just plain wrong, and this is in fact one of the biggest
issues: when the system runs into trouble it can take a bad situation
and make it a *lot* worse. Only one write-back cache is maintained,
instead of one per zvol, and that's flat-out broken.

Being able to re-order async writes to disk (where fsync() has not been
called) to minimize seek latency is excellent. Sadly, rotating media
these days sabotage much of this through opacity introduced at the
drive level (e.g. varying sector counts per track), but it can still
help.

Where things go dramatically wrong is on a system where the write-back
cache is large relative to the underlying zvol's I/O performance (this
occurs on moderately-large and bigger RAM systems) with a moderate
number of modest-performance rotating drives. In that case it is
entirely possible for a flush of the write buffers to take upwards of a
*minute* to complete, during which all other writes block. If this
happens during a period of high RAM demand and you manage to trigger a
page-out at the same time, system performance goes straight into the
toilet. I have seen instances where simply trying to edit a text file
with vi (or running a "select" against a database table) hangs for
upwards of a minute, leading you to believe the system has crashed when
in fact it has not.

The interaction of VM with the above can produce severely pathological
behavior, because the VM system has no way to tell the ZFS subsystem to
pare back the ARC (and, at least as important and perhaps more so,
unused-but-still-allocated UMA) when memory pressure exists *before* it
pages. ZFS tries to detect memory pressure and do this itself, but it
winds up competing with the VM system. That is demonstrably wrong
behavior, because you never want to hold disk cache in preference to
RSS: if you hold a block of disk data in cache, the best case is that
you avoid one I/O (the re-read); if you page, you are *guaranteed* to
take one I/O (writing the paged-out RSS to disk) and *might* take two
(if you must later read it back in). Trading the avoidance of one
*possible* I/O for one *guaranteed* I/O plus a second possible one is
*always* a net loss.

To "fix" all of this "correctly" (for all cases, instead of certain
cases) VM would have to "know" about the ARC and its use of UMA, and be
able to police both. ZFS would also need the dmu_tx write-back cache
sized per-zvol, with the size chosen from the actual I/O performance
characteristics of the disks in that zvol.
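As a concrete illustration of the first half: FreeBSD already gives the
kernel a pressure-notification mechanism, the vm_lowmem eventhandler,
and the ZFS port hooks it so the ARC hears about shortage. A minimal
sketch of that shape is below (the "example_" names are mine, not the
actual arc.c code). The problem described above is one of timing: this
handler fires when the VM system is *already* short, so ARC shrinking
races the pageout daemon instead of preceding it.

    /*
     * Sketch of a vm_lowmem hook (kernel-module context).  The
     * "example_" names are illustrative; the real hook lives in the
     * ZFS port's arc.c.
     */
    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/eventhandler.h>

    static eventhandler_tag example_lowmem_tag;

    static void
    example_arc_lowmem(void *arg __unused, int howto __unused)
    {
            /*
             * Kick ARC reclaim here.  By the time this runs the
             * pageout daemon may also be running, so RSS can be
             * written to swap while clean disk cache is still held:
             * the inversion described above.
             */
    }

    static void
    example_lowmem_register(void)
    {
            example_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
                example_arc_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
    }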
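And the second half is mostly arithmetic. A back-of-envelope sketch
(the names and the five-second target are mine, not actual ZFS
tunables): bound dirty data by what the disks can actually flush in a
tolerable time, rather than by a fraction of RAM.

    /*
     * Illustration only: size the write-back (dirty data) cap from
     * measured pool throughput so a full flush stays within a
     * tolerable stall.  Hypothetical names, not ZFS tunables.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_FLUSH_SECONDS    5ULL    /* longest tolerable stall */

    static uint64_t
    dirty_data_cap(uint64_t write_bytes_per_sec)
    {
            return (write_bytes_per_sec * TARGET_FLUSH_SECONDS);
    }

    int
    main(void)
    {
            uint64_t bps = 120ULL << 20;    /* ~120 MB/s of rotating media */

            printf("cap = %ju MB\n",
                (uintmax_t)(dirty_data_cap(bps) >> 20));
            return (0);
    }

At 120 MB/s that caps dirty data around 600 MB, which flushes in about
five seconds; a RAM-derived cache ten or more times that size on the
same disks is exactly where the minute-long stalls come from.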
I've looked into doing both, and it's fairly complex. What's worse is
that it would effectively "marry" VM and ZFS, removing the "bolt-on"
aspect of things. That leads to a lot of maintenance work over time,
because every time the ZFS code changes (and it does, quite a bit) you
have to go back through the same exercise to stay coherent with
Illumos.

The PR above resolved (completely) the issues I was having, along with
those of a number of other people, on 10.x and before (I've not yet
rolled it forward to 11.), but it's quite clearly a hack of sorts, in
that it detects and treats symptoms (e.g. dynamic TX cache size
modification, etc.) rather than integrating VM and ZFS cache
management.

-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
