Date:      Tue, 5 Jul 2016 14:08:55 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-hackers@freebsd.org
Subject:   Re: ZFS ARC and mmap/page cache coherency question
Message-ID:  <2be70811-add4-d630-7f5a-a5a53ee2a5d4@denninger.net>
In-Reply-To: <CAPJSo4VtJ1+txt4s13nKSWrj9fDTv5VsLVyMsX+DarBUVYMbOQ@mail.gmail.com>
References:  <20160630140625.3b4aece3@splash.akips.com> <CALXu0UfxRMnaamh+po5zp=iXdNUNuyj+7e_N1z8j46MtJmvyVA@mail.gmail.com> <20160703123004.74a7385a@splash.akips.com> <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org> <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net> <155b84da0aa.ad3af0e6139335.8627172617037605875@nextbsd.org> <7e00af5a-86cd-25f8-a4c6-2d946b507409@denninger.net> <155bc1260e6.12001bf18198857.6272515207330027022@nextbsd.org> <31f4d30f-4170-0d04-bd23-1b998474a92e@denninger.net> <CAPJSo4VtJ1+txt4s13nKSWrj9fDTv5VsLVyMsX+DarBUVYMbOQ@mail.gmail.com>

You'd get most of the way to what Oracle did, I suspect, if the system:

1. Dynamically resized the write cache on a per-vdev basis so as to
prevent a flush from stalling all write I/O for a material amount of
time (which can and *does* happen now)

2. Made VM aware of UMA "committed-but-free" memory on an ongoing basis
and policed it on a sliding scale (that is, as RAM pressure rises, VM
considers it more important to reap UMA so that
marked-used-but-in-fact-free RAM does not accumulate.)

3. Bi-directionally hooked VM so that it initiates and cooperates with
ZFS on ARC size management.  Specifically, if ZFS decides ARC is to be
reaped then it must notify VM so that (1) UMA can be reaped first if
necessary, and (2) if ARC *still* needs to be reaped, that occurs
*before* VM pages anything out.  If and only if ARC is at its minimum
should the VM system evict working set to the pagefile.

#1 is entirely within ZFS but is fairly hard to do well, and neither
the Illumos nor the FreeBSD team has taken a serious crack at it.
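
To make the intent of #1 concrete, here is a rough sketch of the sort of
per-vdev scaling I have in mind.  The structure and function names below
are made up purely for illustration and do not exist in the ZFS code; the
point is only that each vdev's dirty-data cap tracks what that vdev can
actually retire within a bounded flush time:

/*
 * Hypothetical sketch of #1: shrink or grow a per-vdev dirty-data limit
 * based on how long that vdev's recent flushes took, so one slow vdev
 * cannot queue enough dirty data to stall every writer during a sync.
 * None of these names exist in ZFS; they only illustrate the scaling.
 */
#include <stdint.h>

#define FLUSH_TARGET_MS   2000          /* acceptable worst-case flush time */
#define LIMIT_MIN         (64ULL << 20) /* never go below 64 MB */
#define LIMIT_MAX         (4ULL << 30)  /* never go above 4 GB */

struct vdev_write_limit {
	uint64_t wl_limit_bytes;   /* current dirty-data cap for this vdev */
	uint64_t wl_write_bps;     /* measured throughput, bytes/sec (EWMA) */
};

/* Called after each flush with the size and duration of that flush. */
static void
vdev_write_limit_update(struct vdev_write_limit *wl, uint64_t flush_bytes,
    uint64_t flush_ms)
{
	uint64_t bps, target;

	if (flush_ms == 0)
		flush_ms = 1;
	bps = flush_bytes * 1000 / flush_ms;

	/* Smooth the throughput estimate: 7/8 old, 1/8 new. */
	wl->wl_write_bps = (wl->wl_write_bps * 7 + bps) / 8;

	/* Cap dirty data at what the vdev can retire in FLUSH_TARGET_MS. */
	target = wl->wl_write_bps * FLUSH_TARGET_MS / 1000;
	if (target < LIMIT_MIN)
		target = LIMIT_MIN;
	if (target > LIMIT_MAX)
		target = LIMIT_MAX;
	wl->wl_limit_bytes = target;
}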

#2 I've taken a fairly decent look at, but I have not implemented code
on the VM side to do it.  What I *have* done is implement code on the
ZFS side to do it within the ZFS paradigm, which is technically the
wrong place but works pretty well -- so long as the UMA fragmentation is
coming from ZFS.
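
For illustration, the sliding policy I mean in #2 amounts to something
like the following.  uma_free_estimate() and reap_uma_zones() are
stand-ins, not existing kernel interfaces:

/*
 * Hypothetical sketch of #2: police UMA "committed but free" memory on a
 * sliding scale.  The tolerated slack shrinks as free RAM shrinks, so
 * cached-but-unused slabs are reaped before the pager ever has to run.
 */
#include <stdint.h>

extern uint64_t uma_free_estimate(void);   /* bytes held by UMA but unused */
extern void     reap_uma_zones(void);      /* return free slabs to the VM  */

static void
police_uma(uint64_t mem_total, uint64_t mem_free)
{
	uint64_t slack, free_pct, allowed;

	slack = uma_free_estimate();

	/*
	 * Allow UMA to keep slack proportional to how much of RAM is
	 * still free: with lots of free RAM, up to 10% of RAM may sit
	 * idle in UMA; as free RAM heads toward zero, so does the
	 * allowance.
	 */
	free_pct = mem_free * 100 / mem_total;
	allowed = mem_total / 10 * free_pct / 100;

	if (slack > allowed)
		reap_uma_zones();
}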

#3 is a bear, especially if you don't move that code into VM (which
intimately "marries" the ZFS and VM code; that's very bad from a
maintainability perspective.)  What I've implemented is somewhat of a
hack in that regard: ZFS triggers before VM does, gets aggressive about
reaping its own UMA areas and the write-back cache when there is RAM
pressure, and thus *most* of the time avoids the paging pathology while
still allowing the ARC to use the truly-free RAM.  It ought to be in the
VM code, however, because the pressure sometimes does not come from ZFS.
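
In pseudo-C, the ordering I'm arguing for in #3 (and what my hack
approximates from the ZFS side) is simply this -- again, the helpers
below are stand-ins rather than real VM or ZFS interfaces:

/*
 * Hypothetical sketch of the ordering in #3.  On RAM pressure: reap UMA
 * first, then shrink ARC, and only once ARC sits at its floor let the
 * pager evict working set.
 */
#include <stdint.h>

extern uint64_t arc_size, arc_c_min;       /* current and minimum ARC size */
extern uint64_t free_bytes(void);          /* free RAM right now           */
extern void     reap_uma_zones(void);
extern void     arc_shrink(uint64_t bytes);
extern void     page_out_working_set(void);

static void
lowmem_handler(uint64_t need)
{
	/* Step 1: harvest idle UMA slabs; this may satisfy the shortfall. */
	reap_uma_zones();
	if (free_bytes() >= need)
		return;

	/* Step 2: give back ARC, but never below its hard minimum. */
	if (arc_size > arc_c_min)
		arc_shrink(need - free_bytes());
	if (free_bytes() >= need)
		return;

	/* Step 3: only now is it acceptable to evict working set. */
	page_out_working_set();
}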

This is what one of my production machines looks like right now with
the patch in -- this system runs a quite-active Postgres database along
with a material number of other things at the same time, and it doesn't
look bad at all in terms of efficiency.

[karl@NewFS ~]$ zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Jul  5 14:05:06 2016
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                29.11m
        Recycle Misses:                         0
        Mutex Misses:                           67.14k
        Evict Skips:                            72.84m

ARC Size:                               72.10%  16.10   GiB
        Target Size: (Adaptive)         83.00%  18.53   GiB
        Min Size (Hard Limit):          12.50%  2.79    GiB
        Max Size (High Water):          8:1     22.33   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       81.84%  15.17   GiB
        Frequently Used Cache Size:     18.16%  3.37    GiB

ARC Hash Breakdown:
        Elements Max:                           1.84m
        Elements Current:               33.47%  614.39k
        Collisions:                             41.78m
        Chain Max:                              6
        Chains:                                 39.45k

------------------------------------------------------------------------

ARC Efficiency:                                 1.88b
        Cache Hit Ratio:                78.45%  1.48b
        Cache Miss Ratio:               21.55%  405.88m
        Actual Hit Ratio:               77.46%  1.46b

        Data Demand Efficiency:         77.97%  1.45b
        Data Prefetch Efficiency:       24.82%  9.07m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.52%   7.62m
          Most Recently Used:           8.38%   123.87m
          Most Frequently Used:         90.36%  1.34b
          Most Recently Used Ghost:     0.18%   2.65m
          Most Frequently Used Ghost:   0.56%   8.30m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  76.71%  1.13b
          Prefetch Data:                0.15%   2.25m
          Demand Metadata:              21.82%  322.33m
          Prefetch Metadata:            1.33%   19.58m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  78.91%  320.29m
          Prefetch Data:                1.68%   6.82m
          Demand Metadata:              16.70%  67.79m
          Prefetch Metadata:            2.70%   10.97m

------------------------------------------------------------------------

The system currently has 20GB wired, ~3GB free and ~1GB inactive, with a
tiny amount in the cache bucket (~46MB).
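
(For anyone who wants to watch the same numbers, they come straight out
of the vm.stats.vm sysctl tree; here's a trivial userland reader.  Note
that v_cache_count is a 10.x-era counter and may not be present or
meaningful on later releases.)

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <unistd.h>

/* Read one unsigned page counter from the vm.stats.vm tree. */
static unsigned int
vmstat(const char *name)
{
	unsigned int v = 0;
	size_t len = sizeof(v);

	if (sysctlbyname(name, &v, &len, NULL, 0) != 0)
		perror(name);
	return (v);
}

int
main(void)
{
	long pgsz = sysconf(_SC_PAGESIZE);

	printf("wired:    %.1f GiB\n",
	    (double)vmstat("vm.stats.vm.v_wire_count") * pgsz / (1 << 30));
	printf("free:     %.1f GiB\n",
	    (double)vmstat("vm.stats.vm.v_free_count") * pgsz / (1 << 30));
	printf("inactive: %.1f GiB\n",
	    (double)vmstat("vm.stats.vm.v_inactive_count") * pgsz / (1 << 30));
	printf("cache:    %.1f MiB\n",
	    (double)vmstat("vm.stats.vm.v_cache_count") * pgsz / (1 << 20));
	return (0);
}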

On 7/5/2016 13:40, Lionel Cons wrote:
> So what Oracle did (based on work done by SUN for Opensolaris) was to:
> 1. Modify ZFS to prevent *ANY* double/multi caching [this is
> considered a design defect]
> 2. Introduce a new VM subsystem which scales a lot better and provides
> hooks for [1] so there are never two or more copies of the same data
> in the system
>
> Given that this was a huge, paid, multiyear effort it's not likely
> going to happen that the design defects in open-source ZFS will ever go
> away.
>
> Lionel
>
> On 5 July 2016 at 19:50, Karl Denninger <karl@denninger.net> wrote:
>> On 7/5/2016 12:19, Matthew Macy wrote:
>>>
>>>  ---- On Mon, 04 Jul 2016 19:26:06 -0700 Karl Denninger <karl@denninger.net> wrote ----
>>>  >
>>>  >
>>>  > On 7/4/2016 18:45, Matthew Macy wrote:
>>>  > >
>>>  > >
>>>  > >  ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger <karl@denninger.net> wrote ----
>>>  > >  >
>>>  > >  > On 7/3/2016 02:45, Matthew Macy wrote:
>>>  > >  > >
>>>  > >  > > Cedric greatly overstates the intractability of resolving it.
>>>  > >  > > Nonetheless, since the initial import very little has been done
>>>  > >  > > to improve integration, and I don't know of anyone who is up to
>>>  > >  > > the task taking an interest in it. Consequently, mmap()
>>>  > >  > > performance is likely "doomed" for the foreseeable future.-M----
>>>  > >  >
>>>  > >  > Wellllll....
>>>  > >  >
>>>  > >  > I've done a fair bit of work here (see
>>>  > >  > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
>>>  > >  > political issues are at least as bad as the coding ones.
>>>  > >  >
>>>  > >
>>>  > >
>>>  > > Strictly speaking, the root of the problem is the ARC. Not ZFS per
>>>  > > se. Have you ever tried disabling MFU caching to see how much worse
>>>  > > LRU only is? I'm not really convinced the ARC's benefits justify its
>>>  > > cost.
>>>  > >
>>>  > > -M
>>>  > >
>>>  >
>>>  > The ARC is very useful when it gets a hit as it avoids an I/O that
>>>  > would otherwise take place.
>>>  >
>>>  > Where it sucks is when the system evicts working set to preserve ARC.
>>>  > That's always wrong in that you're trading a speculative I/O (if the
>>>  > cache is hit later) for a *guaranteed* one (to page out) and maybe *two*
>>>  > (to page back in.)
>>>
>>> The question wasn't ARC vs. no-caching. It was LRU only vs LRU + MFU.
>>> There are a lot of issues stemming from the fact that ZFS is a
>>> transactional object store with a POSIX FS on top. One is that it caches
>>> disk blocks as opposed to file blocks. However, if one could resolve
>>> that and have the page cache manage these blocks life would be much much
>>> better. However, you'd lose MFU. Hence my question.
>>>
>>> -M
>>>
>> I suspect there's an argument to be made there but the present problems
>> make determining the impact of that difficult or impossible as those
>> effects are swamped by the other issues.
>>
>> I can fairly-easily create workloads on the base code where simply
>> typing "vi <some file>", making a change and hitting ":w" will result in
>> a stall of tens of seconds or more while the cache flush that gets
>> requested is run down.  I've resolved a good part (but not all
>> instances) of this through my work.
>>
>> My understanding is that 11- has had additional work done to the base
>> code, but three underlying issues are not, from what I can see in the
>> commit logs and discussions, addressed: The VM system will page out
>> working set while leaving ARC alone, UMA reserved-but-not-in-use space
>> is not policed adequately when memory pressure exists *before* the pager
>> starts considering evicting working set and the write-back cache is for
>> many machine configurations grossly inappropriate and cannot be tuned
>> adequately by hand (particularly being true on a system with vdevs that
>> have materially-varying performance levels.)
>>
>> I have more-or-less stopped work on the tree on a forward basis since I
>> got to a place with 10.2 that (1) works for my production requirements,
>> resolving the problems and (2) ran into what I deemed to be intractable
>> political issues within core on progress toward eradicating the root of
>> the problem.
>>
>> I will probably revisit the situation with 11- at some point, as I'll
>> want to roll my production systems forward.  However, I don't know when
>> that will be -- right now 11- is stable enough for some of my embedded
>> work (e.g. on the Raspberry Pi2) but is not on my server and
>> client-class machines.  Indeed just yesterday I got a lock-order
>> reversal panic while doing a shutdown after a kernel update on one of my
>> lab boxes running a just-updated 11- codebase.
>>
>> --
>> Karl Denninger
>> karl@denninger.net <mailto:karl@denninger.net>
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>
>

-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/



