Subject: Re: ZFS ARC under memory pressure
From: Karl Denninger <karl@denninger.net>
To: freebsd-fs@freebsd.org
Date: Fri, 19 Aug 2016 16:52:00 -0500
In-Reply-To: <20160819213446.GT8192@zxy.spb.ru>

On 8/19/2016 16:34, Slawa Olhovchenkov wrote:
> On Fri, Aug 19, 2016 at 03:38:55PM -0500, Karl Denninger wrote:
>
>> On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
>>> On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
>>>
>>>> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
>>>>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
>>>>>
>>>>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
>>>>>>> I see issues with the ZFS ARC under memory pressure.
>>>>>>> The ZFS ARC size can be dramatically reduced, all the way down to arc_min.
>>>>>>>
>>>>>>> As I see it, a memory pressure event causes a call to arc_lowmem(), which
>>>>>>> sets needfree:
>>>>>>>
>>>>>>> arc.c:arc_lowmem
>>>>>>>
>>>>>>>     needfree = btoc(arc_c >> arc_shrink_shift);
>>>>>>>
>>>>>>> After this, arc_available_memory() returns negative values (PAGESIZE *
>>>>>>> (-needfree)) until needfree is zero, independent of how much memory has
>>>>>>> already been freed.  needfree is set to 0 in arc_reclaim_thread() only
>>>>>>> when arc_size <= arc_c; until arc_size drops below arc_c, arc_c is
>>>>>>> decreased at every loop iteration.
>>>>>>>
>>>>>>> Depending on how fast arc_size drops, arc_c can end up all the way down
>>>>>>> at its minimum value.
>>>>>>>
>>>>>>> There is no control of the current reclaim against the initial memory
>>>>>>> request.
>>>>>>>
>>>>>>> As a result, I can see needless ARC reclaim, from 10x to 100x more than
>>>>>>> needed.
>>>>>>>
>>>>>>> Can someone check me and comment on this?
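Interjecting here to make the mechanism Slawa is describing easier to
follow: below is a toy user-space model of that loop.  The names
(needfree, arc_c, arc_size, arc_shrink_shift) follow arc.c, but the
sizes and the per-pass eviction rate are invented purely for
illustration -- it is a sketch of the behavior as described in this
thread, not the kernel code.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGESIZE          4096ULL
#define ARC_SHRINK_SHIFT  7                /* shrink step = arc_c / 128 */
#define EVICT_PER_PASS    (64ULL << 20)    /* pretend eviction lags the target */

static uint64_t arc_c_min = 2ULL << 30;    /* 2 GiB floor */
static uint64_t arc_c     = 20ULL << 30;   /* 20 GiB target */
static uint64_t arc_size  = 20ULL << 30;   /* bytes actually cached */
static uint64_t needfree;                  /* pages the VM event asked for */

/* Stays negative ("still under pressure") for as long as needfree != 0,
 * no matter how much has already been given back. */
static int64_t
arc_available_memory(void)
{
        if (needfree != 0)
                return (-(int64_t)(needfree * PAGESIZE));
        return (1);
}

int
main(void)
{
        uint64_t start = arc_size;

        /* arc_lowmem(): one low-memory event asks for arc_c/128. */
        needfree = (arc_c >> ARC_SHRINK_SHIFT) / PAGESIZE;
        printf("requested: %" PRIu64 " MiB\n", needfree * PAGESIZE >> 20);

        while (arc_available_memory() < 0) {
                /* The target drops on every pass... */
                uint64_t step = arc_c >> ARC_SHRINK_SHIFT;
                arc_c = (arc_c - step > arc_c_min) ? arc_c - step : arc_c_min;

                /* ...but eviction only catches up a chunk at a time. */
                uint64_t evict = (arc_size - arc_c < EVICT_PER_PASS) ?
                    arc_size - arc_c : EVICT_PER_PASS;
                arc_size -= evict;

                /* needfree is cleared only once arc_size <= arc_c. */
                if (arc_size <= arc_c)
                        needfree = 0;
        }

        printf("freed:     %" PRIu64 " MiB (arc_c now %" PRIu64 " MiB, "
            "arc_c_min %" PRIu64 " MiB)\n",
            (start - arc_size) >> 20, arc_c >> 20, arc_c_min >> 20);
        return (0);
}

With these made-up numbers, a single 160 MiB request ends with roughly
18 GiB evicted and arc_c parked at arc_c_min -- which is the 10x-100x
over-reclaim being described.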
>>>>>> You might have found a real problem here, but I am short of time right
>>>>>> now to properly analyze the issue.  I think that on illumos 'needfree'
>>>>>> is a variable that's managed by the virtual memory system and it is akin
>>>>>> to our vm_pageout_deficit.  But during the porting it became an
>>>>>> artificial value and its handling might be sub-optimal.
>>>>> As I see it, it is totally not optimal.
>>>>> I have created a patch for the sub-optimal handling and am now testing it.
>>>> You might want to look at the code contained in here:
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
>>> In my case the arc.c issue is caused by revision r286625 in HEAD (and
>>> r288562 in STABLE) -- all in 2015, not touched in 2014.
>>>
>>>> There are some ugly interactions with the VM system you can run into if
>>>> you're not careful; I've chased this issue before and while I haven't
>>>> yet done the work to integrate it into 11.x (and the underlying code
>>>> *has* changed since the 10.x patches I developed) if you wind up driving
>>>> the VM system to evict pages to swap rather than pare back ARC you're
>>>> probably making the wrong choice.
>>>>
>>>> In addition UMA can come into the picture too and (at least previously)
>>>> was a severe contributor to pathological behavior.
>>> I am only doing a less aggressive (and more controlled) shrink of the ARC
>>> size.  Right now the ARC just collapses.
>>>
>>> The PR you pointed to is really BIG.  I can't read and understand all of
>>> it.  r286625 changed the behavior of the interaction between the ARC and
>>> the VM.  Does your problem still exist?  Can you explain (on the list)?
>>>
>> Essentially ZFS is a "bolt-on" and unlike UFS, which uses the unified
>> buffer cache (which the VM system manages), ZFS does not.  The ARC is
>> allocated out of kernel memory and (by default) also uses UMA; the VM
>> system is not involved in its management.
>>
>> When the VM system gets constrained (low memory) it thus cannot tell the
>> ARC to pare back.  So when the VM system gets low on RAM it will start
> Currently the VM generates an event and the ARC listens for this event,
> handling it in arc.c:arc_lowmem().
>
>> to page.  The problem with this is that if the VM system is low on RAM
>> because the ARC is consuming memory you do NOT want to page, you want to
>> evict some of the ARC.
> Right now, on a `lowmem` event, the ARC tries to evict 1/128 of itself.
>
>> Unfortunately the VM system has another interaction that causes trouble
>> too.  The VM system will "demote" a page to inactive or cache status but
>> not actually free it.  It only starts to go through those pages and free
>> them when the VM system wakes up, and that only happens when free space
>> gets low enough to trigger it.
>
>> Finally, there's another problem that comes into play: UMA.  Kernel
>> memory allocation is fairly expensive.  UMA grabs memory from the kernel
>> allocation system in big chunks and manages it, and by doing so gains a
>> pretty significant performance boost.  But this means that you can have
>> large amounts of RAM that are allocated, not in use, and yet the VM
>> system cannot reclaim them on its own.  The ZFS code has to reap those
>> caches, but reaping them is a moderately expensive operation too, thus
>> you don't want to do it unnecessarily.
> Not sure, but some code in ZFS may handle this:
> arc.c:arc_kmem_reap_now().
> Not sure.
>
>> I've not yet gone through the 11.x code to see what changed from 10.x;
>> what I do know is that it is materially better-behaved than it used to
>> be.  Prior to 11.x I would have (by now) pretty much been forced into
>> rolling that patch forward and testing it, because the misbehavior on one
>> of my production systems was severe enough to render it basically
>> unusable without the patch from that PR in place, with the most serious
>> misbehavior being paging-induced stalls that could reach tens of seconds
>> or more in duration.
>>
>> 11.x hasn't exhibited the severe problems, unpatched, that 10.x was
>> known to exhibit on my production systems -- but it is far less than
>> great in that it sure as heck does have UMA coherence issues.....
>>
>> ARC Size:                               38.58%  8.61    GiB
>>         Target Size: (Adaptive)         70.33%  15.70   GiB
>>         Min Size (Hard Limit):          12.50%  2.79    GiB
>>         Max Size (High Water):          8:1     22.32   GiB
>>
>> I have 20GB out in kernel memory on this machine right now but only 8.6GB
>> of it in ARC; the rest is (mostly) sitting in UMA, allocated but unused
>> -- so despite the belief expressed by some that the 11.x code is
>> "better" at reaping UMA I'm sure not seeing it here.
> I see.
> In my case:
>
> ARC Size:                               79.65%  98.48   GiB
>         Target Size: (Adaptive)         79.60%  98.42   GiB
>         Min Size (Hard Limit):          12.50%  15.46   GiB
>         Max Size (High Water):          8:1     123.64  GiB
>
> System Memory:
>
>         2.27%   2.83    GiB Active,     9.58%   11.94   GiB Inact
>         86.34%  107.62  GiB Wired,      0.00%   0       Cache
>         1.80%   2.25    GiB Free,       0.00%   0       Gap
>
>         Real Installed:                         128.00  GiB
>         Real Available:                 99.96%  127.95  GiB
>         Real Managed:                   97.41%  124.64  GiB
>
>         Logical Total:                          128.00  GiB
>         Logical Used:                   88.92%  113.81  GiB
>         Logical Free:                   11.08%  14.19   GiB
>
> Kernel Memory:                                  758.25  MiB
>         Data:                           97.81%  741.61  MiB
>         Text:                           2.19%   16.64   MiB
>
> Kernel Memory Map:                              124.64  GiB
>         Size:                           81.84%  102.01  GiB
>         Free:                           18.16%  22.63   GiB
>
> Mem: 2895M Active, 12G Inact, 108G Wired, 528K Buf, 2303M Free
> ARC: 98G Total, 89G MFU, 9535M MRU, 35M Anon, 126M Header, 404M Other
> Swap: 32G Total, 394M Used, 32G Free, 1% Inuse
>
> Is this 12G of Inact the 'UMA allocated-but-unused' memory?
> It may also be freed-but-not-reclaimed network bufs.
>
>> I'll get around to rolling forward and modifying that PR since that
>> particular bit of jackassery with UMA is a definite performance
>> problem.  I suspect a big part of what you're seeing lies there as
>> well.  When I do get that code done and tested I suspect it may solve
>> your problems as well.
> No.  My problem is completely different: under memory pressure, after
> arc_lowmem() sets needfree to a non-zero value, arc_reclaim_thread()
> starts to shrink the ARC.  But arc_reclaim_thread() (in the FreeBSD case)
> doesn't correctly control this process, and the shrink stops at an
> arbitrary point (whenever, after the next iteration, arc_size <= arc_c),
> mostly after dropping to the Min Size (Hard Limit).
>
> I just restore control of the shrink process.

Not quite, due to the UMA issue, among other things.

There's also a potential "stall" issue that can arise, having to do with
dirty_max sizing, especially if you are using rotating media.  The PR
patch also scaled that back dynamically under memory pressure, which
eliminated that issue.

I won't have time to look at this for at least another week on my test
machine as I'm unfortunately buried with unrelated work at present, but
I should be able to put some effort into this within the next couple of
weeks and see if I can quickly roll forward the important parts of the
previous PR patch.
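That said, the shape of what both approaches are after is similar: when
the VM asks for N pages, give back roughly N pages' worth of ARC and then
stop, instead of iterating until arc_size falls below an ever-shrinking
arc_c.  Reusing the toy model from above, a purely illustrative sketch of
that idea (this is not the code from the PR and not Slawa's patch) looks
like this:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define ARC_SHRINK_SHIFT  7                /* request is still arc_c / 128 */
#define EVICT_PER_PASS    (64ULL << 20)    /* same invented eviction pacing */

static uint64_t arc_c_min = 2ULL << 30;
static uint64_t arc_c     = 20ULL << 30;
static uint64_t arc_size  = 20ULL << 30;

int
main(void)
{
        uint64_t start = arc_size;
        uint64_t want = arc_c >> ARC_SHRINK_SHIFT;  /* what the event asked for */

        /* Stop once roughly the requested amount has been given back. */
        while (start - arc_size < want && arc_size > arc_c_min) {
                uint64_t evict = (arc_size - arc_c_min < EVICT_PER_PASS) ?
                    arc_size - arc_c_min : EVICT_PER_PASS;
                arc_size -= evict;
        }
        if (arc_c > arc_size)
                arc_c = arc_size;       /* the target follows; it never collapses */

        printf("requested %" PRIu64 " MiB, freed %" PRIu64 " MiB, "
            "arc_c now %" PRIu64 " MiB\n",
            want >> 20, (start - arc_size) >> 20, arc_c >> 20);
        return (0);
}

With the same made-up numbers this stops after about 192 MiB freed for a
160 MiB request (the eviction granularity overshoots a little), and arc_c
stays near where it was rather than collapsing toward arc_c_min.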
I think you'll find that the rolled-forward patch stops the behavior
you're seeing -- I'm just pointing out that this was more complex
internally than it first appeared in the 10.x branch, and I have no
reason to believe the interactions that lead to the bad behavior are not
still in play, given what you're describing as symptoms.

-- 
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/