Date: Fri, 29 Aug 2014 12:26:57 -0700
From: Peter Wemm <peter@wemm.org>
To: Alan Cox <alc@rice.edu>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, Dmitry Morozovsky <marck@rinet.ru>, "Matthew D. Fuller" <fullermd@over-yonder.net>, svn-src-head@freebsd.org, Steven Hartland <smh@freebsd.org>
Subject: Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
Message-ID: <1592506.xpuae4IYcM@overcee.wemm.org>
In-Reply-To: <5400B052.6030103@rice.edu>
References: <201408281950.s7SJo90I047213@svn.freebsd.org> <4A4B2C2D36064FD9840E3603D39E58E0@multiplay.co.uk> <5400B052.6030103@rice.edu>
On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> On 08/29/2014 03:32, Steven Hartland wrote:
> >> On Thursday 28 August 2014 17:30:17 Alan Cox wrote:
> >> > On 08/28/2014 16:15, Matthew D. Fuller wrote:
> >> > > On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
> >> > > Steven Hartland, and lo! it spake thus:
> >> > >> It's very likely applicable to stable/9, although I've never used 9
> >> > >> myself; we jumped from 9 direct to 10.
> >> > >
> >> > > This is actually hitting two different issues from the two bugs:
> >> > >
> >> > > - 191510 is about "ARC isn't greedy enough" on huge-memory machines,
> >> > >   and from the osreldate that bug was filed on 9.2, so presumably is
> >> > >   applicable.
> >> > >
> >> > > - 187594 is about "ARC is too greedy" (probably mostly on not-so-huge
> >> > >   machines) and starves/drives the rest of the system into swap.  That
> >> > >   I believe came about as a result of some unrelated change in the
> >> > >   10.x stream that upset the previous balance between ARC and the rest
> >> > >   of the VM, so isn't a problem on 9.x.
> >> >
> >> > 10.0 had a bug in the page daemon that was fixed in 10-STABLE about
> >> > three months ago (r265945).  The ARC was not the only thing affected
> >> > by this bug.
> >>
> >> I'm concerned about potential unintended consequences of this change.
> >>
> >> Before, arc reclaim was driven by vm_paging_needed(), which was:
> >> vm_paging_needed(void)
> >> {
> >> 	return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
> >> 	    vm_pageout_wakeup_thresh);
> >> }
> >>
> >> Now it's ignoring the v_cache_count and looking exclusively at
> >> v_free_count.  "cache" pages are free pages that just happen to have
> >> known contents.  If I read this change right, zfs arc will now discard
> >> checksummed cache pages to make room for non-checksummed pages:
> >
> > That test is still there, so if it needs to it will still trigger.
> >
> > However, that is often at a lower level, as vm_pageout_wakeup_thresh is
> > only 110% of min free, whereas zfs_arc_free_target is based on target
> > free, which is 4 * (min free + reserved).
> >
> >> +	if (kmem_free_count() < zfs_arc_free_target) {
> >> +		return (1);
> >> +	}
> >> ...
> >> +kmem_free_count(void)
> >> +{
> >> +	return (vm_cnt.v_free_count);
> >> +}
> >>
> >> This seems like a pretty substantial behavior change.  I'm concerned
> >> that it doesn't appear to count all the forms of "free" pages.
> >>
> >> I haven't seen the problems with the over-aggressive ARC since the page
> >> daemon bug was fixed.  It's been working fine under pretty abusive
> >> loads in the freebsd cluster after that fix.
> >
> > Others have also confirmed that even with r265945 they can still trigger
> > the performance issue.
> >
> > In addition, without it we still have loads of RAM sat there unused; in
> > my particular experience we have 40GB of 192GB sitting there unused, and
> > that was with a stable build from last weekend.
>
> The Solaris code only imposed this limit on 32-bit machines where the
> available kernel virtual address space may be much less than the
> available physical memory.  Previously, FreeBSD imposed this limit on
> both 32-bit and 64-bit machines.  Now, it imposes it on neither.  Why
> continue to do this differently from Solaris?

Since the question was asked below, we don't have zfs machines in the cluster
running i386.  We can barely get them to boot as it is due to kva pressure.
We have to reduce/cap physical memory and change the user/kernel virtual
split from 3:1 to 2.5:1.5.

We do run zfs on small amd64 machines with 2G of ram, but I can't imagine it
working on the 10G i386 PAE machines that we have.
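Coming back to the threshold arithmetic quoted above: to put the relative
sizes in perspective, here is a minimal standalone sketch (not kernel code)
using the formulas as described in the thread.  The v_free_min and
v_free_reserved values are invented for illustration; on a real system they
are derived from installed memory at boot and the exact formulas live in the
VM code, so treat this only as an order-of-magnitude picture.

/*
 * Standalone illustration of the two thresholds described above.
 * Page counts are hypothetical, not taken from any real machine.
 */
#include <stdio.h>

int
main(void)
{
	unsigned int v_free_min = 25000;	/* hypothetical, in pages */
	unsigned int v_free_reserved = 5000;	/* hypothetical, in pages */

	/* "only 110% of min free" */
	unsigned int wakeup_thresh = v_free_min / 10 * 11;

	/* "target free, which is 4 * (min free + reserved)" */
	unsigned int free_target = 4 * (v_free_min + v_free_reserved);

	printf("vm_pageout_wakeup_thresh: %u pages\n", wakeup_thresh);
	printf("zfs_arc_free_target (from target free): %u pages\n", free_target);
	return (0);
}

With those invented numbers the pageout wakeup threshold sits around 27,500
pages while the ARC free target sits at 120,000, so the new test starts
trimming the ARC well before the pager would have woken up.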
> > With the patch we confirmed that both RAM usage and performance for those
> > seeing that issue are resolved, with no reported regressions.
> >
> >> (I should know better than to fire a reply off before full fact
> >> checking, but this commit worries me..)
> >
> > Not a problem, it's great to know people pay attention to changes, and
> > raise their concerns.  Always better to have a discussion about potential
> > issues than to wait for a problem to occur.
> >
> > Hopefully the above gives you some peace of mind, but if you still have
> > any concerns I'm all ears.
>
> You didn't really address Peter's initial technical issue.  Peter
> correctly observed that cache pages are just another flavor of free
> pages.  Whenever the VM system is checking the number of free pages
> against any of the thresholds, it always uses the sum of v_cache_count
> and v_free_count.  So, to anyone familiar with the VM system, like
> Peter, what you've done, which is to derive a threshold from
> v_free_target but only compare v_free_count to that threshold, looks
> highly suspect.  I think I'd like to see something like this:

Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
===================================================================
--- cddl/compat/opensolaris/kern/opensolaris_kmem.c	(revision 270824)
+++ cddl/compat/opensolaris/kern/opensolaris_kmem.c	(working copy)
@@ -152,7 +152,8 @@
 kmem_free_count(void)
 {
 
-	return (vm_cnt.v_free_count);
+	/* "cache" is just a flavor of free pages in FreeBSD */
+	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
 }
 
 u_int

The rest of the system looks at the "big picture": it would be happy to let
the "free" pool run quite a way down so long as there are "cache" pages
available to satisfy the free space requirements.  This would lead ZFS to
mistakenly sacrifice ARC for no reason.  I'm not sure how big a deal this
is, but I can't imagine many scenarios where I want ARC to be discarded in
order to save some effectively free pages.
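To make that scenario concrete, here is a minimal userland sketch of the
difference between a test that counts only free pages and one that counts
free plus cache pages.  The page counts and thresholds are invented purely
for illustration; in the kernel they would be the vm_cnt fields and tunables
discussed above.

/*
 * Standalone sketch of the "free vs. free+cache" point above.
 * All numbers are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical snapshot: little on the free list, lots of cache pages. */
static unsigned int v_free_count = 30000;
static unsigned int v_cache_count = 250000;

/* Hypothetical thresholds, in pages. */
static unsigned int vm_pageout_wakeup_thresh = 27500;
static unsigned int zfs_arc_free_target = 120000;

/* Old-style test: cache pages count as free, like vm_paging_needed(). */
static bool
reclaim_old(void)
{
	return (v_free_count + v_cache_count < vm_pageout_wakeup_thresh);
}

/* New test as committed: only the free list is considered. */
static bool
reclaim_new(void)
{
	return (v_free_count < zfs_arc_free_target);
}

int
main(void)
{
	printf("old test wants ARC reclaim: %s\n", reclaim_old() ? "yes" : "no");
	printf("new test wants ARC reclaim: %s\n", reclaim_new() ? "yes" : "no");
	return (0);
}

In this snapshot the old test leaves the ARC alone (280,000 effectively free
pages is far above the wakeup threshold) while the new test trims it.  With
the kmem_free_count() change suggested above, the new test would also see
280,000 pages and leave the ARC alone, which is the outcome being argued for
here.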
> That said, I can easily believe that your patch works better than the
> existing code, because it is closer in spirit to my interpretation of
> what the Solaris code does.  Specifically, I believe that the Solaris
> code starts trimming the ARC before the Solaris page daemon starts
> writing dirty pages to secondary storage.  Now, you've made FreeBSD do
> the same.  However, you've expressed it in a way that looks broken.
>
> To wrap up, I think that you can easily write this in a way that
> simultaneously behaves like Solaris and doesn't look wrong to a VM expert.
>
> > Out of interest would it be possible to update machines in the cluster
> > to see how their workload reacts to the change?
> >
> > Regards
> > Steve

I'd like to see the free vs cache thing resolved first, but it's going to be
tricky to get a comparison.

For the first few months of the year, things were really troublesome.  It
was quite easy to overtax the machines and run them into the ground.

This is not the case now - things are working pretty well under pressure
(prior to the commit).  It's got to the point that we feel comfortable
thrashing the machines really hard again.  Getting a comparison when it
already works well is going to be tricky.

We don't have large memory machines that aren't already tuned with
vfs.zfs.arc_max caps for tmpfs use.

For context to the wider audience, we do not run -release or -pN in the
freebsd cluster.  We mostly run -current, and some -stable.  I am well aware
that there is significant discomfort in 10.0-R with zfs, but we already have
the fixes for that.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…