Date:        Fri, 29 Aug 2014 13:33:14 -0700
From:        Peter Wemm <peter@wemm.org>
To:          Steven Hartland <smh@freebsd.org>
Cc:          src-committers@freebsd.org, Alan Cox <alc@rice.edu>, svn-src-all@freebsd.org, Dmitry Morozovsky <marck@rinet.ru>, "Matthew D. Fuller" <fullermd@over-yonder.net>, svn-src-head@freebsd.org
Subject:     Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
Message-ID:  <64121723.0IFfex9X4X@overcee.wemm.org>
In-Reply-To: <5A300D962A1B458B951D521EA2BE35E8@multiplay.co.uk>
References:  <201408281950.s7SJo90I047213@svn.freebsd.org> <1592506.xpuae4IYcM@overcee.wemm.org> <5A300D962A1B458B951D521EA2BE35E8@multiplay.co.uk>
On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> > On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> snip...
> > > > With the patch we confirmed that both RAM usage and performance
> > > > for those seeing that issue are resolved, with no reported
> > > > regressions.
> > > >
> > > >> (I should know better than to fire a reply off before full fact
> > > >> checking, but this commit worries me..)
> > > >
> > > > Not a problem, it's great to know people pay attention to changes,
> > > > and raise their concerns. Always better to have a discussion about
> > > > potential issues than to wait for a problem to occur.
> > > >
> > > > Hopefully the above gives you some peace of mind, but if you still
> > > > have any concerns I'm all ears.
> > >
> > > You didn't really address Peter's initial technical issue. Peter
> > > correctly observed that cache pages are just another flavor of free
> > > pages. Whenever the VM system is checking the number of free pages
> > > against any of the thresholds, it always uses the sum of v_cache_count
> > > and v_free_count. So, to anyone familiar with the VM system, like
> > > Peter, what you've done, which is to derive a threshold from
> > > v_free_target but only compare v_free_count to that threshold, looks
> > > highly suspect.
> >
> > I think I'd like to see something like this:
> >
> > Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
> > ===================================================================
> > --- cddl/compat/opensolaris/kern/opensolaris_kmem.c (revision 270824)
> > +++ cddl/compat/opensolaris/kern/opensolaris_kmem.c (working copy)
> > @@ -152,7 +152,8 @@
> >  kmem_free_count(void)
> >  {
> > 
> > -	return (vm_cnt.v_free_count);
> > +	/* "cache" is just a flavor of free pages in FreeBSD */
> > +	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
> >  }
> > 
> >  u_int
>
> This has apparently already been tried and the response from Karl was:
>
> - No, because memory in "cache" is subject to being either reallocated
> - or freed. When I was developing this patch that was my first impression
> - as well and how I originally coded it, and it turned out to be wrong.
> -
> - The issue here is that you have two parts of the system contending for
> - RAM -- the VM system generally, and the ARC cache. If the ARC cache
> - frees space before the VM system activates and starts pruning then you
> - wind up with the ARC pinned at the minimum after some period of time,
> - because it releases "early."
>
> I've asked him if he would retest just to be sure.
>
> > The rest of the system looks at the "big picture": it would be happy to
> > let the "free" pool run quite a way down so long as there's "cache"
> > pages available to satisfy the free space requirements. This would lead
> > ZFS to mistakenly sacrifice ARC for no reason. I'm not sure how big a
> > deal this is, but I can't imagine many scenarios where I want ARC to be
> > discarded in order to save some effectively free pages.
>
> From Karl's response from the original PR (above) it seems like this
> causes unexpected behaviour due to the two systems being separate.
>
> > > That said, I can easily believe that your patch works better than the
> > > existing code, because it is closer in spirit to my interpretation of
> > > what the Solaris code does. Specifically, I believe that the Solaris
> > > code starts trimming the ARC before the Solaris page daemon starts
> > > writing dirty pages to secondary storage. Now, you've made FreeBSD do
> > > the same. However, you've expressed it in a way that looks broken.
> > >
> > > To wrap up, I think that you can easily write this in a way that
> > > simultaneously behaves like Solaris and doesn't look wrong to a VM
> > > expert.
> > >
> > > > Out of interest would it be possible to update machines in the
> > > > cluster to see how their workload reacts to the change?
> >
> > I'd like to see the free vs cache thing resolved first, but it's going
> > to be tricky to get a comparison.
>
> Does Karl's explanation as to why this doesn't work above change your
> mind?

Actually no, I would expect the code as committed would *cause* the
undesirable behavior that Karl described.

i.e.: access a few large files and cause them to reside in cache, say 50GB
or so on a 200GB RAM machine. We now have the state where:

v_cache = 50GB
v_free = 1MB

The rest of the VM system looks at vm_paging_needed(), which asks whether
"v_cache + v_free" has fallen below the paging threshold. Since there's
50.001GB effectively free, the answer is no. It'll let v_free run right
down to v_free_min because of the giant pool of v_cache just sitting
there, waiting to be used.

The zfs change, as committed, will ignore all the free memory in the form
of v_cache, will be freaking out about how low v_free is getting, and will
be sacrificing ARC in order to put more memory into the v_free pool.

As long as ARC keeps sacrificing itself this way, the free pages in the
v_cache pool won't get used. When ARC finally runs out of pages to give up
to v_free, the kernel will start using the free pages from v_cache.
Eventually it'll run down that v_cache free pool, and ARC will be stuck at
a bare minimum the whole time this is happening.

Meanwhile, ZFS ARC will be crippled. This has consequences - it does
RCU-like things from ARC to keep fragmentation under control. With ARC
crippled, fragmentation will increase because there's less opportunistic
gathering of data from ARC.

Granted, you have to get things freed from active/inactive to the cache
state, but once it's there, depending on the workload, it'll mess with
ARC.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…
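
To make the accounting mismatch described above concrete, here is a minimal,
self-contained userland sketch of the two checks. The field names mirror the
vm_cnt counters discussed in the thread, but the function names, threshold
values, and comparison logic are a simplified illustrative model, not the
actual FreeBSD pageout code or the r270759 change.

```c
/*
 * Illustrative model only: shows how a check based on v_free alone can
 * report memory pressure while a check based on v_free + v_cache does not.
 */
#include <stdbool.h>
#include <stdio.h>

struct vm_counters {
	unsigned long v_free_count;	/* pages on the free queue */
	unsigned long v_cache_count;	/* clean, immediately reusable pages */
	unsigned long v_free_min;	/* pageout wakes up below this */
	unsigned long v_free_target;	/* pageout tries to restore this */
};

/* Model of the VM view: free and cache pages are counted together. */
static bool
paging_needed_model(const struct vm_counters *c)
{
	return (c->v_free_count + c->v_cache_count < c->v_free_min);
}

/* Model of the committed ZFS check: only v_free is compared to a target. */
static bool
arc_reclaim_needed_model(const struct vm_counters *c)
{
	return (c->v_free_count < c->v_free_target);
}

int
main(void)
{
	/* Peter's example, in 4KB pages: ~50GB of cache, ~1MB free. */
	struct vm_counters c = {
		.v_free_count  = 256,		/* ~1MB */
		.v_cache_count = 13107200,	/* ~50GB */
		.v_free_min    = 8192,		/* hypothetical threshold */
		.v_free_target = 32768,		/* hypothetical threshold */
	};

	/* VM sees ~50GB of reusable pages, so no paging pressure (prints 0)... */
	printf("paging_needed:      %d\n", paging_needed_model(&c));
	/* ...but the v_free-only check reports pressure (prints 1) and keeps shrinking ARC. */
	printf("arc_reclaim_needed: %d\n", arc_reclaim_needed_model(&c));
	return (0);
}
```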