From: Peter Wemm
To: Steven Hartland
Cc: src-committers@freebsd.org, Alan Cox, svn-src-all@freebsd.org,
 Dmitry Morozovsky, "Matthew D. Fuller", svn-src-head@freebsd.org
Subject: Re: svn commit: r270759 - in head/sys:
 cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys
 cddl/contrib/opensolaris/uts/common/fs/zfs vm
Date: Fri, 29 Aug 2014 13:33:14 -0700
Message-ID: <64121723.0IFfex9X4X@overcee.wemm.org>
In-Reply-To: <5A300D962A1B458B951D521EA2BE35E8@multiplay.co.uk>
References: <201408281950.s7SJo90I047213@svn.freebsd.org>
 <1592506.xpuae4IYcM@overcee.wemm.org>
 <5A300D962A1B458B951D521EA2BE35E8@multiplay.co.uk>

On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> > On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> snip...
> > > > With the patch we confirmed that both RAM usage and performance
> > > > for those seeing that issue are resolved, with no reported
> > > > regressions.
> > > >
> > > > > (I should know better than to fire a reply off before full
> > > > > fact checking, but this commit worries me..)
> > > >
> > > > Not a problem, it's great to know people pay attention to
> > > > changes, and raise their concerns. Always better to have a
> > > > discussion about potential issues than to wait for a problem to
> > > > occur.
> > > >
> > > > Hopefully the above gives you some peace of mind, but if you
> > > > still have any concerns I'm all ears.
> > >
> > > You didn't really address Peter's initial technical issue.  Peter
> > > correctly observed that cache pages are just another flavor of
> > > free pages.  Whenever the VM system is checking the number of free
> > > pages against any of the thresholds, it always uses the sum of
> > > v_cache_count and v_free_count.  So, to anyone familiar with the
> > > VM system, like Peter, what you've done, which is to derive a
> > > threshold from v_free_target but only compare v_free_count to that
> > > threshold, looks highly suspect.
> >
> > I think I'd like to see something like this:
> >
> > Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
> > ===================================================================
> > --- cddl/compat/opensolaris/kern/opensolaris_kmem.c (revision 270824)
> > +++ cddl/compat/opensolaris/kern/opensolaris_kmem.c (working copy)
> > @@ -152,7 +152,8 @@
> >
> > kmem_free_count(void)
> > {
> >
> > - return (vm_cnt.v_free_count);
> > + /* "cache" is just a flavor of free pages in FreeBSD */
> > + return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
> >
> > }
> >
> > u_int
>
> This has apparently already been tried and the response from Karl was:
>
> - No, because memory in "cache" is subject to being either
> - reallocated or freed.
> - When I was developing this patch that was my first impression as
> - well and how I originally coded it, and it turned out to be wrong.
> -
> - The issue here is that you have two parts of the system contending
> - for RAM -- the VM system generally, and the ARC cache.  If the ARC
> - cache frees space before the VM system activates and starts pruning
> - then you wind up with the ARC pinned at the minimum after some
> - period of time, because it releases "early."
>
> I've asked him if he would retest just to be sure.
>
> > The rest of the system looks at the "big picture"; it would be happy
> > to let the "free" pool run quite a way down so long as there are
> > "cache" pages available to satisfy the free space requirements.
> > This would lead ZFS to mistakenly sacrifice ARC for no reason.  I'm
> > not sure how big a deal this is, but I can't imagine many scenarios
> > where I want ARC to be discarded in order to save some effectively
> > free pages.
>
> From Karl's response from the original PR (above) it seems like this
> causes unexpected behaviour due to the two systems being separate.
>
> > > That said, I can easily believe that your patch works better than
> > > the existing code, because it is closer in spirit to my
> > > interpretation of what the Solaris code does.  Specifically, I
> > > believe that the Solaris code starts trimming the ARC before the
> > > Solaris page daemon starts writing dirty pages to secondary
> > > storage.  Now, you've made FreeBSD do the same.  However, you've
> > > expressed it in a way that looks broken.
> > >
> > > To wrap up, I think that you can easily write this in a way that
> > > simultaneously behaves like Solaris and doesn't look wrong to a VM
> > > expert.
> > >
> > > > Out of interest would it be possible to update machines in the
> > > > cluster to see how their workload reacts to the change?
> >
> > I'd like to see the free vs cache thing resolved first but it's
> > going to be tricky to get a comparison.
>
> Does Karl's explanation as to why this doesn't work above change your
> mind?

Actually no, I would expect the code as committed would *cause* the
undesirable behavior that Karl described.

i.e.: access a few large files and cause them to reside in cache.  Say
50GB or so on a 200GB RAM machine.  We now have the state where:

v_cache = 50GB
v_free = 1MB

The rest of the vm system looks at vm_paging_needed(), which asks: are
we short on "v_cache + v_free"?  Since there's 50.001GB free, the answer
is no.  It'll let v_free run right down to v_free_min because of the
giant pool of v_cache just sitting there, waiting to be used.

The zfs change, as committed, will ignore all the free memory in the
form of v_cache, and will be freaking out about how low v_free is
getting and will be sacrificing ARC in order to put more memory into the
v_free pool.
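
To make the numbers above concrete, here is a minimal C sketch of the
two ways of counting "free" memory.  It is only an illustration, not the
committed code: struct vm_snapshot, bytes_to_pages(), the two
arc_should_shrink_*() helpers, example_snapshot(), the 4KB page size and
the threshold passed in by the caller are all made-up names and values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096ULL	/* assumed page size, for illustration only */

/* Hypothetical snapshot of the two counters discussed above, in pages. */
struct vm_snapshot {
	uint64_t free_pages;	/* stands in for vm_cnt.v_free_count */
	uint64_t cache_pages;	/* stands in for vm_cnt.v_cache_count */
};

/* Convert a byte count to whole pages. */
static uint64_t
bytes_to_pages(uint64_t bytes)
{
	return (bytes / PAGE_BYTES);
}

/* Free-only check: "cache" pages are ignored entirely. */
static bool
arc_should_shrink_free_only(const struct vm_snapshot *s, uint64_t threshold)
{
	assert(s != NULL);
	return (s->free_pages < threshold);
}

/* Free-plus-cache check: "cache" pages count as free pages. */
static bool
arc_should_shrink_free_plus_cache(const struct vm_snapshot *s,
    uint64_t threshold)
{
	assert(s != NULL);
	return (s->free_pages + s->cache_pages < threshold);
}

/* The example state from the text: 50GB in cache, 1MB free. */
static struct vm_snapshot
example_snapshot(void)
{
	struct vm_snapshot s;

	s.free_pages = bytes_to_pages(1ULL << 20);
	s.cache_pages = bytes_to_pages(50ULL << 30);
	return (s);
}
```

With the example state and an invented threshold of 1GB worth of pages,
the free-only check asks ARC to shrink while the free-plus-cache check
does not, which is the disagreement described above.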

As long as ARC keeps sacrificing itself this way, the free pages in the
v_cache pool won't get used.  When ARC finally runs out of pages to give
up to v_free, the kernel will start using the free pages from v_cache.
Eventually it'll run down that v_cache free pool and ARC will be in a
bare minimum state while this is happening.

Meanwhile, ZFS ARC will be crippled.  This has consequences - it does
RCU-like things from ARC to keep fragmentation under control.  With ARC
crippled, fragmentation will increase because there's less opportunistic
gathering of data from ARC.

Granted, you have to get things freed from active/inactive to the cache
state, but once it's there, depending on the workload, it'll mess with
ARC.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…