From: Peter Wemm
To: Alan Cox
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, Dmitry Morozovsky,
 "Matthew D. Fuller", svn-src-head@freebsd.org, Steven Hartland
Subject: Re: svn commit: r270759 - in head/sys:
 cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys
 cddl/contrib/opensolaris/uts/common/fs/zfs vm
Date: Fri, 29 Aug 2014 12:26:57 -0700
Message-ID: <1592506.xpuae4IYcM@overcee.wemm.org>
In-Reply-To: <5400B052.6030103@rice.edu>
References: <201408281950.s7SJo90I047213@svn.freebsd.org>
 <4A4B2C2D36064FD9840E3603D39E58E0@multiplay.co.uk>
 <5400B052.6030103@rice.edu>
List-Id: SVN commit messages for the src tree for head/-current

On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> On 08/29/2014 03:32, Steven Hartland wrote:
> >> On Thursday 28 August 2014 17:30:17 Alan Cox wrote:
> >> > On 08/28/2014 16:15, Matthew D. Fuller wrote:
> >> > > On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
> >> > > Steven Hartland, and lo! it spake thus:
> >> > >> It's very likely applicable to stable/9 although I've never used 9
> >> > >> myself, we jumped from 9 direct to 10.
> >> > >
> >> > > This is actually hitting two different issues from the two bugs:
> >> > >
> >> > > - 191510 is about "ARC isn't greedy enough" on huge-memory
> >> > >   machines, and from the osreldate that bug was filed on 9.2, so
> >> > >   presumably is applicable.
> >> > >
> >> > > - 187594 is about "ARC is too greedy" (probably mostly on
> >> > >   not-so-huge machines) and starves/drives the rest of the system
> >> > >   into swap. That I believe came about as a result of some
> >> > >   unrelated change in the 10.x stream that upset the previous
> >> > >   balance between ARC and the rest of the VM, so isn't a problem
> >> > >   on 9.x.
> >> >
> >> > 10.0 had a bug in the page daemon that was fixed in 10-STABLE about
> >> > three months ago (r265945). The ARC was not the only thing affected
> >> > by this bug.
> >>
> >> I'm concerned about potential unintended consequences of this change.
> >>
> >> Before, arc reclaim was driven by vm_paging_needed(), which was:
> >> vm_paging_needed(void)
> >> {
> >>
> >>         return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
> >>             vm_pageout_wakeup_thresh);
> >> }
> >>
> >> Now it's ignoring the v_cache_count and looking exclusively at
> >> v_free_count. "cache" pages are free pages that just happen to have
> >> known contents. If I read this change right, zfs arc will now discard
> >> checksummed cache pages to make room for non-checksummed pages:
> >
> > That test is still there so if it needs to it will still trigger.
> >
> > However, that is often at a lower level, as vm_pageout_wakeup_thresh is
> > only 110% of min free, whereas zfs_arc_free_target is based on target
> > free, which is 4 * (min free + reserved).
> >
> >> +       if (kmem_free_count() < zfs_arc_free_target) {
> >> +               return (1);
> >> +       }
> >> ...
> >> +kmem_free_count(void)
> >> +{
> >> +       return (vm_cnt.v_free_count);
> >> +}
> >>
> >> This seems like a pretty substantial behavior change. I'm concerned
> >> that it doesn't appear to count all the forms of "free" pages.
> >>
> >> I haven't seen the problems with the over-aggressive ARC since the
> >> page daemon bug was fixed. It's been working fine under pretty abusive
> >> loads in the freebsd cluster after that fix.
> >
> > Others have also confirmed that even with r265945 they can still trigger
> > performance issues.
> >
> > In addition, without it we still have loads of RAM sat there unused; in
> > my particular experience we have 40GB of 192GB sitting there unused, and
> > that was with a stable build from last weekend.
>
> The Solaris code only imposed this limit on 32-bit machines where the
> available kernel virtual address space may be much less than the
> available physical memory. Previously, FreeBSD imposed this limit on
> both 32-bit and 64-bit machines. Now, it imposes it on neither.
> Why continue to do this differently from Solaris?

Since the question was asked below, we don't have zfs machines in the cluster
running i386. We can barely get them to boot as it is due to kva pressure.
We have to reduce/cap physical memory and change the user/kernel virtual split
from 3:1 to 2.5:1.5.

We do run zfs on small amd64 machines with 2G of ram, but I can't imagine it
working on the 10G i386 PAE machines that we have.

> > With the patch we confirmed that both RAM usage and performance for those
> > seeing that issue are resolved, with no reported regressions.
> >
> >> (I should know better than to fire a reply off before full fact
> >> checking, but this commit worries me..)
> >
> > Not a problem, it's great to know people pay attention to changes, and
> > raise their concerns. Always better to have a discussion about potential
> > issues than to wait for a problem to occur.
> >
> > Hopefully the above gives you some peace of mind, but if you still
> > have any concerns I'm all ears.
>
> You didn't really address Peter's initial technical issue. Peter
> correctly observed that cache pages are just another flavor of free
> pages. Whenever the VM system is checking the number of free pages
> against any of the thresholds, it always uses the sum of v_cache_count
> and v_free_count. So, to anyone familiar with the VM system, like
> Peter, what you've done, which is to derive a threshold from
> v_free_target but only compare v_free_count to that threshold, looks
> highly suspect.

I think I'd like to see something like this:

Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
===================================================================
--- cddl/compat/opensolaris/kern/opensolaris_kmem.c	(revision 270824)
+++ cddl/compat/opensolaris/kern/opensolaris_kmem.c	(working copy)
@@ -152,7 +152,8 @@
 kmem_free_count(void)
 {
 
-	return (vm_cnt.v_free_count);
+	/* "cache" is just a flavor of free pages in FreeBSD */
+	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
 }
 
 u_int

The rest of the system looks at the "big picture": it would be happy to let
the "free" pool run quite a way down so long as there are "cache" pages
available to satisfy the free space requirements. Comparing only v_free_count
would lead ZFS to mistakenly sacrifice ARC for no reason. I'm not sure how big
a deal this is, but I can't imagine many scenarios where I want ARC to be
discarded in order to save some effectively free pages.

> That said, I can easily believe that your patch works better than the
> existing code, because it is closer in spirit to my interpretation of
> what the Solaris code does. Specifically, I believe that the Solaris
> code starts trimming the ARC before the Solaris page daemon starts
> writing dirty pages to secondary storage. Now, you've made FreeBSD do
> the same. However, you've expressed it in a way that looks broken.
>
> To wrap up, I think that you can easily write this in a way that
> simultaneously behaves like Solaris and doesn't look wrong to a VM expert.
>
> > Out of interest would it be possible to update machines in the cluster to
> > see how their workload reacts to the change?
> >
> > Regards
> > Steve

I'd like to see the free vs cache thing resolved first but it's going to be
tricky to get a comparison.
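
As a rough illustration of the free-versus-cache concern, here is a minimal
userland sketch. It is not kernel code: struct page_counts,
arc_reclaim_free_only() and arc_reclaim_free_plus_cache() are hypothetical
names made up for this example, and only the comparison of v_free_count
(with or without v_cache_count) against a target is taken from the code
quoted in this thread.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Userland model of the two counters discussed in this thread. This is an
 * illustration only, not the kernel's vm_cnt structure.
 */
struct page_counts {
	unsigned int free_count;	/* stands in for vm_cnt.v_free_count */
	unsigned int cache_count;	/* stands in for vm_cnt.v_cache_count */
};

/* Free-only comparison, mirroring the committed kmem_free_count(). */
bool
arc_reclaim_free_only(const struct page_counts *pc, unsigned int target)
{
	return (pc->free_count < target);
}

/* Comparison with "cache" counted as free, mirroring the proposed patch. */
bool
arc_reclaim_free_plus_cache(const struct page_counts *pc, unsigned int target)
{
	/* Guard the illustrative sum against unsigned wraparound. */
	assert(pc->cache_count <= UINT_MAX - pc->free_count);
	return (pc->free_count + pc->cache_count < target);
}
```

With made-up numbers of 2000 free pages and 50000 cache pages against a
target of 5000, the free-only check asks for ARC reclaim while the combined
check does not, which is the disagreement being discussed.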

For the first few months of the year, things were really troublesome. It was
quite easy to overtax the machines and run them into the ground.

This is not the case now - things are working pretty well under pressure
(prior to the commit). It's got to the point that we feel comfortable
thrashing the machines really hard again. Getting a comparison when it
already works well is going to be tricky.

We don't have large memory machines that aren't already tuned with
vfs.zfs.arc_max caps for tmpfs use.

For context to the wider audience, we do not run -release or -pN in the
freebsd cluster. We mostly run -current, and some -stable. I am well aware
that there is significant discomfort in 10.0-R with zfs but we already have
the fixes for that.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…