Date:      Fri, 29 Aug 2014 12:26:57 -0700
From:      Peter Wemm <peter@wemm.org>
To:        Alan Cox <alc@rice.edu>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, Dmitry Morozovsky <marck@rinet.ru>, "Matthew D. Fuller" <fullermd@over-yonder.net>, svn-src-head@freebsd.org, Steven Hartland <smh@freebsd.org>
Subject:   Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
Message-ID:  <1592506.xpuae4IYcM@overcee.wemm.org>
In-Reply-To: <5400B052.6030103@rice.edu>
References:  <201408281950.s7SJo90I047213@svn.freebsd.org> <4A4B2C2D36064FD9840E3603D39E58E0@multiplay.co.uk> <5400B052.6030103@rice.edu>



On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> On 08/29/2014 03:32, Steven Hartland wrote:
> >> On Thursday 28 August 2014 17:30:17 Alan Cox wrote:
> >> > On 08/28/2014 16:15, Matthew D. Fuller wrote:
> >> > > On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
> >> > >
> >> > > Steven Hartland, and lo! it spake thus:
> >> > >> It's very likely applicable to stable/9 although I've never used 9
> >> > >> myself, we jumped from 9 direct to 10.
> >> > >
> >> > > This is actually hitting two different issues from the two bugs:
> >> > >
> >> > > - 191510 is about "ARC isn't greedy enough" on huge-memory machines,
> >> > >   and from the osreldate that bug was filed on 9.2, so presumably is
> >> > >   applicable.
> >> > >
> >> > > - 187594 is about "ARC is too greedy" (probably mostly on not-so-huge
> >> > >   machines) and starves/drives the rest of the system into swap.  That
> >> > >   I believe came about as a result of some unrelated change in the
> >> > >   10.x stream that upset the previous balance between ARC and the rest
> >> > >   of the VM, so isn't a problem on 9.x.
> >> >
> >> > 10.0 had a bug in the page daemon that was fixed in 10-STABLE about
> >> > three months ago (r265945).  The ARC was not the only thing affected
> >> > by this bug.
> >>
> >> I'm concerned about potential unintended consequences of this change.
> >>
> >> Before, arc reclaim was driven by vm_paging_needed(), which was:
> >> vm_paging_needed(void)
> >> {
> >>
> >>     return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
> >>         vm_pageout_wakeup_thresh);
> >> }
> >>
> >> Now it's ignoring the v_cache_count and looking exclusively at
> >> v_free_count.  "cache" pages are free pages that just happen to have
> >> known contents.  If I read this change right, zfs arc will now discard
> >> checksummed cache pages to make room for non-checksummed pages:
> > That test is still there, so if it needs to it will still trigger.
> >
> > However that is often at a lower level, as vm_pageout_wakeup_thresh is
> > only 110% of min free, whereas zfs_arc_free_target is based on target
> > free, which is 4 * (min free + reserved).
> >
> >> +       if (kmem_free_count() < zfs_arc_free_target) {
> >> +               return (1);
> >> +       }
> >> ...
> >> +kmem_free_count(void)
> >> +{
> >> +       return (vm_cnt.v_free_count);
> >> +}
> >>
> >> This seems like a pretty substantial behavior change.  I'm concerned
> >> that it doesn't appear to count all the forms of "free" pages.
> >>
> >> I haven't seen the problems with the over-aggressive ARC since the
> >> page daemon bug was fixed.  It's been working fine under pretty abusive
> >> loads in the freebsd cluster after that fix.
> >
> > Others have also confirmed that even with r265945 they can still trigger
> > the performance issue.
> >
> > In addition, without it we still have loads of RAM sitting there unused;
> > in my particular experience we have 40GB of 192GB sitting there unused,
> > and that was with a stable build from last weekend.
>
> The Solaris code only imposed this limit on 32-bit machines where the
> available kernel virtual address space may be much less than the
> available physical memory.  Previously, FreeBSD imposed this limit on
> both 32-bit and 64-bit machines.  Now, it imposes it on neither.  Why
> continue to do this differently from Solaris?

Since the question was asked below, we don't have zfs machines in the cluster
running i386.  We can barely get them to boot as it is due to kva pressure.
We have to reduce/cap physical memory and change the user/kernel virtual split
from 3:1 to 2.5:1.5.

We do run zfs on small amd64 machines with 2G of ram, but I can't imagine it
working on the 10G i386 PAE machines that we have.
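
For the curious, the tuning involved looks roughly like the sketch below; the
values are illustrative only, not what the cluster machines actually use.  The
i386 kernel VA size is grown with the KVA_PAGES kernel option, and physical
memory is capped from the loader:

	# /boot/loader.conf (illustrative)
	hw.physmem="4G"			# cap usable physical memory

	# i386 kernel config (illustrative)
	options 	KVA_PAGES=384	# 384 * 4MB = 1.5GB of kernel VA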


> > With the patch we confirmed that both RAM usage and performance for those
> > seeing that issue are resolved, with no reported regressions.
> >
> >> (I should know better than to fire a reply off before full fact
> >> checking, but this commit worries me..)
> >
> > Not a problem, it's great to know people pay attention to changes, and
> > raise their concerns.  Always better to have a discussion about potential
> > issues than to wait for a problem to occur.
> >
> > Hopefully the above gives you some peace of mind, but if you still
> > have any concerns I'm all ears.
>
> You didn't really address Peter's initial technical issue.  Peter
> correctly observed that cache pages are just another flavor of free
> pages.  Whenever the VM system is checking the number of free pages
> against any of the thresholds, it always uses the sum of v_cache_count
> and v_free_count.  So, to anyone familiar with the VM system, like
> Peter, what you've done, which is to derive a threshold from
> v_free_target but only compare v_free_count to that threshold, looks
> highly suspect.

I think I'd like to see something like this:

Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
===================================================================
--- cddl/compat/opensolaris/kern/opensolaris_kmem.c	(revision 270824)
+++ cddl/compat/opensolaris/kern/opensolaris_kmem.c	(working copy)
@@ -152,7 +152,8 @@
 kmem_free_count(void)
 {
 
-	return (vm_cnt.v_free_count);
+	/* "cache" is just a flavor of free pages in FreeBSD */
+	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
 }
 
 u_int


The rest of the system looks at the "big picture": it would be happy to let the
"free" pool run quite a way down so long as there are "cache" pages available to
satisfy the free space requirements.  This would lead ZFS to mistakenly
sacrifice ARC for no reason.  I'm not sure how big a deal this is, but I can't
imagine many scenarios where I want ARC to be discarded in order to save some
effectively free pages.
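
To make that concrete, here's a rough sketch of the check with the combined
count, in the spirit of the patch above.  This is illustrative only, not the
committed code; the _sketch names are stand-ins for the real counters and
tunables discussed in this thread:

/*
 * Illustrative only: an ARC low-memory check that treats "cache" pages
 * as free, mirroring the vm_paging_needed() accounting quoted above.
 */
#include <sys/types.h>

static u_int v_free_count_sketch;	/* stand-in for vm_cnt.v_free_count */
static u_int v_cache_count_sketch;	/* stand-in for vm_cnt.v_cache_count */
static u_int zfs_arc_free_target_sketch;

static u_int
kmem_free_count_sketch(void)
{

	/* "cache" pages are free pages that happen to have known contents. */
	return (v_free_count_sketch + v_cache_count_sketch);
}

static int
arc_reclaim_needed_sketch(void)
{

	/* Only start sacrificing ARC when genuinely free memory is short. */
	if (kmem_free_count_sketch() < zfs_arc_free_target_sketch)
		return (1);
	return (0);
}

Whether zfs_arc_free_target itself should be derived from v_free_target or from
the pageout wakeup threshold is a separate question; the sketch only changes
the free vs cache accounting.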

> That said, I can easily believe that your patch works better than the
> existing code, because it is closer in spirit to my interpretation of
> what the Solaris code does.  Specifically, I believe that the Solaris
> code starts trimming the ARC before the Solaris page daemon starts
> writing dirty pages to secondary storage.  Now, you've made FreeBSD do
> the same.  However, you've expressed it in a way that looks broken.
>
> To wrap up, I think that you can easily write this in a way that
> simultaneously behaves like Solaris and doesn't look wrong to a VM expert.
>
> > Out of interest would it be possible to update machines in the cluster to
> > see how their workload reacts to the change?
> >
> >    Regards
> >    Steve

I'd like to see the free vs cache thing resolved first, but it's going to be
tricky to get a comparison.

For the first few months of the year, things were really troublesome.  It was
quite easy to overtax the machines and run them into the ground.

This is not the case now - things are working pretty well under pressure
(prior to the commit).  It's got to the point that we feel comfortable
thrashing the machines really hard again.  Getting a comparison when it
already works well is going to be tricky.

We don't have large-memory machines that aren't already tuned with
vfs.zfs.arc_max caps for tmpfs use.
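
For anyone unfamiliar with it, that cap is just a boot-time tunable; the value
below is illustrative, not what the cluster machines actually use:

	# /boot/loader.conf (illustrative)
	vfs.zfs.arc_max="64G"		# leave RAM headroom for tmpfs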

For context to the wider audience, we do not run -release or -pN in the
freebsd cluster.  We mostly run -current, and some -stable.  I am well aware
that there is significant discomfort in 10.0-R with zfs but we already have the
fixes for that.
-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…



