Date:      Fri, 29 Aug 2014 13:33:14 -0700
From:      Peter Wemm <peter@wemm.org>
To:        Steven Hartland <smh@freebsd.org>
Cc:        src-committers@freebsd.org, Alan Cox <alc@rice.edu>, svn-src-all@freebsd.org, Dmitry Morozovsky <marck@rinet.ru>, "Matthew D. Fuller" <fullermd@over-yonder.net>, svn-src-head@freebsd.org
Subject:   Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
Message-ID:  <64121723.0IFfex9X4X@overcee.wemm.org>
In-Reply-To: <5A300D962A1B458B951D521EA2BE35E8@multiplay.co.uk>
References:  <201408281950.s7SJo90I047213@svn.freebsd.org> <1592506.xpuae4IYcM@overcee.wemm.org> <5A300D962A1B458B951D521EA2BE35E8@multiplay.co.uk>

On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> > On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> snip...
> > > > With the patch we confirmed that both RAM usage and performance
> > > > for those seeing that issue are resolved, with no reported
> > > > regressions.
> > > >
> > > >> (I should know better than to fire a reply off before full fact
> > > >> checking, but this commit worries me..)
> > > >
> > > > Not a problem, it's great to know people pay attention to changes,
> > > > and raise their concerns. Always better to have a discussion about
> > > > potential issues than to wait for a problem to occur.
> > > >
> > > > Hopefully the above gives you some peace of mind, but if you still
> > > > have any concerns I'm all ears.
> > >
> > > You didn't really address Peter's initial technical issue.  Peter
> > > correctly observed that cache pages are just another flavor of free
> > > pages.  Whenever the VM system is checking the number of free pages
> > > against any of the thresholds, it always uses the sum of
> > > v_cache_count and v_free_count.  So, to anyone familiar with the VM
> > > system, like Peter, what you've done, which is to derive a threshold
> > > from v_free_target but only compare v_free_count to that threshold,
> > > looks highly suspect.
> >
> > I think I'd like to see something like this:
> >
> > Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
> > ===================================================================
> > --- cddl/compat/opensolaris/kern/opensolaris_kmem.c (revision 270824)
> > +++ cddl/compat/opensolaris/kern/opensolaris_kmem.c (working copy)
> > @@ -152,7 +152,8 @@
> >  kmem_free_count(void)
> >  {
> > -	return (vm_cnt.v_free_count);
> > +	/* "cache" is just a flavor of free pages in FreeBSD */
> > +	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
> >  }
> >
> >  u_int
>
> This has apparently already been tried and the response from Karl was:
>
> - No, because memory in "cache" is subject to being either reallocated
> - or freed.  When I was developing this patch that was my first
> - impression as well and how I originally coded it, and it turned out
> - to be wrong.
> -
> - The issue here is that you have two parts of the system contending
> - for RAM -- the VM system generally, and the ARC cache.  If the ARC
> - cache frees space before the VM system activates and starts pruning
> - then you wind up with the ARC pinned at the minimum after some period
> - of time, because it releases "early."
>
> I've asked him if he would retest just to be sure.
>
> > The rest of the system looks at the "big picture": it would be happy
> > to let the "free" pool run quite a way down so long as there's "cache"
> > pages available to satisfy the free space requirements.  This would
> > lead ZFS to mistakenly sacrifice ARC for no reason.  I'm not sure how
> > big a deal this is, but I can't imagine many scenarios where I want
> > ARC to be discarded in order to save some effectively free pages.
>
> From Karl's response from the original PR (above) it seems like this
> causes unexpected behaviour due to the two systems being separate.
>
> > > That said, I can easily believe that your patch works better than
> > > the existing code, because it is closer in spirit to my
> > > interpretation of what the Solaris code does.  Specifically, I
> > > believe that the Solaris code starts trimming the ARC before the
> > > Solaris page daemon starts writing dirty pages to secondary storage.
> > > Now, you've made FreeBSD do the same.  However, you've expressed it
> > > in a way that looks broken.
> > >
> > > To wrap up, I think that you can easily write this in a way that
> > > simultaneously behaves like Solaris and doesn't look wrong to a VM
> > > expert.
> > >
> > > > Out of interest would it be possible to update machines in the
> > > > cluster to see how their workload reacts to the change?
> >
> > I'd like to see the free vs cache thing resolved first but it's going
> > to be tricky to get a comparison.
>
> Does Karl's explanation as to why this doesn't work above change your
> mind?

Actually no; I would expect the code as committed to *cause* the
undesirable behavior that Karl described.

i.e.: access a few large files and cause them to reside in cache.  Say 50GB
or so on a machine with 200GB of RAM.  We now have the state where:

v_cache = 50GB
v_free = 1MB

The rest of the VM system looks at vm_paging_needed(), which asks: do we
have enough "v_cache + v_free"?  Since there's 50.001GB effectively free,
the answer is no, paging is not needed.  It'll let v_free run right down to
v_free_min because of the giant pool of v_cache just sitting there, waiting
to be used.

The ZFS change, as committed, will ignore all the free memory in the form
of v_cache, will be freaking out about how low v_free is getting, and will
be sacrificing ARC in order to put more memory into the v_free pool.

As long as ARC keeps sacrificing itself this way, the free pages in the
v_cache pool won't get used.  When ARC finally runs out of pages to give up
to v_free, the kernel will start using the free pages from v_cache.
Eventually it'll run down that v_cache free pool, and ARC will be in a
bare-minimum state while this is happening.

Meanwhile, ZFS ARC will be crippled.  This has consequences: it does
RCU-like things from ARC to keep fragmentation under control.  With ARC
crippled, fragmentation will increase because there's less opportunistic
gathering of data from ARC.

Granted, you have to get things freed from active/inactive to the cache
state, but once it's there, depending on the workload, it'll mess with ARC.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won't do…



