Date: Fri, 29 Aug 2014 14:20:44 -0700
From: Peter Wemm <peter@wemm.org>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: src-committers@freebsd.org, Alan Cox <alc@rice.edu>, svn-src-all@freebsd.org, Dmitry Morozovsky <marck@rinet.ru>, "Matthew D. Fuller" <fullermd@over-yonder.net>, svn-src-head@freebsd.org
Subject: Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
Message-ID: <2714752.cWQfguSlQD@overcee.wemm.org>
In-Reply-To: <0B77E782B5004AEBA77E6A5D16924D83@multiplay.co.uk>
References: <201408281950.s7SJo90I047213@svn.freebsd.org> <64121723.0IFfex9X4X@overcee.wemm.org> <0B77E782B5004AEBA77E6A5D16924D83@multiplay.co.uk>

On Friday 29 August 2014 21:42:15 Steven Hartland wrote:
> ----- Original Message -----
> From: "Peter Wemm" <peter@wemm.org>
>
> > On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> snip..
>
> > > Does Karl's explanation as to why this doesn't work above change
> > > your mind?
> >
> > Actually no, I would expect the code as committed would *cause* the
> > undesirable behavior that Karl described.
> >
> > ie: access a few large files and cause them to reside in cache. Say
> > 50GB or so on a 200G ram machine. We now have the state where:
> >
> > v_cache = 50GB
> > v_free = 1MB
> >
> > The rest of the vm system looks at vm_paging_needed(), which is: do
> > we have enough "v_cache + v_free"? Since there's 50.001GB free, the
> > answer is no. It'll let v_free run right down to v_free_min because
> > of the giant pool of v_cache just sitting there, waiting to be used.
> >
> > The zfs change, as committed will ignore all the free memory in the
> > form of v_cache.. and will be freaking out about how low v_free is
> > getting and will be sacrificing ARC in order to put more memory into
> > the v_free pool.
> >
> > As long as ARC keeps sacrificing itself this way, the free pages in
> > the v_cache pool won't get used. When ARC finally runs out of pages
> > to give up to v_free, the kernel will start using the free pages from
> > v_cache. Eventually it'll run down that v_cache free pool and arc
> > will be in a bare minimum state while this is happening.
> >
> > Meanwhile, ZFS ARC will be crippled. This has consequences - it does
> > RCU like things from ARC to keep fragmentation under control. With
> > ARC crippled, fragmentation will increase because there's less
> > opportunistic gathering of data from ARC.
> >
> > Granted, you have to get things freed from active/inactive to the
> > cache state, but once it's there, depending on the workload, it'll
> > mess with ARC.
>
> There's already a vm_paging_needed() check in there below so this will
> already be dealt with will it not?

No.

If you read the code that you changed, you won't get that far. The v_free
test comes before vm_paging_needed(), and if the v_free test triggers then
ARC will return pages and not look at the rest of the function.

If this function returns nonzero, ARC pages are given back:

static int
arc_reclaim_needed(void)
{

        if (kmem_free_count() < zfs_arc_free_target) {
                return (1);
        }

        /*
         * Cooperate with pagedaemon when it's time for it to scan
         * and reclaim some pages.
         */
        if (vm_paging_needed()) {
                return (1);
        }

ie: if v_free (ignoring v_cache free pages) gets below the threshold, stop
everything and discard ARC pages.

The vm_paging_needed() code is a NO-OP at this point. It can never return
true. Consider:

        vm_cnt.v_free_target = 4 * vm_cnt.v_free_min + vm_cnt.v_free_reserved;
vs
        vm_pageout_wakeup_thresh = (vm_cnt.v_free_min / 10) * 11;

zfs_arc_free_target defaults to vm_cnt.v_free_target, which is 400% of
v_free_min, and compares it against the smaller v_free pool.

vm_paging_needed() compares the total free pool (v_free + v_cache) against
the smaller wakeup threshold - 110% of v_free_min.

Comparing a larger value against a smaller target than the previous test
will never succeed unless you manually change the arc_free_target sysctl.

Also, what about the magic numbers here:

u_int zfs_arc_free_target = (1 << 19); /* default before pagedaemon init only */

That's half a million pages, or 2GB of physical ram on a 4K page size
system. How is this going to work on early boot in the machines in the
cluster with less than 2GB of ram?

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ...
just won’t do…
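
The threshold argument in the message can be checked with a small userland model. This is only a sketch, not kernel code: struct vm_model and the model_* functions are hypothetical names introduced for illustration. It assumes, as the message states, that kmem_free_count() reflects v_free alone, that zfs_arc_free_target is left at its default of v_free_target, and that vm_paging_needed() compares v_free + v_cache against vm_pageout_wakeup_thresh.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userland model of the counters named above (units: pages). */
struct vm_model {
	uint64_t v_free;          /* pages on the free queue */
	uint64_t v_cache;         /* pages on the cache queue */
	uint64_t v_free_min;
	uint64_t v_free_reserved;
};

/* Assumed default of zfs_arc_free_target: the quoted v_free_target formula. */
static uint64_t
model_free_target(const struct vm_model *vm)
{
	return (4 * vm->v_free_min + vm->v_free_reserved);
}

/* The quoted vm_pageout_wakeup_thresh formula (integer division, as in C). */
static uint64_t
model_wakeup_thresh(const struct vm_model *vm)
{
	return ((vm->v_free_min / 10) * 11);
}

/*
 * Which test of the quoted excerpt fires first:
 * 1 = v_free test, 2 = paging-needed test, 0 = neither.
 */
static int
model_reclaim_reason(const struct vm_model *vm)
{
	if (vm->v_free < model_free_target(vm))
		return (1);
	if (vm->v_free + vm->v_cache < model_wakeup_thresh(vm))
		return (2);
	return (0);
}

/* Sweep all counter values below limit; count states where only test 2 fires. */
static unsigned
model_count_second_only(uint64_t limit)
{
	struct vm_model vm;
	unsigned hits = 0;

	for (vm.v_free_min = 0; vm.v_free_min < limit; vm.v_free_min++)
		for (vm.v_free_reserved = 0; vm.v_free_reserved < limit;
		    vm.v_free_reserved++)
			for (vm.v_free = 0; vm.v_free < limit; vm.v_free++)
				for (vm.v_cache = 0; vm.v_cache < limit;
				    vm.v_cache++)
					if (model_reclaim_reason(&vm) == 2)
						hits++;
	return (hits);
}

/* Bytes represented by the early-boot default of (1 << 19) pages. */
static uint64_t
model_early_default_bytes(uint64_t page_size)
{
	assert(page_size > 0);
	return (((uint64_t)1 << 19) * page_size);
}
```

Under these assumptions, a sweep such as model_count_second_only(24) should find no state in which only the second test fires, because (v_free_min / 10) * 11 never exceeds 4 * v_free_min + v_free_reserved and v_free never exceeds v_free + v_cache. model_early_default_bytes(4096) gives 2^31 bytes, which is the 2GB figure in the message.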