Date: Wed, 22 May 2019 10:47:00 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-fs@freebsd.org
Subject: Re: Commit r345200 (new ARC reclamation threads) looks suspicious to me - second potential problem
Message-ID: <89064e9c-251a-a065-3a72-ac65c884d51d@denninger.net>
In-Reply-To: <28c7430b-fb7c-6472-5c1b-fa3ff63a9e73@FreeBSD.org>
References: <369cb1e9-f36a-a558-6941-23b9b811825a@FreeBSD.org> <20190520164202.GA2130@spy> <28c7430b-fb7c-6472-5c1b-fa3ff63a9e73@FreeBSD.org>
On 5/22/2019 10:19 AM, Alexander Motin wrote:
> On 20.05.2019 12:42, Mark Johnston wrote:
>> On Mon, May 20, 2019 at 07:05:07PM +0300, Lev Serebryakov wrote:
>>> I'm looking at the last commit to
>>> 'sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c' (r345200) and
>>> have another question.
>>>
>>> Here is the code:
>>>
>>> 	/*
>>> 	 * Kick off asynchronous kmem_reap()'s of all our caches.
>>> 	 */
>>> 	arc_kmem_reap_soon();
>>>
>>> 	/*
>>> 	 * Wait at least arc_kmem_cache_reap_retry_ms between
>>> 	 * arc_kmem_reap_soon() calls. Without this check it is possible to
>>> 	 * end up in a situation where we spend lots of time reaping
>>> 	 * caches, while we're near arc_c_min. Waiting here also gives the
>>> 	 * subsequent free memory check a chance of finding that the
>>> 	 * asynchronous reap has already freed enough memory, and we don't
>>> 	 * need to call arc_reduce_target_size().
>>> 	 */
>>> 	delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000);
>>>
>>>
>>> But it looks like `arc_kmem_reap_soon()` is synchronous on FreeBSD!
>>> So this `delay()` looks very wrong. Am I right?
> Why is it wrong?
>
>>> Looks like it should be `#ifdef illumos`.
>> See also r338142, which I believe was reverted by the update.
> My r345200 indeed reverted that value, but I don't see a problem there.
> When the OS needs more RAM, the pagedaemon will drain UMA caches by
> itself. I don't see a point in re-draining UMA caches in a tight loop
> without delay. If caches are not sufficient to sustain one second of
> workload, then the usefulness of such caches is not very clear, and
> shrinking ARC to free some space may be the right move. Also, making
> ZFS drain its caches more actively than anything else in the system
> looks unfair to me.
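
For reference, the delay() in the quoted hunk converts the millisecond
retry interval into scheduler ticks, rounding up so that a nonzero
interval never truncates to a zero-tick sleep. A minimal userland
sketch of just that arithmetic (the hz value, the assumed 1 s default,
and the ms_to_ticks name are illustrative, not kernel API):

    #include <stdio.h>

    static int hz = 1000;                           /* kern.hz, often 1000 */
    static int arc_kmem_cache_reap_retry_ms = 1000; /* assumed 1 s interval */

    /* Same expression as the quoted delay() argument: rounds up. */
    static int
    ms_to_ticks(int ms)
    {
            return ((hz * ms + 999) / 1000);
    }

    int
    main(void)
    {
            printf("%d ms -> %d ticks at hz=%d\n",
                arc_kmem_cache_reap_retry_ms,
                ms_to_ticks(arc_kmem_cache_reap_retry_ms), hz);
            return (0);
    }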
There is a long-standing pathology in the older implementation. The
short answer is that memory held in UMA caches but not allocated to the
current working set is completely wasted unless it is quickly re-used.
A small buffer between current use and total allocation is fine, but
the UMA system will leave large amounts outstanding and unused.
Reclaiming that memory after a reasonable amount of time is a very good
thing.
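
As a sketch of that policy (purely illustrative, not the actual UMA
code; all names here are made up): keep recently freed items around
for quick re-use, but reclaim anything that has sat idle past an age
threshold.

    #include <stdlib.h>
    #include <time.h>

    struct cached_item {
            struct cached_item *next;
            time_t freed_at;        /* when the item went back on the list */
    };

    static struct cached_item *free_list;

    /* Release items idle longer than max_idle seconds; keep the rest. */
    static void
    trim_idle(time_t now, time_t max_idle)
    {
            struct cached_item **pp = &free_list;

            while (*pp != NULL) {
                    if (now - (*pp)->freed_at > max_idle) {
                            struct cached_item *victim = *pp;

                            *pp = victim->next;
                            free(victim);
                    } else {
                            pp = &(*pp)->next;
                    }
            }
    }

    int
    main(void)
    {
            trim_idle(time(NULL), 5);   /* reclaim items idle > 5 seconds */
            return (0);
    }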
The other problem is that disk cache should NEVER be preferred over
working-set space. It is always wrong to do so, because paging out a
working-set page is 1 *guaranteed* I/O (to page it out) and possibly 2
I/Os (if the page is needed again and thus must be paged back in),
while a disk cache page is 1 *possible* I/O avoided (if the cached
block is requested again).
It is never the right move to intentionally take an I/O in order to
avoid a *possible* I/O. Under certain workloads that choice leads to
severe pathological behavior (~30-second "pauses" where the system is
doing I/O like crazy but a desired process -- such as a database or a
shell -- does nothing, waiting for its working set to be paged back in)
when there are gigabytes (or tens of gigabytes) of ARC outstanding.
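
To put rough numbers on that argument, here is a back-of-the-envelope
model with made-up probabilities, purely for illustration (nothing here
comes from the kernel):

    #include <stdio.h>

    int
    main(void)
    {
            double p_refault = 0.9;   /* working-set page: likely touched again */
            double p_cache_hit = 0.3; /* cached block: maybe read again */

            /* Evicting working set: one guaranteed write plus a likely read. */
            double cost_evict_ws = 1.0 + p_refault;
            /* Keeping a cache page saves at most one possible read. */
            double saved_by_cache = p_cache_hit;

            printf("evicting working set costs %.1f expected I/Os\n",
                cost_evict_ws);
            printf("keeping a cache page saves %.1f expected I/Os\n",
                saved_by_cache);
            return (0);
    }

With these assumed numbers the trade is 1.9 expected I/Os spent against
0.3 saved, which is the pathology described above.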
--
-- Karl Denninger
/The Market-Ticker/
S/MIME Email accepted and preferred
