Date: Wed, 12 Mar 2014 13:01:12 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-fs@freebsd.org
Subject: Re: Reoccurring ZFS performance problems [[Possible Analysis]]
Message-ID: <5320A0E8.2070406@denninger.net>
In-Reply-To: <531E2406.8010301@denninger.net>
References: <531E2406.8010301@denninger.net>
On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
> On 10.03.14 18:40, Adrian Gschwend wrote:
>
>> It looks like finally my MySQL process finished and now the system is
>> back to completely fine:
> ok, it doesn't look like it's only MySQL; I stopped the process a
> while ago and while it got calmer, I still have the issue.
ZFS can be convinced to engage in what I can only surmise is
pathological behavior, and I've seen no fix for it when it happens --
but there are things you can do to mitigate it.
What IMHO _*should*_ happen is that the ARC cache shrinks as necessary
to prevent paging, subject to vfs.zfs.arc_min. There is a pathological
case to guard against: segments that were paged out hours (or more!)
ago and never get paged back in, because that particular piece of code
never executes again -- yet the process is still alive, so the system
cannot reclaim the memory. Such pages show as "committed" in pstat -s
but, unless paged back in, have no impact on system performance. To
handle that, the policing would have to apply a "reasonableness"
filter to those pages (e.g. if a page has been out on the page file
for longer than "X", ignore that particular allocation unit for this
purpose.)
This would cause the ARC cache to flush itself down automatically as
executable and data segment RAM commitments increase.
The documentation says this is how it should work, but in practice it
doesn't appear to behave this way for many workloads. I have seen
"wired" RAM pinned at 20GB on one of my servers here with a fairly
large DBMS running -- with pieces of its working set and even a user's
shell (!) getting paged off -- yet the ARC cache is not pared down to
release memory. Indeed, you can let the system run for hours under
these conditions and the ARC wired memory will not decrease. Cutting
back the DBMS's internal buffering does not help.
What I've done here is restrict the ARC cache size in an attempt to
prevent this particular bit of bogosity from biting me, and it appears
to (sort of) work. Unfortunately you cannot tune this while the system
is running (otherwise a user daemon could conceivably slash away at the
arc_max sysctl and force the deallocation of wired memory if it detected
paging -- or near-paging, such as free memory below some user-configured
threshold), only at boot time in /boot/loader.conf.
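Since arc_max can only be set at boot, the cap goes in
/boot/loader.conf; the value below is purely illustrative and has to
be sized for the particular machine's working set:

```
# /boot/loader.conf
# Cap the ARC so it cannot crowd out process working sets.
# "8G" is an example value, not a recommendation.
vfs.zfs.arc_max="8G"
```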
This is something that, should I get myself a nice hunk of free time, I
may dive into and attempt to fix. It would likely take me quite a while
to get up to speed on this as I've not gotten into the zfs code at all
-- and mistakes in there could easily corrupt files.... (in other words
definitely NOT something to play with on a production system!)
I have to assume there's a pretty good reason why you can't change
arc_max while the system is running; it _*can*_ be changed on a
running system on some other implementations (e.g. Solaris.) It is
marked with CTLFLAG_RDTUN in the ARC management file, which prohibits
run-time changes, and the only place a quick look turns up a reference
is in the arc_init code.
Note that the test in arc.c for "arc_reclaim_needed" appears to be
pretty basic -- essentially the system will not aggressively try to
reclaim memory unless used kmem > 3/4 of its size.
(snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
#else	/* !sun */
	if (kmem_used() > (kmem_size() * 3) / 4)
		return (1);
#endif	/* sun */
Up above that there's a test for "vm_paging_needed()" that would
(theoretically) appear to trigger first in these situations, but it
doesn't in many cases.
IMHO this is too basic a test and leads to pathological situations:
the system may wind up paging things out rather than paring back the
ARC cache. As soon as the working set of something that's actually
getting cycles is paged out, in most cases system performance goes
straight into the trash.
On Sun machines (from reading the code) it will allegedly try to pare
back any time the "lotsfree" (plus "needfree" + "extra") amount of
free memory is invaded.
As an example this is what a server I own that is exhibiting this
behavior now shows:
20202500 wire
1414052 act
2323280 inact
110340 cache
414484 free
1694896 buf
Of that "wired" memory, 15.7G is ARC cache (with a target of 15.81G,
so it's essentially right up against it.)
That "free" number would be OK if it didn't come with trashy
performance -- but on occasion it does. Incidentally, the allocated
swap is about 195k blocks (~200 megabytes), which isn't much all-in,
but it's enough to force actual fetches of recently-used programs
(e.g. your shell!) from paged-off space. The thing is, if the test in
the code (75% of available kmem consumed) were effectively looking at
"free", the system should be aggressively trying to release ARC cache.
It clearly is not; the included code calls this:
uint64_t
kmem_used(void)
{
	return (vmem_size(kmem_arena, VMEM_ALLOC));
}
I need to dig around and see exactly what that's measuring, because
what's quite clear is that the system _*thinks*_ it has plenty of free
memory when it very-clearly is essentially out! In fact free memory at
the moment (~400MB) is 1.7% of the total, _*not*_ 25%. From this I
surmise that the "vmem_size" call is not returning the sum of all the
above "in use" sizes (except perhaps "inact"); were it to do so that
would be essentially 100% of installed RAM and the ARC cache should be
actively under shrinkage, but it clearly is not.
I'll keep this one on my "to-do" list somewhere and, if I get the
chance, see if I can come up with a better test. What might be
interesting is to change the test to "pare back if free space, less
(page-file space in use plus some modest margin), goes below zero."
Fixing this tidbit of code could potentially be pretty significant in
terms of resolving the occasional but very annoying "freeze" problems
that people sometimes run into, along with some mildly-pathological but
very-significant behavior in terms of how the ARC cache auto-scales and
its impact on performance. I'm nowhere near up-to-speed enough on the
internals of the kernel when it comes to figuring out what it has
committed (e.g. how much swap is out, etc) and thus there's going to be
a lot of code-reading involved before I can attempt something useful.
--
-- Karl
karl@denninger.net