Date: Fri, 14 Mar 2014 06:21:50 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-fs@freebsd.org
Subject: Re: Reoccurring ZFS performance problems [RESOLVED]
Message-ID: <5322E64E.8020009@denninger.net>
In-Reply-To: <5320A0E8.2070406@denninger.net>
References: <531E2406.8010301@denninger.net> <5320A0E8.2070406@denninger.net>

On 3/12/2014 1:01 PM, Karl Denninger wrote:
>
> On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
>> On 10.03.14 18:40, Adrian Gschwend wrote:
>>
>>> It looks like finally my MySQL process finished and now the system is
>>> back to completely fine:
>> ok it doesn't look it's only MySQL, stopped the process a while ago and
>> while it got calmer, I still have the issue.
> ZFS can be convinced to engage in what I can only surmise is
> pathological behavior, and I've seen no fix for it when it happens --
> but there are things you can do to mitigate it.
>
> What IMHO _*should*_ happen is that the ARC cache should shrink as
> necessary to prevent paging, subject to vfs.zfs.arc_min. One
> complication: segments that were paged off hours (or more!) ago and
> never get paged back in, because that particular piece of code never
> executes again. The owning process is still alive, so the system
> cannot reclaim those pages; they show as "committed" in pstat -s,
> but unless they are paged back in they have no impact on system
> performance. The policing here would therefore have to apply a
> "reasonableness" filter to those pages (e.g. if a page has been out
> on the page file for longer than "X", ignore that particular
> allocation unit for this purpose.)
>
> This would cause the ARC cache to flush itself down automatically as
> executable and data segment RAM commitments increase.
>
> The documentation says this is how it should work, but it doesn't
> appear to actually behave that way in practice for many workloads. I
> have seen "wired" RAM pinned at 20GB on one of my
> servers here with a fairly large DBMS running -- with pieces of its
> working set and even a user's shell (!) getting paged off, yet the
> ARC cache is not pared down to release memory. Indeed you can let the
> system run for hours under these conditions and the ARC wired memory
> will not decrease. Cutting back the DBMS's internal buffering does
> not help.
>
> What I've done here is restrict the ARC cache size in an attempt to
> prevent this particular bit of bogosity from biting me, and it appears
> to (sort of) work. Unfortunately you cannot tune this while the
> system is running -- only at boot time, in /boot/loader.conf.
> (Otherwise a user daemon could conceivably slash away at the arc_max
> sysctl and force the deallocation of wired memory whenever it detected
> paging -- or near-paging, such as free memory falling below some
> user-configured threshold.)
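>
> For reference, the cap goes in /boot/loader.conf like so (the 16 GB
> figure is only an example -- size it for your own workload; the value
> is in bytes):
>
> 	# Cap the ZFS ARC at 16 GB (value in bytes)
> 	vfs.zfs.arc_max="17179869184"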
>
> This is something that, should I get myself a nice hunk of free time,
> I may dive into and attempt to fix. It would likely take me quite a
> while to get up to speed on this as I've not gotten into the zfs code
> at all -- and mistakes in there could easily corrupt files.... (in
> other words definitely NOT something to play with on a production
> system!)
>
> I have to assume there's a pretty good reason why you can't change
> arc_max while the system is running; it _*can*_ be changed on a
> running system in some other implementations (e.g. Solaris.) It is
> marked with CTLFLAG_RDTUN in the ARC management file, which prohibits
> run-time changes, and the only place I see it referenced with a quick
> look is in the arc_init code.
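>
> The declaration in question looks roughly like this (paraphrased from
> memory of arc.c; check the source for the exact form):
>
> 	/*
> 	 * CTLFLAG_RDTUN: settable as a loader tunable, but read-only
> 	 * via sysctl(8) once the system is up.
> 	 */
> 	SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN,
> 	    &zfs_arc_max, 0, "Maximum ARC size");
>
> Making it writable would presumably mean switching to a read-write
> flag and adding a handler that validates the new value and kicks off
> reclamation.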
>
> Note that the test in arc.c for "arc_reclaim_needed" appears to be
> pretty basic -- essentially the system will not aggressively try to
> reclaim memory unless used kmem > 3/4 of its size.
>
> (snippet from around line 2494 of arc.c in 10-STABLE; path
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>
> #else	/* !sun */
> 	if (kmem_used() > (kmem_size() * 3) / 4)
> 		return (1);
> #endif	/* sun */
>
> Up above that there's a test for "vm_paging_needed()" that would
> (theoretically) appear to trigger first in these situations, but it
> doesn't in many cases.
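>
> That earlier test is, from memory, essentially just this (again, see
> the actual source for the exact context):
>
> 	if (vm_paging_needed())
> 		return (1);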
>
> IMHO this is too basic a test and leads to pathological situations
> in that the system may wind up paging things off as opposed to paring
> back the ARC cache. As soon as the working set of something that's
> actually getting cycles gets paged out, system performance in most
> cases goes straight into the trash.
>
> On Sun machines (from reading the code) it will allegedly try to pare
> any time the "lotsfree" (plus "needfree" + "extra") amount of free
> memory is invaded.
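>
> That Sun-side branch of arc_reclaim_needed boils down to something
> like this (paraphrased; the variables are the Solaris kernel's page
> accounting globals):
>
> 	if (freemem < lotsfree + needfree + extra)
> 		return (1);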
>
> As an example, this is what a server I own that is exhibiting this
> behavior now shows (values in KB):
>
> 20202500 wire
>  1414052 act
>  2323280 inact
>   110340 cache
>   414484 free
>  1694896 buf
>
> Of that "wired" mem 15.7G of it is ARC cache (with a target of 15.81,
> so it's essentially right up against it.)
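>
> (Those two figures can be read off a running system with, for
> example:
>
> 	sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
>
> the former being the current ARC size in bytes and the latter the
> cap.)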
>
> That "free" number would be ok if it didn't result in the system
> having trashy performance -- but it does on occasion. Incidentally the
> allocated swap is about 195k blocks (~200 Megabytes) which isn't much
> all-in, but it's enough to force actual fetches of recently-used
> programs (e.g. your shell!) from paged-off space. The thing is that if
> the test in the code (75% of kmem available consumed) was looking only
> at "free" the system should be aggressively trying to free up ARC
> cache. It clearly is not; the included code calls this:
>
> uint64_t
> kmem_used(void)
> {
> 	/* Bytes currently allocated out of the kernel's kmem arena. */
> 	return (vmem_size(kmem_arena, VMEM_ALLOC));
> }
>
> I need to dig around and see exactly what that's measuring, because
> what's quite clear is that the system _*thinks*_ it has plenty of
> free memory when it very clearly is essentially out! In fact free
> memory at the moment (~400MB) is 1.7% of the total, _*not*_ 25%.
> From this I surmise that the "vmem_size" call is not returning the
> sum of all the above "in use" sizes (except perhaps "inact"); were it
> to do so, that would be essentially 100% of installed RAM and the ARC
> cache should be actively shrinking, but it clearly is not.
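>
> (Arithmetic check, using the numbers above: 414484 free / (20202500 +
> 1414052 + 2323280 + 110340 + 414484) total = ~1.7%.)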
>
> I'll keep this one on my "to-do" list somewhere and, if I get the
> chance, see if I can come up with a better test. What might be
> interesting is to change the test to "pare if free space minus
> (pagefile space in use plus some modest margin) is less than zero."
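>
> A minimal sketch of that alternative test (freemem_bytes,
> swap_in_use_bytes and RECLAIM_MARGIN are hypothetical stand-ins, not
> existing kernel interfaces):
>
> 	/*
> 	 * Hypothetical: reclaim ARC when free RAM no longer covers
> 	 * what has already been pushed out to swap, plus a margin.
> 	 */
> 	static int
> 	arc_reclaim_needed_alt(void)
> 	{
> 		int64_t slack;
>
> 		slack = (int64_t)freemem_bytes() -
> 		    ((int64_t)swap_in_use_bytes() + RECLAIM_MARGIN);
> 		return (slack < 0);
> 	}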
>
> Fixing this tidbit of code could potentially be pretty significant in
> terms of resolving the occasional but very annoying "freeze" problems
> that people sometimes run into, along with some mildly pathological
> but very significant behavior in terms of how the ARC cache
> auto-scales and its impact on performance. I'm nowhere near
> up-to-speed enough on the internals of the kernel when it comes to
> figuring out what it has committed (e.g. how much swap is out, etc)
> and thus there's going to be a lot of code-reading involved before I
> can attempt something useful.
>
In the context of the above, here's a fix. Enjoy.
http://www.freebsd.org/cgi/query-pr.cgi?pr=187572
> Category: kern
> Responsible: freebsd-bugs
> Synopsis: ZFS ARC cache code does not properly handle low memory
> Arrival-Date: Fri Mar 14 11:20:00 UTC 2014
--
-- Karl
karl@denninger.net