Date: Thu, 09 Apr 2015 09:20:40 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-fs@freebsd.org
Subject: Re: FreeBSD/ZFS on [HEAD] chews up memory
Message-ID: <55268AB8.8010202@denninger.net>
In-Reply-To: <728627c71bbc88bc9a454eda3370e485@mailbox.ijs.si>
References: <CAD2Ti2_4S_yPgJdKxfb=_eQq5RezSTAa_M0V-EHf=y60k30RBQ@mail.gmail.com> <alpine.GSO.2.01.1504090814560.4186@freddy.simplesystems.org> <728627c71bbc88bc9a454eda3370e485@mailbox.ijs.si>
On 4/9/2015 08:53, Mark Martinec wrote:
> 2015-04-09 15:19, Bob Friesenhahn wrote:
>> On Thu, 9 Apr 2015, grarpamp wrote:
>>>> RAM amount might matter too. 12GB vs 32GB is a bit of a difference.
>>> Allow me to bitch hypothetically...
>>> We, and I, get that some FS need memory, just like kernel and
>>> userspace need memory to function. But to be honest, things
>>> should fail or slow gracefully. Why in the world, regardless of
>>> directory size, should I ever need to feed ZFS 10GB of RAM?
>>
>> From my reading of this list in the past month or so, I have seen
>> other complaints about memory usage, but also regarding UFS and NFS
>> and not just ZFS. One is led to think that the way the system uses
>> memory for filesystems has changed.
>>
>> As others have said, ZFS ARC should automatically diminish, but
>> perhaps ZFS ARC is not responsible for the observed memory issues.
>>
>> Bob
>
> I'd really like to see the:
>
> [Bug 187594] [zfs] [patch] ZFS ARC behavior problem and fix
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
>
> find its way into 10-STABLE. Things behaved much more
> sanely some time in 9.*, before the great UMA change
> took place. Not everyone has dozens of gigabytes of memory.
> With 16 GB mem even when memory is tight (poudriere build),
> the wired size seems excessive, most of which is ARC.

There are a number of intertwined issues related to how the VM system
interacts with ZFS' use of memory for ARC; the patch listed above IMHO
resolves most -- but not all -- of them.

The one big one remaining, which I do not have a patch for at present,
is the dmu_tx write cache (exposed in sysctl as vfs.zfs.dirty_data_max*).
It is sized from available RAM at boot, with both a minimum and a
maximum, and it is shared across all pools.  By default it allows up to
10% of RAM to be used for this purpose, with a cap of 4 GB.

That can be a problem because on a machine with a moderately large
amount of RAM and spinning rust it is entirely possible for that write
cache to represent tens of seconds, or even more than a minute, of
actual I/O time to flush.  (The maximum full-track sequential I/O speed
of a 7200 RPM 4 TB drive is in the ~200 MB/sec range; 10% of 32 GB is
about 3.2 GB, so this is ~15 seconds of time in a typical 4-disk RaidZ2
vdev -- and it gets worse, much worse, with smaller-capacity disks that
have less areal density under the head and are thus slower due to the
basic physics of the matter.)

The write cache is a very good thing for performance in most
circumstances, because it allows ZFS to optimize writes so as to
minimize the number of seeks and the latency they require, but there
are some pathological cases where having it too large is very bad for
performance.  Specifically, it becomes a problem when the operation you
wish to perform on the filesystem requires coherency with something
*in* that cache, so the cache must flush and complete before that
operation can succeed.  This manifests as something as benign as typing
"vi some-file" locking up your terminal session for tens of seconds --
in some cases, more than a minute!

If *all* the disks on your machine are of a given type and reasonably
coherent in I/O throughput (e.g. all SSDs, or all rotating rust of
approximately the same capacity and throughput) then you can tune this
as the code stands to get good performance and avoid the problem.
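To put rough numbers on that sizing, here is a small back-of-the-envelope
sketch in plain C (this is not ZFS code).  The 10%-of-RAM and 4 GB
figures are the defaults discussed above; the 32 GB of RAM and the
~200 MB/sec effective pool write speed are illustrative assumptions
matching the example:

/*
 * Back-of-the-envelope sketch of the default dirty_data_max sizing and
 * the resulting worst-case flush time.  RAM size and effective pool
 * write speed are assumptions, not measured values.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t ram_bytes = 32ULL << 30;       /* assume a 32 GB machine      */
    uint64_t cap_bytes = 4ULL << 30;        /* default upper cap of 4 GB   */
    uint64_t dirty_max = ram_bytes / 10;    /* default: 10% of RAM         */

    if (dirty_max > cap_bytes)
        dirty_max = cap_bytes;

    /*
     * Treat the pool's effective streaming write speed as roughly one
     * 7200 RPM drive (~200 MB/sec), as in the 4-disk RaidZ2 example.
     */
    double pool_mb_s = 200.0;
    double flush_secs = (double)dirty_max / (1024.0 * 1024.0) / pool_mb_s;

    printf("dirty_data_max: %ju MB, worst-case flush: ~%.0f s\n",
        (uintmax_t)(dirty_max >> 20), flush_secs);
    return 0;
}

On those assumptions this prints a ~3.2 GB cache and roughly 16 seconds
to drain it at streaming speed, in line with the ~15 second figure above
(random writes would of course be slower still).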
But if you have some volumes comprised of high-performance SSD storage
(say, for often-modified or frequently-accessed database tables) and
other volumes comprised of high-capacity spinning rust (because SSD for
storage of that data makes no economic sense), then you've got a
problem, because dirty_data_max is system-wide rather than per-pool.

The irony is that with the patch I developed, the pathology tends not
to happen under heavy load, because the dmu_tx cache gets cut back
automatically as part of the UMA reuse mitigation strategy I
implemented in that patch.  But under light load it still can, and
sometimes does, bite you.

The best (and, I argue, proper) way to eliminate this is for the dmu_tx
cache to be sized per-pool and computed from the pool's actual write
I/O performance; in other words, it should be sized to represent a
maximum acceptable latency-to-coherence time (and that should itself be
tunable).  Doing so appears to be quite non-trivial, though, or I would
have already taken it on and addressed it.

-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
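For illustration only, here is a rough sketch of the latency-to-coherence
sizing rule described above.  ZFS has no per-pool dirty_data_max today,
so the function, the constants and the idea of feeding it a measured
per-pool write rate are all hypothetical:

/*
 * Sketch only: ZFS has no per-pool dirty_data_max today.  This merely
 * illustrates the latency-to-coherence sizing rule described above; the
 * names, constants and example write rates are hypothetical.
 */
#include <stdio.h>
#include <stdint.h>

#define POOL_DIRTY_MIN  (64ULL << 20)   /* floor: 64 MB per pool           */
#define POOL_DIRTY_CAP  (4ULL << 30)    /* ceiling: keep the existing 4 GB */

/*
 * write_bytes_per_sec: the pool's observed steady-state write throughput.
 * target_latency_ms:   acceptable worst-case time to drain that pool's
 *                      dirty data (the "latency to coherence" tunable).
 */
static uint64_t
pool_dirty_data_max(uint64_t write_bytes_per_sec, uint64_t target_latency_ms)
{
    uint64_t max = write_bytes_per_sec * target_latency_ms / 1000;

    if (max < POOL_DIRTY_MIN)
        max = POOL_DIRTY_MIN;
    if (max > POOL_DIRTY_CAP)
        max = POOL_DIRTY_CAP;
    return (max);
}

int main(void)
{
    /* ~200 MB/sec spinning-rust RaidZ2 vs. a ~2 GB/sec SSD pool, both
       with a 2-second latency-to-coherence target. */
    printf("rust pool: %ju MB\n",
        (uintmax_t)(pool_dirty_data_max(200ULL << 20, 2000) >> 20));
    printf("ssd  pool: %ju MB\n",
        (uintmax_t)(pool_dirty_data_max(2ULL << 30, 2000) >> 20));
    return 0;
}

With a 2-second target the spinning-rust pool is held to roughly 400 MB
of dirty data while the SSD pool keeps the full 4 GB cap, so a heavy
writer on the rust pool can no longer stall operations there for tens of
seconds.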
