Date: Thu, 09 Apr 2015 09:20:40 -0500
From: Karl Denninger <>
To:
Subject: Re: FreeBSD/ZFS on [HEAD] chews up memory
Message-ID: <>
In-Reply-To: <>
References: <> <> <>

On 4/9/2015 08:53, Mark Martinec wrote:
> 2015-04-09 15:19, Bob Friesenhahn wrote:
>> On Thu, 9 Apr 2015, grarpamp wrote:
>>>> RAM amount might matter too. 12GB vs 32GB is a bit of a difference.
>>> Allow me to bitch hypothetically...
>>> We, and I, get that some FS need memory, just like kernel and
>>> userspace need memory to function. But to be honest, things
>>> should fail or slow gracefully. Why in the world, regardless of
>>> directory size, should I ever need to feed ZFS 10GB of RAM?
>>
>> From my reading of this list in the past month or so, I have seen
>> other complaints about memory usage, but also regarding UFS and NFS
>> and not just ZFS. One is lead to think that the way the system uses
>> memory for filesystems has changed.
>>
>> As others have said, ZFS ARC should automatically diminish, but
>> perhaps ZFS ARC is not responsible for the observed memory issues.
>>
>> Bob
>
> I'd really like to see the:
>
> [Bug 187594] [zfs] [patch] ZFS ARC behavior problem and fix
>
> find its way into 10-STABLE. Things behaved much more
> sanely some time in 9.*, before the great UMA change
> took place. Not everyone has dozens of gigabytes of memory.
> With 16 GB mem even when memory is tight (poudriere build),
> the wired size seems excessive, most of which is ARC.
>

There are a number of intertwined issues related to how the VM system
interacts with ZFS' use of memory for ARC; the patch listed above IMHO
resolves most -- but not all -- of them.

The one big one remaining, which I do not have a patch to fix at
present, is the dmu_tx write cache (exposed in sysctl as
vfs.zfs.dirty_data_max*). It is sized based on available RAM at boot,
with both a minimum and a maximum size, and is shared across all pools.
It initializes at boot to allow up to 10% of RAM to be used for this,
with a cap of 4GB. That can be a problem because on a machine with a
moderately large amount of RAM and spinning rust it is entirely possible
for that write cache to represent *tens of seconds or even more than a
minute* of actual I/O time to flush. (The maximum full-track sequential
I/O speed of a 7200RPM 4TB drive is in the ~200MB/sec range; 10% of 32GB
is roughly 3GB, so this is ~15 seconds of time in a typical 4-unit
RaidZ2 vdev -- and it gets worse, much worse, with smaller-capacity
disks that have less areal density under the head and thus are slower
due to the basic physics of the matter.) The write cache is a very good
thing for performance in most circumstances because it allows ZFS to
optimize writes to minimize the number of seeks and the latency
required, but there are some pathological cases where having it too
large is very bad for performance.

Specifically, it becomes a problem when the operation you wish to
perform on the filesystem requires coherency with something *in* that
cache, and thus the cache must flush and complete before that operation
can succeed. This manifests as you doing something as benign as typing
"vi some-file" and your terminal session locking up for tens of seconds
to, in some cases, more than a minute!

If *all* the disks on your machine are of a given type and reasonably
consistent in I/O throughput (e.g. all SSDs, all rotating rust of the
same approximate size and throughput, etc.) then you can tune this as
the code stands to get good performance and avoid the problem. But if
you have some volumes composed of high-performance SSD storage (say, for
often-modified or accessed database tables) and other volumes composed
of high-capacity spinning rust (because SSD for storage of that data
makes no economic sense) then you've got a problem, because
dirty_data_max is system-wide and not per-pool.
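
The arithmetic above can be checked with a short sketch. This is only an illustration, not ZFS code: the 10% figure and the 4GB cap come from the description above, the minimum size is not modeled, binary units are assumed, and the ~200MB/sec figure is taken as the effective write throughput of the whole pool. All function names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def default_dirty_data_max(ram_bytes, percent=10, cap_bytes=4 * GIB):
    """Boot-time default as described above: a share of RAM, capped.

    The minimum size mentioned in the text is not modeled here.
    """
    return min(ram_bytes * percent // 100, cap_bytes)


def flush_seconds(dirty_bytes, write_bytes_per_sec):
    """Worst-case time to drain a full write cache at a given throughput."""
    return dirty_bytes / write_bytes_per_sec


def dirty_max_for_latency(write_bytes_per_sec, max_seconds):
    """Largest cache that can still be flushed within max_seconds."""
    return int(write_bytes_per_sec * max_seconds)


if __name__ == "__main__":
    cache = default_dirty_data_max(32 * GIB)
    # Assumed effective pool throughput: the ~200MB/sec figure quoted above.
    seconds = flush_seconds(cache, 200 * MIB)
    print(f"default cache: {cache / GIB:.1f} GiB, full flush: {seconds:.0f} s")
    # Candidate limit if a 2 second stall is the most you will tolerate.
    print(f"2 s budget: {dirty_max_for_latency(200 * MIB, 2) // MIB} MiB")
```

On a machine where every pool has similar throughput, the value from `dirty_max_for_latency` is the kind of number one might consider for vfs.zfs.dirty_data_max. With mixed SSD and spinning-rust pools no single value fits both, which is the problem described above.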

The irony is that with the patch I developed, the pathology tends not
to happen under heavy load, because the dmu_tx cache gets cut back
automatically under heavy load as part of the UMA reuse mitigation
strategy that I implemented in that patch. But under light load it
still can and sometimes does bite you. The best (and I argue proper)
means of eliminating that is for the dmu_tx cache to be sized per-pool
and to be computed based on the pool's actual write I/O performance; in
other words, it should be sized to represent a maximum
latency-to-coherence time that is acceptable (and that should be
tunable). Doing so appears to be quite non-trivial, though, or I would
have already taken it on and addressed it.

-- 
Karl Denninger <>
/The Market Ticker/
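
A minimal sketch of the per-pool sizing idea proposed above, for illustration only. Nothing here exists in ZFS: the function name, the floor and ceiling values, and the idea of passing in measured throughput as a plain dictionary are all assumptions. How to measure a pool's actual write performance is the hard part and is not addressed.

```python
MIB = 1024 ** 2


def per_pool_dirty_max(pool_write_rates, max_latency_s,
                       floor_bytes=64 * MIB, ceiling_bytes=4096 * MIB):
    """Size each pool's write cache from its own write throughput.

    pool_write_rates maps a pool name to its measured write throughput
    in bytes per second (hypothetical input). max_latency_s is the
    tunable latency-to-coherence target. The floor and ceiling are
    illustrative values, not ZFS defaults.
    """
    limits = {}
    for name, rate in pool_write_rates.items():
        target = int(rate * max_latency_s)
        # Clamp so a very slow or very fast pool stays within sane bounds.
        limits[name] = max(floor_bytes, min(target, ceiling_bytes))
    return limits


if __name__ == "__main__":
    rates = {"ssd": 1000 * MIB, "rust": 150 * MIB}
    for pool, limit in per_pool_dirty_max(rates, 2.0).items():
        print(f"{pool}: {limit // MIB} MiB")
```

With a shared 2 second target the fast pool gets a much larger cache than the slow one, so neither can stall a dependent operation for much longer than the target.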
