Date: Mon, 22 Apr 2024 10:46:01 -0600
From: Alan Somers <asomers@freebsd.org>
Cc: Mark Johnston <markj@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: Stressing malloc(9)
Message-ID: <CAOtMX2hDfX-T90x9Fb2Wh%2BvgLvw9fUGmaDxh-FWaYwBTPwFY6Q@mail.gmail.com>
In-Reply-To: <CAOtMX2j=yaYeE%2B-fycg2mRRC_Jb9p74cn_dcenhH2xRRxz1shg@mail.gmail.com>
References: <CAOtMX2jeDHS15bGgzD89AOAd1SzS_=FikorkCdv9-eAxCZ2P5w@mail.gmail.com>
 <ZiPaFw0q17RGE7cS@nuc>
 <CAOtMX2jk6%2BSvqMP7Cbmdk0KQCFZ34yWuir7n_8ewZYJF2MwPSg@mail.gmail.com>
 <ZiU6IZ29syVsg61p@nuc>
 <CAOtMX2j=yaYeE%2B-fycg2mRRC_Jb9p74cn_dcenhH2xRRxz1shg@mail.gmail.com>
On Sun, Apr 21, 2024 at 5:47 PM Alan Somers <asomers@freebsd.org> wrote:
>
> On Sun, Apr 21, 2024 at 10:09 AM Mark Johnston <markj@freebsd.org> wrote:
> >
> > On Sat, Apr 20, 2024 at 11:23:41AM -0600, Alan Somers wrote:
> > > On Sat, Apr 20, 2024 at 9:07 AM Mark Johnston <markj@freebsd.org> wrote:
> > > >
> > > > On Fri, Apr 19, 2024 at 04:23:51PM -0600, Alan Somers wrote:
> > > > > TLDR;
> > > > > How can I create a workload that causes malloc(9)'s performance to plummet?
> > > > >
> > > > > Background:
> > > > > I recently witnessed a performance problem on a production server.
> > > > > Overall throughput dropped by over 30x. dtrace showed that 60% of the
> > > > > CPU time was dominated by lock_delay as called by three functions:
> > > > > printf (via ctl_worker_thread), g_eli_alloc_data, and
> > > > > g_eli_write_done. One thing those three have in common is that they
> > > > > all use malloc(9). Fixing the problem was as simple as telling CTL to
> > > > > stop printing so many warnings, by tuning
> > > > > kern.cam.ctl.time_io_secs=100000.
> > > > >
> > > > > But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
> > > > > in lock_delay via g_eli_alloc_data. So I believe that malloc is
> > > > > limiting geli's performance. I would like to try replacing it with
> > > > > uma(9).
> > > >
> > > > What is the size of the allocations that g_eli_alloc_data() is doing?
> > > > malloc() is a pretty thin layer over UMA for allocations <= 64KB.
> > > > Larger allocations are handled by a different path (malloc_large())
> > > > which goes directly to the kmem_* allocator functions. Those functions
> > > > are very expensive: they're serialized by global locks and need to
> > > > update the pmap (and perform TLB shootdowns when memory is freed).
> > > > They're not meant to be used at a high rate.
> > >
> > > In my benchmarks so far, 512B. In the real application the size is
> > > mostly between 4k and 16k, and it's always a multiple of 4k. But it's
> > > sometimes great enough to use malloc_large, and it's those
> > > malloc_large calls that account for the majority of the time spent in
> > > g_eli_alloc_data. lockstat shows that malloc_large, as called by
> > > g_eli_alloc_data, sometimes blocks for multiple ms.
> > >
> > > But oddly, if I change the parameters so that g_eli_alloc_data
> > > allocates 128kB, I still don't see malloc_large getting called. And
> > > both dtrace and vmstat show that malloc is mostly operating on 512B
> > > allocations. But dtrace does confirm that g_eli_alloc_data is being
> > > called with 128kB arguments. Maybe something is getting inlined?
> >
> > malloc_large() is annotated __noinline, for what it's worth.
> >
> > > I
> > > don't understand how this is happening. I could probably figure out
> > > if I recompile with some extra SDT probes, though.
> >
> > What is g_eli_alloc_sz on your system?
>
> 33kiB. That's larger than I expected. When I use a larger blocksize
> in my benchmark, then I do indeed see malloc_large activity, and 11%
> of the CPU is spent in g_eli_alloc_data.
>
> I guess I'll add some UMA zones for this purpose. I'll try 256k and
> 512k zones, rounding up allocations as necessary. Thanks for the tip.

When I said "33kiB" I meant "33 pages", or 132 kB. And the solution
turns out to be very easy. Since I'm using ZFS on top of geli, with the
default recsize of 128kB, I'll just set
vfs.zfs.vdev.aggregation_limit to 128 kB.
That way geli will never need to allocate more than 128kB contiguously.
ZFS doesn't even need those big allocations to be contiguous; it's just
aggregating smaller operations to reduce disk IOPs. But aggregating up
to 1MB (the default) is overkill; any rotating HDD should easily be able
to max out its sequential write throughput with 128kB operations. I'll
add a read-only sysctl for g_eli_alloc_sz too (rough sketch below).

Thanks Mark.
-Alan
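P.S. For anybody following along who still wants to try the UMA-zone
approach, here is roughly what I had in mind, together with the
read-only sysctl. This is only an untested sketch, not a patch: the
geli_sketch_* names and the 256k/512k bucket sizes are made up for
illustration, and a real change would hook into g_eli_alloc_data() /
g_eli_free_data() rather than standalone helpers. The ZFS tunable
change itself is just "sysctl vfs.zfs.vdev.aggregation_limit=131072".

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>
    #include <sys/sysctl.h>
    #include <vm/uma.h>

    /* Sketch-only malloc type and zones; not the names geli actually uses. */
    static MALLOC_DEFINE(M_GELI_SKETCH, "geli_sketch", "geli sketch buffers");
    static uma_zone_t geli_zone_256k;
    static uma_zone_t geli_zone_512k;

    /* Read-only view of the current cutoff, hanging off kern.geom.eli. */
    SYSCTL_DECL(_kern_geom_eli);
    static u_int geli_sketch_alloc_sz = 132 * 1024; /* stand-in for g_eli_alloc_sz */
    SYSCTL_UINT(_kern_geom_eli, OID_AUTO, sketch_alloc_sz, CTLFLAG_RD,
        &geli_sketch_alloc_sz, 0, "Largest allocation served from the sketch zones");

    static void
    geli_sketch_zones_init(void *unused __unused)
    {
        /* Plain data buffers, so no ctor/dtor/init/fini are needed. */
        geli_zone_256k = uma_zcreate("geli_sketch_256k", 256 * 1024,
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
        geli_zone_512k = uma_zcreate("geli_sketch_512k", 512 * 1024,
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
    }
    SYSINIT(geli_sketch_zones, SI_SUB_DRIVERS, SI_ORDER_ANY,
        geli_sketch_zones_init, NULL);

    /* Round the request up to the nearest zone; punt to malloc(9) beyond 512k. */
    static void *
    geli_sketch_alloc(size_t sz)
    {
        if (sz <= 256 * 1024)
            return (uma_zalloc(geli_zone_256k, M_NOWAIT));
        if (sz <= 512 * 1024)
            return (uma_zalloc(geli_zone_512k, M_NOWAIT));
        return (malloc(sz, M_GELI_SKETCH, M_NOWAIT));
    }

    static void
    geli_sketch_free(void *p, size_t sz)
    {
        /* Caller passes the original request size so we return to the right zone. */
        if (sz <= 256 * 1024)
            uma_zfree(geli_zone_256k, p);
        else if (sz <= 512 * 1024)
            uma_zfree(geli_zone_512k, p);
        else
            free(p, M_GELI_SKETCH);
    }

The obvious downside is that a 132 kB request now pins 256 kB of kernel
memory, which is why just capping the aggregation size is the nicer fix
for my case.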