Date: Sun, 5 Mar 2023 17:27:16 +0000
From: David Chisnall <theraven@FreeBSD.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>
Subject: Re: CFT: snmalloc as libc malloc
Message-ID: <D7199237-B252-4172-B2C3-C111DFA4FC57@FreeBSD.org>
In-Reply-To: <CAGudoHG3m1OixvuYGMGC_9fFBoOEXD-feUB2BUiTA66u0mdSfw@mail.gmail.com>
References: <2f3dcda0-5135-290a-2dff-683b2e9fe271@FreeBSD.org>
 <CAGudoHFYMLk6EDrSxLiWFNBoYyTKXfHLAUhZC+RF4eUE-rip8Q@mail.gmail.com>
 <E140A3A2-5C4A-4458-B365-AD693AB853E8@FreeBSD.org>
 <CAGudoHFeFm1K+JSBXaxt2SNTv-tGFgT0onyErOCmBdVjmHaxUg@mail.gmail.com>
 <2836EF0C-8B4B-441C-927B-3796457A3865@FreeBSD.org>
 <CAGudoHG3m1OixvuYGMGC_9fFBoOEXD-feUB2BUiTA66u0mdSfw@mail.gmail.com>
Hi,

Thanks for the comments.

The FlagLock is used in two places:

 - Early init, before anything in libc can be guaranteed to exist.
 - Some *very* slow paths, where contention is incredibly improbable.

We could replace them with _umtx_op / futex locks, but I doubt you'd see
a difference. Do you have any workloads where the locks introduce
contention? The locks are all abstracted away in a single place, so it
would be fairly simple to add a couple of lock functions to the PAL; a
sketch of the kind of thing I mean is at the bottom of this message.
Probably a 10-line change in total, but since it hasn't shown up in any
benchmarking that we've done yet it hasn't been a priority.

With respect to NUMA, if a thread is pinned to a core and the kernel is
handling its local memory, then you should see good performance
initially. All allocations are satisfied from a pool that is local to
the core. If they are freed in other threads then they are batched and
returned to the allocating thread. It is possible for chunks that were
originally created on one core to be recycled and used on another core.
This happens at greater-than-page granularity and so is better handled
in the kernel.

If you have a benchmark for memcpy that shows a big slowdown with the
bounds checking, I would like to see it so that we can use it to
optimise the implementation. 12-byte memcpys are common, but they also
fit well in the store queues, overlapped with the bounds checks. I'd be
*very* interested in workloads where this is not the case.

I broke my shoulder a week ago and so am typing very slowly with one
hand, but patches are very welcome. There is a PR under review at the
moment to add memmove, which could benefit from some more tuning before
it's merged.
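For reference, the rough shape of the umtx-backed lock I have in mind,
sketched in C (untested and written from memory of the _umtx_op(2)
interface; the type and function names are made up and this is not what
is in the tree):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/umtx.h>

/*
 * Futex-style lock: 0 = unlocked, 1 = locked, 2 = locked with waiters.
 * Contended waiters park in the kernel via _umtx_op() instead of
 * spinning or yield()ing, so a preempted lock holder cannot starve them.
 */
typedef struct {
    _Atomic uint32_t state;
} umtx_lock_t;

static void
umtx_lock(umtx_lock_t *l)
{
    uint32_t old = 0;

    if (atomic_compare_exchange_strong(&l->state, &old, 1))
        return;                 /* uncontended fast path, no syscall */
    if (old != 2)
        old = atomic_exchange(&l->state, 2);
    while (old != 0) {
        /* Sleep while the lock word still reads "locked with waiters". */
        _umtx_op(&l->state, UMTX_OP_WAIT_UINT_PRIVATE, 2, NULL, NULL);
        old = atomic_exchange(&l->state, 2);
    }
}

static void
umtx_unlock(umtx_lock_t *l)
{
    if (atomic_exchange(&l->state, 0) == 2)
        _umtx_op(&l->state, UMTX_OP_WAKE_PRIVATE, 1, NULL, NULL);
}

Early init would still use the existing flag lock; whether swapping the
slow-path locks over is worth doing depends on finding a workload where
they actually contend.
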
David

> On 5 Mar 2023, at 16:47, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Apologies for the significant delay; other stuff came up and this fell
> through the cracks.
>
> I'm going to start with the snmalloc remarks; the memcpy-related
> response is down below.
>
> I find it quite weird that there is no mention of NUMA at all in a
> general-purpose allocator written in recent years. While nothing can be
> done for an application which does not use any affinity, what *can* be
> done is to avoid negatively affecting programs which *do* use it --
> most notably, making sure that allocations backed by one page don't
> land in threads from distinct domains. It is unclear to me what, if
> anything, the allocator is doing on this front, but I also don't know
> whether this would be a regression vs jemalloc.
>
> One major issue I found is the use of hand-rolled spinlocks -- while
> they may be fine for specialized uses, they are a non-starter for
> general-purpose use. This is the age-old priority problem: you can
> yield() all you want while the lock owner is nowhere near a CPU, or the
> scheduler does not think you should give way to it. The least bad thing
> to do would be to add umtx support to remedy this, but that's a lot of
> work and probably not warranted. The fallback is then pthread locks --
> while pessimal, they don't suffer from the problem. That the lock
> operation is not on the fast path does not make the problem go away.
>
> I tried to do a sensible benchmark of the thing, but ran into snags
> that need sorting out first. The idea was to 'tinderbox' it, except:
> 1. make's use of poll is a massive scalability bottleneck, addressed
> here in an experimental manner:
> https://github.com/macdice/freebsd/tree/bmake-shmem
> 2. with that out of the way there is still massive contention in the vm
> subsystem. I had a patch for most of it here:
> https://reviews.freebsd.org/D36038 . It rebases cleanly but no longer
> works; I don't know why yet.
>
> There are other issues.
>
> Ultimately, if snmalloc is to ship it would be nice to have it for 14,
> and we are nearing the point where it will be too late, so I'll try to
> sort it all out next week.
>
> Meanwhile there is the lock issue to sort out.
>
> On to memcpy.
>
> On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
>> On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> First and foremost, perhaps I should clear up that I have no opinion
>>> on snmalloc or on it replacing jemalloc. I'm here only about the
>>> memcpy thing.
>>>
>>> Inspecting the assembly is what I intended to do, but I ran into the
>>> compilation problem.
>>
>> If you just want to look at the memcpy, you might do better using the
>> upstream (CMake) build system, which builds a shim library that you
>> can disassemble and look at.
>>
>>> So, as someone who worked on memcpy previously, I note the variant
>>> currently implemented in libc is pessimal for sizes > 16 bytes
>>> because it does not use SIMD. I do have plans to rectify that, but
>>> ENOTIME.
>>
>> That's one of the benefits of the C++ implementation. We were able to
>> try a few dozen variations in a morning. Writing a single one of those
>> in assembly would take a similar amount of time.
>>
>> We were inspired by the automemcpy paper, which demonstrated that you
>> can write memcpy implementations tuned to specific workloads in
>> higher-level languages and get better performance.
>>
>>> I also note that CPUs are incredibly picky when it comes to the
>>> starting point of the routine. The officially recommended alignment
>>> of 16 bytes is basically a tradeoff between total binary size and
>>> performance. If you move the routine around at different 16-byte
>>> offsets you will easily find big variations in performance, depending
>>> on how large an alignment you ended up with.
>>
>> Yes, that's what our test does. It does a large number of different
>> copies with different sizes and alignments.
>>
>
> I'm not talking about the placement of the buffers passed to memcpy,
> I'm talking about the placement of the *code implementing memcpy*.
>
>>> I'm trying to say that if you end up with different funcs depending
>>> on bounds checking and you only align them to 16 bytes, the benchmark
>>> is most likely inaccurate, if only for this reason.
>>
>> I'm not sure I follow this, sorry.
>>
>
> If the address of the *func itself* is divisible by 16 but not 32, it
> may end up with a different performance profile depending on the uarch.
> The code may also randomly move up and down as you change the source,
> so stabilizing it at some value (say 32) would be the best thing to do.
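>
> For illustration, the kind of thing I mean (a sketch; the function name
> is made up, and the same effect can be had for every function with
> -falign-functions=32):
>
> #include <stddef.h>
>
> /*
>  * Illustrative only: pin the first instruction of the routine to a
>  * 32-byte boundary so results don't shift when unrelated code moves.
>  */
> __attribute__((aligned(32)))
> void *memcpy_variant(void *dst, const void *src, size_t len)
> {
>     return __builtin_memcpy(dst, src, len); /* stand-in for the real body */
> }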
>
>>>> The fastest on x86 is roughly:
>>>>
>>>> - A jump table for small sizes that does power-of-two-sized small
>>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>>
>>> Last I checked this was not true. The least slow thing to do was to
>>> branch on a few sizes and handle them with overlapping stores, roughly
>>> what the memcpy in libc is doing now.
>>>
>>> Jump table aside, you get all the sizes "spelled out", which is just
>>> avoidable impact on the icache. While overlapping stores do come with
>>> some penalty, it is cheaper than the above combined.
>>
>> They're not all spelled out, because a lot of them are subsets of the
>> others. The compiler does a *very* good job of optimising this. The
>> generated assembly was a lot more complex than anything I could write
>> by hand.
>>
>
> I said they are spelled out in the code which ends up being generated;
> it looks like this:
> [snip]
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> pop    %rbp
> ret
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> movzbl 0x8(%rsi),%eax
> mov    %al,0x8(%rdi)
> pop    %rbp
> ret
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> movzwl 0x8(%rsi),%eax
> mov    %ax,0x8(%rdi)
> pop    %rbp
> ret
> [/snip]
>
> That section is *massive* and it is pessimal. Once more, the thing to
> do is a few branches that land in sections handling a whole size range
> with overlapping stores, akin to what bionic does here:
> https://android.googlesource.com/platform/bionic/+/refs/heads/master/libc/arch-x86_64/string/sse2-memmove-slm.S
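>
> Concretely, for one of the small size classes the shape is something
> like this (a simplified sketch, not the bionic code; names are made up
> and the caller is assumed to have already branched on n being in the
> 8..16 byte range):
>
> #include <stdint.h>
> #include <string.h>
>
> /*
>  * One branch covers nine sizes: copy 8..16 bytes with two possibly
>  * overlapping 8-byte loads and stores instead of a jump-table entry
>  * per size.
>  */
> static void
> copy_8_to_16(void *dst, const void *src, size_t n)
> {
>     uint64_t head, tail;
>
>     memcpy(&head, src, sizeof(head));                       /* bytes [0, 8)   */
>     memcpy(&tail, (const char *)src + n - 8, sizeof(tail)); /* bytes [n-8, n) */
>     memcpy(dst, &head, sizeof(head));
>     memcpy((char *)dst + n - 8, &tail, sizeof(tail));
> }
>
> The same trick with 16-byte SSE loads and stores covers 16..32, and so
> on: the number of branches grows with the number of ranges, not with
> the number of byte counts.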
>
>>> I don't have numbers nor bench code handy, but if you insist on
>>> contesting the above I'll be glad to provide them.
>>>
>>>> - A vectorised copy for medium-sized copies using a loop of SSE
>>>> copies.
>>>
>>> Depends on what you mean by medium and which SIMD instructions, but
>>> yes, they should be used here.
>>
>> This could probably do with more tuning, but all of this is
>> configurable with a couple of template parameters. If it's useful to
>> have more variants for different microarchitectures then it's a
>> trivial tweak to generate them.
>>
>>>> - rep movsb for large copies.
>>>
>>> There are still old CPUs here and there which don't support ERMS.
>>> They are very negatively impacted by the above and should roll with
>>> rep movsq instead.
>>
>> We tested on some microarchitectures without FSRM but didn't compare
>> with rep movsq as an alternative. The rep movsb is inline assembly
>> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
>> so it would be quite easy to compare. Very happy to take patches that
>> make things faster!
>>
>> Each tuned version is a dozen lines of code (plus a lot of comments to
>> explain *why* it does things a particular way), so it's very easy to
>> add different versions with different tuning.
>>
>>> I see the code takes some care to align the target buffer. That's
>>> good, but not necessary on CPUs with FSRM.
>>
>> It doesn't hurt, though; on the sizes where it's actually beneficial
>> to use the rep sequence, the cost of alignment is negligible.
>>
>>> All that said, I have a hard time believing the impact of bounds
>>> checking on short copies is around 20% or so. The code to do it looks
>>> super slow (not that I know how to do better).
>>
>> The last time I looked at the disassembly, the code for the bounds
>> check touched three cache lines and had two branches. It fits well in
>> a superscalar pipeline, so it adds a handful of cycles. This is
>> basically in the noise for large copies but is a double-digit
>> percentage overhead for small ones.
>>
>
> I'll be blunt here: that's misleading at best and intellectually
> dishonest at worst.
>
> By your own admission very small copies are the most common, and the
> very paper you refer to (automemcpy) gives a figure of 96% of
> real-world copies being 128 bytes or less. These are also the copies
> which are most affected.
>
> As I tried to explain, super short copies have very little work to do,
> so the impact stated above can't be an accurate estimate on top of a
> near-optimal routine.
>
> Your benchmark code artificially pessimizes the comparison with
> whatever is in libc, because those calls go through the PLT while the
> calls to your own routine don't. The difference does matter for short
> copies.
>
> I ran a quick check against glibc as shipped with Ubuntu 22. Your
> routine was massively slower for sizes < 12 and then started evening
> out with glibc; after that it went back and forth, likely due to
> measurement error.
>
>> My original version extended the FreeBSD assembly memcpy with a call
>> to a function that did the bounds checks, but that had *much* worse
>> performance. We could insert assembly to do the bounds checks, but
>> that's hard to update if we change the metadata layout in malloc.
>>
>
> I'm not saying the variant as shipped now is any good or should be
> used, I'm saying overlapping stores need to be used as opposed to a
> jump table. The current code should be whacked.
>
> Even then, of course it had to get slower -- you added a function call
> in there.
>
>> We could modify the ifunc resolvers to use the current memcpy if you
>> don't care about bounds checks. I did a tiny amount of testing (mostly
>> clang and lld) in this configuration and it was slower than using the
>> snmalloc memcpy, but I didn't do enough benchmarking to be confident
>> of that. Hence the CFT.
>>
>
> Once more, see the previous remarks about the function currently
> implemented not being fast.
>
> The thing *to do* in this area is to implement a sensible memcpy; the
> automemcpy paper may well be a good basis for it.
>
> --
> Mateusz Guzik <mjguzik gmail.com>