Date:      Sun, 5 Mar 2023 17:27:16 +0000
From:      David Chisnall <theraven@FreeBSD.org>
To:        Mateusz Guzik <mjguzik@gmail.com>
Cc:        freebsd-hackers <freebsd-hackers@freebsd.org>
Subject:   Re: CFT: snmalloc as libc malloc
Message-ID:  <D7199237-B252-4172-B2C3-C111DFA4FC57@FreeBSD.org>
In-Reply-To: <CAGudoHG3m1OixvuYGMGC_9fFBoOEXD-feUB2BUiTA66u0mdSfw@mail.gmail.com>
References:  <2f3dcda0-5135-290a-2dff-683b2e9fe271@FreeBSD.org> <CAGudoHFYMLk6EDrSxLiWFNBoYyTKXfHLAUhZC+RF4eUE-rip8Q@mail.gmail.com> <E140A3A2-5C4A-4458-B365-AD693AB853E8@FreeBSD.org> <CAGudoHFeFm1K+JSBXaxt2SNTv-tGFgT0onyErOCmBdVjmHaxUg@mail.gmail.com> <2836EF0C-8B4B-441C-927B-3796457A3865@FreeBSD.org> <CAGudoHG3m1OixvuYGMGC_9fFBoOEXD-feUB2BUiTA66u0mdSfw@mail.gmail.com>

Hi,

Thanks for the comments.

The FlagLock is used in two places:

 - Early init, before anything in libc can be guaranteed to exist.
 - Some *very* slow paths, where contention is incredibly improbable.

We could replace them with _umtx_op / futex locks, but I doubt you'd
see a difference.  Do you have any workloads where locks introduce
contention?  The locks are all abstracted away in a single place, so it
would be fairly simple to add a couple of lock functions to the PAL.
Probably a 10-line change in total, but since it hasn't shown up in any
benchmarking that we've done yet it hasn't been a priority.
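
If someone wants to try it, the shape would be the classic three-state
futex lock on top of _umtx_op.  An untested sketch (pal_lock / pal_unlock
are illustrative names, not the actual PAL interface):

#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/umtx.h>

/* Lock word states: 0 = unlocked, 1 = locked, 2 = locked with waiters. */
static void
pal_lock(_Atomic(uint32_t) *lk)
{
	uint32_t expected = 0;

	if (atomic_compare_exchange_strong(lk, &expected, 1))
		return;			/* uncontended fast path */
	/* Mark contended, then sleep until the word drops to 0. */
	while (atomic_exchange(lk, 2) != 0)
		_umtx_op((void *)lk, UMTX_OP_WAIT_UINT_PRIVATE, 2,
		    NULL, NULL);
}

static void
pal_unlock(_Atomic(uint32_t) *lk)
{
	if (atomic_exchange(lk, 0) == 2)	/* somebody is asleep */
		_umtx_op((void *)lk, UMTX_OP_WAKE_PRIVATE, 1, NULL, NULL);
}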

With respect to NUMA, if a thread is pinned to a core and the kernel is
handing it local memory, then you should see good performance
initially.  All allocations are satisfied from a pool that is local to
the core.  If they are freed in other threads then they are batched and
returned to the allocating thread.  It is possible for chunks that are
originally created on one core to be recycled and used on another core.
This will happen at >page granularity and so is better handled in the
kernel.
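
The remote-free path is conceptually just message passing; stripped of
the batching details it amounts to an MPSC push onto the owning
allocator's queue, roughly like this (a generic sketch, not the actual
snmalloc code):

#include <stdatomic.h>

struct remote_free {
	struct remote_free *next;
};

struct owner_queue {
	_Atomic(struct remote_free *) head;
};

/* Foreign thread: push a freed object onto the owner's queue, lock-free. */
static void
remote_free_push(struct owner_queue *q, struct remote_free *obj)
{
	obj->next = atomic_load(&q->head);
	while (!atomic_compare_exchange_weak(&q->head, &obj->next, obj))
		;	/* a failed CAS reloads obj->next; just retry */
}

/* Owning thread: claim the accumulated batch in a single exchange. */
static struct remote_free *
owner_drain(struct owner_queue *q)
{
	return (atomic_exchange(&q->head, NULL));
}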

If you have a benchmark for memcpy that shows a big slowdown with the
bounds checking, I would like to see it so that we can use it to
optimise the implementation.  12-byte memcpys are common, but also fit
well in the store queues overlapped with the bounds checks.  I'd be
*very* interested in workloads where this is not the case.
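
For reference, the kind of harness that would be useful is something
like this (a rough sketch; the volatile function pointer stops the
compiler from inlining the constant-size copy, so the real libc routine
is measured, PLT and all):

#include <stdio.h>
#include <string.h>
#include <time.h>

int
main(void)
{
	static char src[64], dst[64];
	void *(*volatile copy)(void *, const void *, size_t) = memcpy;
	const long iters = 100000000L;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
		copy(dst, src, 12);	/* the common small-copy case */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.2f ns/copy\n",
	    ((t1.tv_sec - t0.tv_sec) * 1e9 +
	    (t1.tv_nsec - t0.tv_nsec)) / iters);
	return (0);
}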

I broke my shoulder a week ago and so am typing very slowly with one
hand, but patches are very welcome.  There is a PR under review at the
moment to add memmove, which could benefit from some more tuning before
it's merged.

David


> On 5 Mar 2023, at 16:47, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Apologies for significant delay, got other stuff and this fell through
> the cracks.
>
> I'm going to start with snmalloc remarks, memcpy-related response is
> down below.
>
> I find it quite weird there are absolutely no mentions of NUMA in a
> general-purpose allocator made in recent years. While nothing can be
> done for an application which does not use any affinity, what *can* be
> done is to avoid negatively affecting programs which *do* use it, most
> notably by making sure that allocations backed by one page don't land
> in threads from distinct domains. It is unclear to me what, if
> anything, the allocator is doing on this front, but I also don't know
> if this would be a regression vs jemalloc.
>
> One major issue I found is the use of hand-rolled spinlocks -- while
> they may be fine for specialized uses, they are a non-starter for
> general purpose. This is the age-old priority problem: you can end up
> yield()ing all you want while the lock owner is nowhere near a CPU, or
> the scheduler does not think you should give way to it. The least bad
> thing to do would be to add umtx support to remedy this, but that's a
> lot of work and it is probably not warranted. Then the fallback is
> pthread locks -- while pessimal, they don't suffer the problem. That
> the lock operation is not in the fast path does not make the problem
> go away.
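>
> To spell it out, the hand-rolled scheme boils down to something like
> this sketch:
>
> #include <stdatomic.h>
> #include <sched.h>
>
> static atomic_flag lock = ATOMIC_FLAG_INIT;
>
> static void
> spin_lock(void)
> {
> 	/* If the owner was preempted, this can spin "forever": yield()
> 	 * does not tell the scheduler which thread we are waiting for. */
> 	while (atomic_flag_test_and_set(&lock))
> 		sched_yield();
> }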
>
> I tried to do a sensible benchmark of the thing, but ran into snags to
> sort out first. The idea was to 'tinderbox' it, except:
> 1. make's use of poll is a massive scalability bottleneck, addressed
> here in an experimental manner:
> https://github.com/macdice/freebsd/tree/bmake-shmem
> 2. with that out of the way there is still massive contention in the
> vm subsystem. I had a patch for most of it here:
> https://reviews.freebsd.org/D36038 . It rebases cleanly but no longer
> works; I don't know why yet.
>
> There are other issues.
>
> Ultimately if snmalloc is to ship, it would be nice to have it for 14,
> and we are nearing the time where it will be too late, so I'll try to
> sort it all out next week.
>
> Meanwhile there is the lock issue to sort out.
>
> On to memcpy.
>
> On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
>> On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> First and foremost, perhaps I should clear up that I have no opinion
>>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>>> thing.
>>>
>>> Inspecting assembly is what I intended to do, but got the compilation
>>> problem.
>>
>> If you just want to look at the memcpy, you might do better using the
>> upstream (CMake) build system, which builds a shim library that you
>> can disassemble and look at.
>>
>>> So, as someone who worked on memcpy previously, I note the variant
>>> currently implemented in libc is pessimal for sizes > 16 bytes
>>> because it does not use SIMD. I do have plans to rectify that, but
>>> ENOTIME.
>>
>> That's one of the benefits of the C++ implementation.  We were able to
>> try a few dozen variations in a morning.  Writing a single one of those
>> in assembly would take a similar amount of time.
>>
>> We were inspired by the automemcpy paper, which demonstrated that you
>> can write memcpy implementations tuned to specific workloads in
>> higher-level languages and get better performance.
>>
>>> I also note CPUs are incredibly picky when it comes to the starting
>>> point of the routine. The officially recommended alignment of 16
>>> bytes is basically a tradeoff between total binary size and
>>> performance. Should you move the routine at different 16-byte
>>> intervals you will easily find big variation in performance,
>>> depending on how big of an alignment you ended up with.
>>
>> Yes, that's what our test does.  It does a large number of different
>> copies with different sizes and alignments.
>>
>
> I'm not talking about placement of buffers passed to memcpy, I'm
> talking about placement of *code implementing memcpy*.
>
>>> I'm trying to say that if you end up with different funcs depending
>>> on bounds checking and you only align them to 16 bytes, the benchmark
>>> is most likely inaccurate if only for this reason.
>>
>> I'm not sure I follow this, sorry.
>>
>
> If the address of the *func itself* is divisible by 16 but not 32, it
> may end up with a different performance profile depending on uarch.
> And the code may randomly move up and down as you change the source,
> thus stabilizing it at some value (say 32) would be the best thing to
> do.
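>
> An easy way to take that variable out of a benchmark is to force a
> fixed alignment, e.g.:
>
> cc -O2 -falign-functions=32 bench.c
>
> or per function (my_memcpy is just a placeholder name):
>
> #include <stddef.h>
>
> /* keep the entry point on a 32-byte boundary no matter how the
>  * surrounding code shifts */
> __attribute__((aligned(32)))
> void *my_memcpy(void *dst, const void *src, size_t n);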
>
>>>> The fastest on x86 is roughly:
>>>>
>>>> - A jump table for small sizes that does power-of-two-sized small
>>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>>
>>> Last I checked this was not true. The least slow thing to do was to
>>> branch on a few sizes and handle them with overlapping stores.
>>> Roughly what memcpy in libc is doing now.
>>>
>>> Jump table aside, you got all sizes "spelled out", that is just
>>> avoidable impact on icache. While overlapping stores do come with
>>> some penalty, it is cheaper than the above combined.
>>
>> They're not all spelled out, because a lot of them are subsets of the
>> others.  The compiler does a *very* good job of optimising this.  The
>> generated assembly was a lot more complex than anything I could write
>> by hand.
>>
>
> I said they are spelled out in the code which ends up being generated;
> it looks like this:
> [snip]
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> pop    %rbp
> ret
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> movzbl 0x8(%rsi),%eax
> mov    %al,0x8(%rdi)
> pop    %rbp
> ret
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> movzwl 0x8(%rsi),%eax
> mov    %ax,0x8(%rdi)
> pop    %rbp
> ret
> [/snip]
>
> That section is *massive* and it is pessimal; once more, the thing to
> do is a few branches to land in sections handling stuff with
> overlapping stores, akin to what bionic is doing here:
> https://android.googlesource.com/platform/bionic/+/refs/heads/master/libc/arch-x86_64/string/sse2-memmove-slm.S
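>
> The pattern in C, for, say, the 8..16 byte class, is a one-branch
> affair (a sketch; each fixed-size memcpy below compiles to a single
> 8-byte mov):
>
> #include <stdint.h>
> #include <string.h>
>
> /* valid for 8 <= n <= 16; the two stores may overlap in the middle */
> static void
> copy_8_to_16(void *dst, const void *src, size_t n)
> {
> 	uint64_t head, tail;
>
> 	memcpy(&head, src, 8);				/* first 8 bytes */
> 	memcpy(&tail, (const char *)src + n - 8, 8);	/* last 8 bytes */
> 	memcpy(dst, &head, 8);
> 	memcpy((char *)dst + n - 8, &tail, 8);
> }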
>
>>> I don't have numbers nor bench code handy, but if you insist on
>>> contesting the above I'll be glad to provide them.
>>>
>>>> - A vectorised copy for medium-sized copies using a loop of SSE
>>>> copies.
>>>
>>> Depends on what you mean by medium and which simd instructions, but
>>> yes, they should be used here.
>>
>> This could probably do with more tuning, but all of this is
>> configurable with a couple of template parameters.  If it's useful to
>> have more variants for different microarchitectures then it's a
>> trivial tweak to generate them.
>>
>>>> - rep movsb for large copies.
>>>
>>> There are still old cpus here and there which don't support ERMS.
>>> They are very negatively impacted by the above and should roll with
>>> rep movsq instead.
>>
>> We tested on some microarchitectures without FSRM but didn't compare
>> with rep movsq as an alternative.  The rep movsb is inline assembly
>> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
>> so it would be quite easy to compare.  Very happy to take patches that
>> make things faster!
>>
>
>> Each tuned version is a dozen lines of code (plus a lot of comments
>> to explain *why* it does things a particular way), so it's very easy
>> to add different versions with different tuning.
>>
>>> I see the code takes some care to align the target buffer. That's
>>> good, but not necessary on CPUs with FSRM.
>>
>> It doesn't hurt though; on the sizes where it's actually beneficial
>> to use the rep sequence the cost of alignment is negligible.
>>
>>> All that said, I have a hard time believing the impact of bounds
>>> checking on short copies is around 20% or so. The code to do it
>>> looks super slow (not that I know how to do better).
>>
>> The last time I looked at the disassembly, the code for the bounds
>> check touched three cache lines and had two branches.  It fits well in
>> a superscalar pipeline so adds a handful of cycles.  This is basically
>> in the noise for large copies but is double-digit percentage overhead
>> for small ones.
>>
>
> I'll be blunt here: that's misleading at best and intellectually
> dishonest at worst.
>
> By your own admission very small copies are most common, and the very
> paper you refer to (automemcpy) gives a figure of 96% of real-world
> copies being 128 bytes or less. These are also the copies which are
> most affected.
>
> As I tried to explain, super short copies have very little work to do,
> so the impact stated above can't be an accurate estimate on top of an
> ~optimal routine.
>
> Your benchmark code artificially pessimizes the comparison against
> whatever is in libc, because its calls go through the PLT while calls
> to your own routine don't. The difference does matter for short copies.
>
> I ran a quick check against glibc as shipped with Ubuntu 22. Your
> routine was massively slower for sizes < 12 and then started evening
> up with glibc; after that it was going back and forth due to what's
> likely just measurement error.
>
>> My original version extended the FreeBSD assembly memcpy with a call
>> to a function that did the bounds checks, but that had *much* worse
>> performance.  We could insert assembly to do the bounds checks, but
>> that's hard to update if we change the metadata layout in malloc.
>>
>
> I'm not saying the variant as shipped now is any good or should be
> used; I'm saying overlapping stores need to be used as opposed to a
> jump table. The current code should be whacked.
>
> Even then of course it had to get slower -- you added a func call in
> there.
>
>> We could modify the ifunc resolvers to use the current memcpy if you
>> don't care about bounds checks.  I did a tiny amount of testing
>> (mostly clang and lld) in this configuration and it was slower than
>> using the snmalloc memcpy, but I didn't do enough benchmarking to be
>> confident of that.  Hence the CFT.
>>
>
> Once more, see previous remarks about the func currently implemented
> not being fast.
>
> The thing *to do* in the area is to implement a sensible memcpy; the
> automemcpy paper may well be a good basis for it.
>
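> The dispatch side could then be an ifunc resolver keyed on CPUID,
> roughly like this (untested; the variant names are made up):
>
> #include <cpuid.h>
> #include <stddef.h>
>
> typedef void *(*memcpy_t)(void *, const void *, size_t);
>
> void *memcpy_erms(void *, const void *, size_t);
> void *memcpy_generic(void *, const void *, size_t);
>
> static memcpy_t
> memcpy_resolver(void)
> {
> 	unsigned int a, b, c, d;
>
> 	/* ERMS is CPUID.(EAX=7,ECX=0):EBX bit 9 */
> 	if (__get_cpuid_count(7, 0, &a, &b, &c, &d) && (b & (1u << 9)))
> 		return (memcpy_erms);
> 	return (memcpy_generic);
> }
>
> void *memcpy(void *, const void *, size_t)
>     __attribute__((ifunc("memcpy_resolver")));
>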
> -- 
> Mateusz Guzik <mjguzik gmail.com>



