Date: Wed, 8 Mar 2017 16:03:46 +0100
From: Mateusz Guzik <mjguzik@gmail.com>
To: Slawa Olhovchenkov <slw@zxy.spb.ru>
Cc: Kevin Bowling <kevin.bowling@kev009.com>, freebsd-net <freebsd-net@freebsd.org>, "Eugene M. Zheganin" <emz@norma.perm.ru>
Subject: Re: about that DFBSD performance test
Message-ID: <20170308150346.GA32269@dft-labs.eu>
In-Reply-To: <20170308125710.GS15630@zxy.spb.ru>
References: <b91a6e40-9956-1ad9-ac59-41a281846147@norma.perm.ru> <CAK7dMtDiT-PKyy5LkT1WEg5g-nwqv501F=Ap4dNCdwzwr_1dqA@mail.gmail.com> <20170308125710.GS15630@zxy.spb.ru>
On Wed, Mar 08, 2017 at 03:57:10PM +0300, Slawa Olhovchenkov wrote:
> On Wed, Mar 08, 2017 at 05:25:57AM -0700, Kevin Bowling wrote:
> >
> > Right off the bat, FreeBSD doesn't really understand NUMA in any
> > sufficient capacity. Unfortunately at companies like the one I work
> > at, we take that to mean "OK buy a high bin CPU and only populate one
> > socket", which serves NUMA applicable only to highly-local computed
> > tasks.
>
> http/https/any_network_related serving is not related to this.
> Indeed, on modern CPUs it is not important to bind NIC irq handlers to
> the same CPU/socket as the NIC.
>

Well, for both benchmarks this is both true and false.

First and foremost there is general kernel scalability. Certain counters
and most locks are managed purely with atomic operations. An atomic
operation grabs the entire cacheline containing the particular variable
(64 bytes in total) in exclusive mode. If you have to do an atomic
operation, you are somewhat slower than you would be otherwise. If you
have to do an atomic operation and another CPU has the cacheline, you
are visibly slower. And if the cacheline travels a lot between CPUs
(e.g. because the lock is contended), performance degrades rapidly.
NUMA increases the cost of cacheline bounces, making an already bad
situation even worse. Locking primitives are affected by NUMA
significantly more than they have to be (I'm working on that), but any
fixes in the area are just bandaids.

For instance, I reproduced the http benchmark and indeed I get about
75k req/s on a 2 * 10 * 2 box, although I'm only using one client.
Profiling shows excessive contention on the accept lock and something
else in the socket layer. The latter comes from kqueue being extremely
inefficient, acquiring and releasing the same lock about 4 times per
call on average (if it took it *once*, that would significantly reduce
lock bouncing, including across the socket to a different node).
But even taking it once is likely too bad: no matter how fast this can
realistically get, if all nginx processes serialize on this lock, it is
not going to scale. That said, the end result would be significantly
higher if lock granularity was better, and I suspect NUMA-awareness
would not be a significant factor in the http benchmark: provided locks
are granular enough, they would travel across the socket only when
pushed out of the cache (which would be rare), and there would be no
contention.

This is a small excerpt from a reply I intend to write to the other
thread where the 'solisten' patch is discussed. It gets rid of the
accept lock contention, but this increases the load on another lock and
that temporarily slows things down.

--
Mateusz Guzik <mjguzik gmail.com>