From owner-freebsd-net@freebsd.org Wed Mar 8 15:22:17 2017
Date: Wed, 8 Mar 2017 18:22:12 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Mateusz Guzik
Cc: Kevin Bowling, freebsd-net, "Eugene M. Zheganin"
Subject: Re: about that DFBSD performance test
Message-ID: <20170308152212.GD70430@zxy.spb.ru>
References: <20170308125710.GS15630@zxy.spb.ru> <20170308150346.GA32269@dft-labs.eu>
In-Reply-To: <20170308150346.GA32269@dft-labs.eu>
List-Id: Networking and TCP/IP with FreeBSD

On Wed, Mar 08, 2017 at 04:03:46PM +0100, Mateusz Guzik wrote:
> On Wed, Mar 08, 2017 at 03:57:10PM +0300, Slawa Olhovchenkov wrote:
> > On Wed, Mar 08, 2017 at 05:25:57AM -0700, Kevin Bowling wrote:
> >
> > > Right off the bat, FreeBSD doesn't really understand NUMA in any sufficient
> > > capacity.
> > > Unfortunately, at companies like the one I work at, we take that
> > > to mean "OK, buy a high-bin CPU and only populate one socket," which serves
> >
> > NUMA is applicable only to tasks with high data locality.
> > http/https/any network-related serving is not like that.
> > Indeed, on modern CPUs it is not important to bind NIC irq handlers to
> > the same CPU/socket as the NIC.
> >
>
> Well, for both benchmarks this is both true and false.

Hmm, I was not clear. My main point: at the time a connection is
accepted, it is impossible to know the location [CPU/NUMA domain] of the
data needed to process the request, i.e. it is impossible to take full
advantage of NUMA. The data for a request may be cached on another NUMA
domain, the HDD with the data may be connected to an HBA on a different
PCIe domain, etc.

> First and foremost there is general kernel scalability. Certain counters
> and most locks are purely managed with atomic operations. An atomic
> operation grabs the entire cacheline with the particular variable (64
> bytes in total) in exclusive mode.
>
> If you have to do an atomic operation, you are somewhat slower than you
> would be otherwise.
>
> If you have to do an atomic operation and another cpu has the cacheline,
> you are visibly slower. And if the cacheline travels a lot between cpus
> (e.g. because the lock is contended), the performance degrades rapidly.
> NUMA increases the cost of cacheline bounces, making the already bad
> situation even worse.

Travel to another die does not carry a significant penalty on modern
CPUs with a correct BIOS configuration. The same holds for older CPUs
(cores on the same die sync through DRAM).

> Locking primitives are affected by NUMA significantly more than they
> have to be (I'm working on that), but any fixes in the area are just
> bandaids.
>
> For instance, I reproduce the http benchmark and indeed I get about
> 75k req/s on a 2 * 10 * 2 box, although I'm only using one client.
>
> Profiling shows excessive contention on the 'accept lock' and something
> else from the socket layer.
> The latter comes from kqueue being extremely
> inefficient, acquiring and releasing the same lock about 4 times per
> call on average (if it took it *once* it would significantly reduce lock
> bouncing around, including across the socket to a different node). But
> even taking it once is likely too bad - no matter how fast this can
> realistically get, if all nginx processes serialize on this lock it is
> not going to scale.

Hmm, what is the 'events' configuration of nginx? And how many workers?
Is 'accept_mutex off;' present?

> That said, the end result would be significantly higher if lock
> granularity was better

Always true.

> and I suspect numa-awareness would not be a
> significant factor in the http benchmark - provided locks are granular
> enough, they would travel across the socket only if they get pushed out
> of the cache (which would be rare), but there would be no contention.
>
> This is a small excerpt from a reply I intend to write to the other
> thread where the 'solisten' patch is discussed. It gets rid of the
> accept lock contention, but this increases the load on another lock and
> that temporarily slows things down.
>
> --
> Mateusz Guzik