From owner-freebsd-net@freebsd.org Wed Mar 8 15:22:17 2017
Date: Wed, 8 Mar 2017 18:22:12 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Mateusz Guzik
Cc: Kevin Bowling, freebsd-net, "Eugene M. Zheganin"
Subject: Re: about that DFBSD performance test
Message-ID: <20170308152212.GD70430@zxy.spb.ru>
References: <20170308125710.GS15630@zxy.spb.ru> <20170308150346.GA32269@dft-labs.eu>
In-Reply-To: <20170308150346.GA32269@dft-labs.eu>
List-Id: Networking and TCP/IP with FreeBSD

On Wed, Mar 08, 2017 at 04:03:46PM +0100, Mateusz Guzik wrote:
> On Wed, Mar 08, 2017 at 03:57:10PM +0300, Slawa Olhovchenkov wrote:
> > On Wed, Mar 08, 2017 at 05:25:57AM -0700, Kevin Bowling wrote:
> >
> > > Right off the bat, FreeBSD doesn't really understand NUMA in any sufficient
> > > capacity.
> > > Unfortunately, at companies like the one I work at, we take that
> > > to mean "OK, buy a high-bin CPU and only populate one socket," which serves
> >
> > NUMA is applicable only to tasks with high data locality.
> > http/https/any network-related serving is not like that.
> > Indeed, on modern CPUs it is not important to bind NIC irq handlers to
> > the same CPU/socket as the NIC.
> >
>
> Well, for both benchmarks this is both true and false.

Hmm, I was not clear. My main point: at the time a connection is
accepted, it is impossible to know the location [CPU/NUMA domain] of the
data needed to process the request, i.e. it is impossible to take full
advantage of NUMA. The data for a request may be cached on another NUMA
domain, the HDD with the data may be connected to an HBA on a different
PCIe domain, etc.

> First and foremost there is general kernel scalability. Certain counters
> and most locks are purely managed with atomic operations. An atomic
> operation grabs the entire cacheline with the particular variable (64
> bytes in total) in exclusive mode.
>
> If you have to do an atomic operation, you are somewhat slower than you
> would be otherwise.
>
> If you have to do an atomic operation and another cpu has the cacheline,
> you are visibly slower. And if the cacheline travels a lot between cpus
> (e.g. because the lock is contended), the performance degrades rapidly.
> NUMA increases the cost of cacheline bounces, making the already bad
> situation even worse.

Travel to another die does not carry a significant penalty on modern
CPUs with a correct BIOS configuration. The same holds for older CPUs
(cores on the same die sync through DRAM).

> Locking primitives are affected by NUMA significantly more than they
> have to be (I'm working on that), but any fixes in the area are just
> bandaids.
>
> For instance, I reproduce the http benchmark and indeed I get about
> 75k req/s on a 2 * 10 * 2 box, although I'm only using one client.
>
> Profiling shows excessive contention on the 'accept lock' and something
> else from the socket layer.
> The latter comes from kqueue being extremely
> inefficient, acquiring and releasing the same lock about 4 times per
> call on average (if it took it *once* it would significantly reduce lock
> bouncing around, including across the socket to a different node). But
> even taking it once is likely too bad - no matter how fast this can
> realistically get, if all nginx processes serialize on this lock it is
> not going to scale.

Hmm, what is the 'events' configuration of nginx? And how many workers?
Is 'accept_mutex off;' present?

> That said, the end result would be significantly higher if lock
> granularity was better

Always true.

> and I suspect numa-awareness would not be a
> significant factor in the http benchmark - provided locks are granular
> enough, they would travel across the socket only if they get pushed out
> of the cache (which would be rare), but there would be no contention.
>
> This is a small excerpt from a reply I intend to write to the other
> thread where the 'solisten' patch is discussed. It gets rid of the
> accept lock contention, but this increases the load on another lock and
> that temporarily slows things down.
>
> --
> Mateusz Guzik