Date:      Sat, 4 Jan 2020 22:34:49 -1000 (HST)
From:      Jeff Roberson <jroberson@jroberson.net>
To:        Mark Linimon <linimon@lonesome.com>
Cc:        Jeff Roberson <jeff@FreeBSD.org>, src-committers@freebsd.org,  svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r356348 - in head/sys: kern vm
Message-ID:  <alpine.BSF.2.21.9999.2001042216480.1198@desktop>
In-Reply-To: <20200105013314.GA3681@lonesome.com>
References:  <202001040315.0043FYhn047977@repo.freebsd.org> <20200105013314.GA3681@lonesome.com>

On Sun, 5 Jan 2020, Mark Linimon wrote:

> On Sat, Jan 04, 2020 at 03:15:34AM +0000, Jeff Roberson wrote:
>>   Use a separate lock for the zone and keg.
>
> Out of curiosity, will there be measurable real-world speedups from
> this and similar work, or will this mostly apply to edge cases, or ... ?

It depends on which real world.  A lot of workloads don't show much
allocator activity at all.  For very high-speed networking, and
especially very high-speed networking on big NUMA machines, the speedup
is considerable.  Netflix reported that the earlier round of work cut
the time spent in UMA by about 30%.  For non-NUMA machines the last ~6
patches cut another 30% off of that in my tests.  Even for Netflix, UMA
was not in the top 5 of their profile before this work.

The major perf upshot was roughly an 8x improvement when memory is
freed on a different NUMA domain than it was allocated from while the
allocation policy is first-touch.  This is called a cross-domain or
'xdomain' free in the code.  That improvement made it possible to
enable first-touch for UMA by default on all NUMA machines.
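
To illustrate the mechanism, here is a minimal user-space sketch of a
cross-domain free path.  All of the names, structures, and the batch
size below are invented for illustration; this is not the actual UMA
code:

/*
 * Hypothetical sketch of a cross-domain ("xdomain") free.  Every item
 * remembers the domain it was first-touch allocated from; a free from
 * a CPU in a different domain is staged locally and returned to the
 * item's home domain in batches.  Illustrative only, not UMA's code.
 */
#include <stddef.h>

#define	MAXDOMAIN	2
#define	XBATCH		128

struct item {
	int		 home_domain;	/* domain of first-touch alloc */
	struct item	*next;
};

struct xdomain_bucket {
	struct item	*head;		/* items awaiting home domain */
	int		 cnt;
};

struct pcpu_cache {
	int			 domain;	/* domain of this CPU */
	struct item		*freelist;	/* fast local cache */
	struct xdomain_bucket	 xbucket[MAXDOMAIN];
};

/* Return a chain of items to the keg for 'domain' (stub). */
static void
keg_free_batch(int domain, struct item *head)
{
	/*
	 * A real allocator would take the per-domain keg lock once
	 * here and return the whole chain; omitted in this sketch.
	 */
	(void)domain;
	(void)head;
}

void
cache_free(struct pcpu_cache *c, struct item *it)
{
	struct xdomain_bucket *b;

	if (it->home_domain == c->domain) {
		/* Common case: free to the local per-CPU cache. */
		it->next = c->freelist;
		c->freelist = it;
		return;
	}
	/* Cross-domain free: stage the item for its home domain. */
	b = &c->xbucket[it->home_domain];
	it->next = b->head;
	b->head = it;
	if (++b->cnt >= XBATCH) {	/* flush in batches */
		keg_free_batch(it->home_domain, b->head);
		b->head = NULL;
		b->cnt = 0;
	}
}

The design choice the sketch tries to capture is batching: the remote
domain's lock is taken once per bucket rather than once per free.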

I wrote a simple allocator perf test that loops allocating 2k mbufs
and appending them to a random remote core's queue, after which it
drains its own local queue.  10 million iterations across 32 cores in
two NUMA domains gives 320,000,000 packets allocated and freed.  The
time when freeing within the same domain was about 4 seconds.  Before
this patch series, freeing to a different domain took around 40
seconds; after it, around 5 seconds.  That is only a ~25% penalty
while doing 2 million packets per second per core.
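
For reference, the shape of that test is roughly the following
user-space sketch.  malloc/free stand in for the kernel mbuf
allocator, and the queue implementation, sizes, and names are all
invented for illustration:

/*
 * Hedged sketch of the benchmark shape: each thread allocates a 2k
 * buffer, hands it to a random remote thread's queue, then drains and
 * frees everything queued to it.  Not the real in-kernel test.
 */
#include <pthread.h>
#include <stdlib.h>

#define	NCORES	32
#define	QSIZE	4096
#define	ITERS	10000000L

struct queue {
	pthread_mutex_t	 lock;
	void		*items[QSIZE];
	unsigned long	 head, tail;	/* next slot to fill / drain */
};

static struct queue queues[NCORES];

static void
enqueue(struct queue *q, void *p)
{
	pthread_mutex_lock(&q->lock);
	if (q->head - q->tail < QSIZE)
		q->items[q->head++ % QSIZE] = p;
	else
		free(p);	/* queue full; free locally in sketch */
	pthread_mutex_unlock(&q->lock);
}

static void
drain(struct queue *q)
{
	pthread_mutex_lock(&q->lock);
	while (q->tail != q->head)
		free(q->items[q->tail++ % QSIZE]);
	pthread_mutex_unlock(&q->lock);
}

static void *
worker(void *arg)
{
	int self = (int)(long)arg;
	long i;

	for (i = 0; i < ITERS; i++) {
		void *m = malloc(2048);	/* stands in for a 2k mbuf */
		/* Always pick a *remote* core; rand() is not
		 * thread-safe but good enough for a sketch. */
		enqueue(&queues[(self + 1 + rand() % (NCORES - 1)) %
		    NCORES], m);
		drain(&queues[self]);	/* free what others sent us */
	}
	return (NULL);
}

int
main(void)
{
	pthread_t tid[NCORES];
	int i;

	for (i = 0; i < NCORES; i++)
		pthread_mutex_init(&queues[i].lock, NULL);
	for (i = 0; i < NCORES; i++)
		pthread_create(&tid[i], NULL, worker, (void *)(long)i);
	for (i = 0; i < NCORES; i++)
		pthread_join(tid[i], NULL);
	for (i = 0; i < NCORES; i++)
		drain(&queues[i]);	/* free any leftovers */
	return (0);
}

In the real test the threads would be pinned to specific cores so that
the enqueue target's domain determines whether a free is same-domain
or cross-domain; the sketch omits the pinning.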

Many of the recent changes were really as much about code organization
and readability as about performance.  After 18 years of features
coming and going, reorganizations, etc., it was getting a bit crufty.

Jeff

>
> mcl
>


