From: Andre Oppermann <andre@freebsd.org>
Date: Wed, 21 Aug 2013 16:59:44 +0200
To: Navdeep Parhar
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r254520 - in head/sys: kern sys
Message-ID: <5214D5E0.9040002@freebsd.org>
In-Reply-To: <521291F1.8060500@FreeBSD.org>

On 19.08.2013 23:45, Navdeep Parhar wrote:
> On 08/19/13 13:58, Andre Oppermann wrote:
>> On 19.08.2013 19:33, Navdeep Parhar wrote:
>>> On 08/19/13 04:16, Andre Oppermann wrote:
>>>> Author: andre
>>>> Date: Mon Aug 19 11:16:53 2013
>>>> New Revision: 254520
>>>> URL: http://svnweb.freebsd.org/changeset/base/254520
>>>>
>>>> Log:
>>>>   Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree
>>>>   users for a very long time, if ever.
>>>>
>>>>   Should such functionality ever be needed again, the appropriate and
>>>>   much better way to do it is through a custom EXT_SOMETHING external
>>>>   mbuf type together with a dedicated *ext_free function.
>>>>
>>>>   Discussed with: trociny, glebius
>>>>
>>>> Modified:
>>>>   head/sys/kern/kern_mbuf.c
>>>>   head/sys/kern/uipc_mbuf.c
>>>>   head/sys/sys/mbuf.h
>>>
>>> Hello Andre,
>>>
>>> Is this just garbage collection or is there some other reason for this?
>>
>> This is garbage collection and removal of not-quite-right, rotten
>> functionality.
>>
>>> I recently tried some experiments to reduce the number of mbuf and
>>> cluster allocations in a 40G NIC driver. M_NOFREE and EXT_EXTREF proved
>>> very useful and the code changes to the kernel were minimal. See
>>> user/np/cxl_tuning. The experiment was quite successful and I was
>>> planning to bring in most of those changes to HEAD. I was hoping to get
>>> some runtime mileage on the approach in general before tweaking the
>>> ctors/dtors for jumbop, jumbo9, and jumbo16 to allow for an mbuf+refcnt
>>> within the cluster. But now M_NOFREE has vanished without a warning...
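For anyone following along who hasn't read user/np/cxl_tuning: the core of
the experiment is one large receive buffer carrying several frames, with the
reference count embedded in the buffer itself, so the buffer can only be
recycled once the last frame referencing it is freed. A rough userspace
sketch of just that mechanism (made-up names, not the actual cxgbe(4) or
mbuf(9) code) looks like this:

/*
 * Illustrative userspace model only: one large rx buffer holds several
 * frames, the refcount is embedded in the buffer itself, and a custom
 * free callback recycles the buffer when the last reference goes away.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct rxbuf {
	atomic_int	 refcnt;	/* embedded refcount */
	char		 data[4096];	/* frames are DMA'd back to back in here */
};

struct pktview {			/* stands in for an m_ext mbuf */
	struct rxbuf	*buf;
	char		*payload;	/* this frame's bytes inside buf->data */
	size_t		 len;
	void		(*ext_free)(struct rxbuf *);	/* custom free routine */
};

static void
rxbuf_free(struct rxbuf *b)
{
	printf("last reference dropped, recycling %p\n", (void *)b);
	free(b);	/* a driver would put it back on the rx ring instead */
}

static void
pktview_init(struct pktview *p, struct rxbuf *b, size_t off, size_t len)
{
	atomic_fetch_add(&b->refcnt, 1);
	p->buf = b;
	p->payload = b->data + off;
	p->len = len;
	p->ext_free = rxbuf_free;
}

static void
pktview_release(struct pktview *p)
{
	if (atomic_fetch_sub(&p->buf->refcnt, 1) == 1)
		p->ext_free(p->buf);	/* only the last release recycles */
}

int
main(void)
{
	struct rxbuf *b = calloc(1, sizeof(*b));
	struct pktview p0, p1;

	atomic_init(&b->refcnt, 0);
	pktview_init(&p0, b, 0, 1500);	/* two frames, same buffer */
	pktview_init(&p1, b, 1536, 60);
	pktview_release(&p0);		/* buffer survives, p1 still holds it */
	pktview_release(&p1);		/* buffer recycled here */
	return (0);
}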
>>
>> I'm looking through your experimental code and those are some really
>> good numbers you're achieving there!
>>
>> However, a couple of things don't feel quite right, hackish even, and
>> not fit for HEAD. This is a bit the same situation we had with some of
>> the first 1GigE cards quite a number of years back (mostly ti(4)).
>> There we ended up with a couple of just-good-enough hacks to make them
>> fast. Most of the remains I've collected today.
>
> If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
> arguing that they were, before r254520) then the changes are perfectly
> legitimate. The only hackish part was that I was getting the cluster
> from the jumbop zone while bypassing its normal refcnt mechanism. This
> I did so as to use the same zone as m_uiotombuf to keep it "hot" for
> all consumers (driver + network stack).

If you insist I'll revert the commit removing M_NOFREE. EXT_EXTREF isn't
touched yet, but should get better support.

The hackish part for me is that the driver again manages its own memory
pool. Windows works that way and NetBSD is moving towards it, while
FreeBSD has, and remains at, a central network memory pool. The latter
(our current) way of doing it seems more efficient overall, especially
on heavily loaded networked machines. Significant queues may build up
(think of an app that is blocked while many of its socket buffers fill
up), delaying the freeing and returning of network memory resources.
Together with fragmentation this can lead to very bad outcomes. Router
applications with many interfaces also greatly benefit from central
memory pools. So I'm really not sure that we should move back in the
driver-owned-pool direction, with lots of code duplication and
copy-pasting (see NetBSD). Also it is kinda weird to have a kernel-based
pool for data going down the stack and another one in each driver for
data going up. Actually I'm of the opinion that we should stay with the
central memory pool and fix it so that it works just as well in those
cases where a driver pool currently performs better.

>> I believe most of the improvements you've shown can be implemented in
>> a more generic and safe way in the mbuf system. Also, a number of
>> things in your experimental code may have side-effects in situations
>> other than netperf runs.
>
> Agreed. As I mentioned, my long-term plan was to tweak the jumboX zones
> to allow them to allocate clusters with an embedded mbuf + refcount.
> The M_NOFREE+EXT_EXTREF approach is the perfect bridge from here to
> there. It is non-intrusive and lends itself well to experimentation.

Agreed, full support on fixing the refcount issue.

>> To summarize what I get from your experimental branch commits:
>> - the Chelsio T4/T5 card can DMA multiple ethernet frames (packets)
>>   into a single memory buffer, provided it is large enough.
>> - you make use of that feature and point multiple m_ext mbufs into
>>   that buffer to separate the packets and send them up the stack.
>> - you embed the m_ext refcount in the single memory buffer as well.
>
> yes, yes, and yes.
>
>> - you recycle mbufs? (I'm not entirely clear on that as I'm not
>>   familiar with the cxgbe code)
>
> I recycle the cluster (and the embedded mbuf in it) when possible. It
> all depends on whether it's still in use by the time the rx queue needs
> it back.

There are always a couple of problems with driver-managed pools. Driver
shutdown/unloading requires all such managed mbufs to have been returned
before it can proceed.
This may take an arbitrarily long time, as mbufs can sit in socket
buffers or other queues.

>> Let's examine and discuss these parts:
>> - M_NOFREE wasn't really safe with bpf anyway, at least for packets
>>   going down the stack.
>
> I see. Can you point out the parts of bpf unable to deal with M_NOFREE?

Sorry, false alarm.

>> - Instead of M_NOFREE a custom *ext_free should be used, which has the
>>   same and even more functionality.
>
> Yes, that's what I was planning too, with the jumboX zone changes. It
> would be faster than running the m_ext's free function (which is a
> function dereference + call).

It would be a bit faster, but how do you know when an mbuf itself has
been freed and can be reused, without a notification?

>> - Recycling mbufs may cause imbalances in the mbuf pool with multiple
>>   cores and possibly NUMA in the future. Not that we are very good
>>   with it at the moment, but bypassing the mbuf allocator shouldn't
>>   become the norm. If it becomes a problem/bottleneck again it should
>>   be fixed, not worked around and copy-pasted n times in so many
>>   drivers.
>
> If/when a cluster is recycled, it is given back to the same rx ithread
> that originally allocated it, and not to any other queue. If the
> ithread stays in the same NUMA domain (and it really should) then
> recycling the cluster in the same queue really shouldn't cause
> imbalances.

OK. (Side note: at the moment the mbuf allocator isn't NUMA-aware at all.)

>> - jumbo9 and jumbo16 mbufs should not be used because they are more
>>   special than necessary, being both KVM and physically contiguous.
>>   This probably isn't necessary for the T4/T5 cards or any other
>>   modern DMA engine. Under heavy, diverse network load the kernel
>>   memory becomes fragmented and the allocator can't find memory
>>   fulfilling both criteria anymore. In fact both are an artifact of
>>   the early GigE hacks, when high-speed DMA engines were in their
>>   infancy. Both jumbo9 and jumbo16 should go away without direct
>>   replacement.
>
> The devices I deal with will be able to cope, but I suspect this will
> be disruptive (for not enough reason to justify the disruption, imho)
> to others. That's a separate discussion anyway.

Again, using jumbo9 or jumbo16 doesn't scale. Wollman had it fail quite
badly on his heavily loaded NFS server. The way to go should be 4K
clusters, as they are native to the architecture. IIRC a PCIe DMA can't
cross a 4K boundary anyway, so having a number of them chained together
shouldn't have much of an impact except for consuming descriptors? (A
quick back-of-the-envelope sketch of that arithmetic follows further
down.) On a well-loaded system it's almost impossible to obtain
physically contiguous pages after some time, except through
pre-allocation. But then the pre-allocation has to be properly sized in
advance and takes memory away from other uses.

>> In your T4/T5 case the alternative would be either to a) allocate your
>> own memory directly from KVM with the necessary properties (KVM and/or
>> physically contiguous), or b) have such a generic kernel mbuf type
>> fulfilling the same role. There may be some cache-line issues on
>> non-x86 systems that have to be thought through and taken care of.
>> - Refcounts should have the option of being separately allocated. It
>>   was mandatory to use the refcount zone so far because there wasn't
>>   any other type of refcount. Now that we have one we should revisit
>>   the issue. Actually the entire mbuf allocation and initialization
>>   could be streamlined as a whole.
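Picking up the descriptor arithmetic promised above: assuming a DMA
segment indeed may not cross a 4K boundary (my recollection, not
verified), a 9000-byte frame costs the same number of descriptors whether
it sits in one physically contiguous jumbo9 cluster or in three
page-sized clusters. A rough userspace back-of-the-envelope sketch,
illustrative constants only:

/*
 * Back-of-the-envelope only: count DMA descriptors for a 9000-byte frame
 * under the assumption that a segment may not cross a 4K boundary.
 * The constant and the boundary rule are assumptions for illustration.
 */
#include <stdio.h>

#define PAGE_SZ	4096

static int
dma_segs(size_t off, size_t len)	/* segments needed for one buffer */
{
	int segs = 0;

	while (len > 0) {
		size_t room = PAGE_SZ - (off % PAGE_SZ);
		size_t chunk = len < room ? len : room;

		off += chunk;
		len -= chunk;
		segs++;
	}
	return (segs);
}

int
main(void)
{
	size_t frame = 9000;
	size_t left = frame;
	int segs = 0;

	/* One physically contiguous jumbo9 cluster still spans three pages. */
	printf("jumbo9:  %d descriptors\n", dma_segs(0, frame));

	/* The same frame split across page-sized (4K) clusters. */
	while (left > 0) {
		size_t chunk = left < PAGE_SZ ? left : PAGE_SZ;

		segs += dma_segs(0, chunk);
		left -= chunk;
	}
	printf("3 x 4K:  %d descriptors\n", segs);
	return (0);
}

Both cases come out to three descriptors, which is the point: under that
assumption the physically contiguous allocation buys nothing once the DMA
engine has to break at page boundaries anyway.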
>>
>> On a side note, a different mbuf handling for the RX DMA rings may
>> give some significant improvements as well: allocate just a cluster,
>> without an mbuf, through m_cljget() and put it into the RX ring. Only
>> when the DMA has put a packet into it is an mbuf allocated and
>> attached (in the driver's RX function). This avoids the cache
>> pollution from touching the mbuf fields during initial allocation,
>> including the refcount.
>
> Er, this is pretty much how cxgbe(4) has always done it. A cluster is
> allocated with m_cljget, and an mbuf is allocated with MT_NOINIT to
> avoid touching any line in the mbuf. m_init is called later in the rx
> ithread.

Excellent. I haven't (yet) spent much time reading the cxgbe(4) code.

>> Another nice trick would be to shorten the mbuf by 4 bytes (in
>> ext_size) and put the refcount there.
>>
>> Let's work on these together.
>
> Yes. I think we agree on the long-term optimization, or are at least
> headed in the same general direction. But removing M_NOFREE removed a
> dirt-cheap way to test and exercise the approach without implementing
> the "real thing." Why not just let it be?

I don't mind too much having M_NOFREE return but, while convenient, I'm
not convinced at all that it is the right and best way of handling that
use case.

-- 
Andre
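P.S. For anyone curious what the "shorten by 4 bytes and put the refcount
there" trick could look like, here is a rough userspace sketch of one
possible reading of it; names and sizes are illustrative, not the actual
mbuf(9) code:

/*
 * Rough sketch, userspace only: the usable external size is reduced by
 * sizeof(refcount) and the counter lives at the tail of the cluster, so
 * no separate refcount allocation is needed.  Illustrative names/sizes.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define CLSIZE	2048			/* stand-in for MCLBYTES */

struct extview {
	char		*buf;		/* the cluster */
	size_t		 ext_size;	/* usable bytes, shortened by the refcount */
	atomic_uint	*refcnt;	/* lives in the tail of the cluster */
};

static void
ext_attach(struct extview *e, char *cl)
{
	e->buf = cl;
	e->ext_size = CLSIZE - sizeof(*e->refcnt);
	e->refcnt = (atomic_uint *)(void *)(cl + e->ext_size);
	atomic_init(e->refcnt, 1);
}

static void
ext_ref(struct extview *e)
{
	atomic_fetch_add(e->refcnt, 1);
}

static void
ext_release(struct extview *e)
{
	if (atomic_fetch_sub(e->refcnt, 1) == 1)
		free(e->buf);		/* last reference frees the cluster */
}

int
main(void)
{
	struct extview a, b;

	ext_attach(&a, malloc(CLSIZE));
	b = a;
	ext_ref(&b);			/* a shared copy, refcount now 2 */

	printf("usable bytes: %zu of %d\n", a.ext_size, CLSIZE);
	ext_release(&a);
	ext_release(&b);		/* cluster freed here */
	return (0);
}

Whether giving up 4 bytes of every cluster is worth avoiding the separate
refcount allocation is exactly the kind of trade-off that would need
measuring.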