From: Andre Oppermann <andre@freebsd.org>
Date: Tue, 12 Mar 2013 17:33:04 +0100
To: Gleb Smirnoff
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r248196 - head/sys/nfs

On 12.03.2013 16:50, Gleb Smirnoff wrote:
> On Tue, Mar 12, 2013 at 04:31:05PM +0100, Andre Oppermann wrote:
> A> > If you are concerned about using jumbos that are > PAGE_SIZE, then I can
> A> > extend the API in my patch. ... done.
> A> >
> A> > Patch attached.
> A> >
> A> > The NFS code itself guarantees that it won't request more than MCLBYTES,
> A> > so using bare m_get2() here is safe. I can add a flag there later for
> A> > clarity.
> A>
> A> Using PAGE_SIZE clusters is perfectly fine and no flag to prevent that
> A> is necessary. In fact we've been doing it for years on socket writes
> A> without complaints (through m_getm2()).
>
> mbuf usage isn't limited to sockets. There is code that right now utilizes
> only mbufs and standard clusters, netipsec for example.

Yes, I understand that.

> I'd like to remove a lot of hand-made mbuf allocation in different places
> in the kernel, and this can be done with an M_NOJUMBO flag. I don't have
> time to dig deeper into large chunks of code trying to understand whether
> it is possible to convert them to using PAGE_SIZE clusters or not; I just
> want to reduce the amount of pasted hand allocation.

Reducing the amount of hand allocation is very good.

> We have a very common case where we allocate either an mbuf or an
> mbuf + cluster, depending on size. Everywhere this is done by hand, but it
> can be substituted with m_get2(len, ..., M_NOJUMBO);

I guess what I'm trying to say is that not wanting jumbos > PAGE_SIZE is
normal and shouldn't have to be specified all the time. This makes the API
look like this:

  m_get2(len, ..., 0);  /* without flags I get at most MJUMPAGESIZE */

If someone really, really, really knows what he is doing, he can say he
wants jumbos > PAGE_SIZE returned with M_JUMBOOK or such. However, IMHO
even that shouldn't be offered, and m_getm2() should be used for a chain
instead.
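To make the comparison concrete, a rough sketch of both patterns follows.
This is a sketch only: M_NOJUMBO is the flag proposed in Gleb's patch and
does not exist in the tree; m_gethdr(), m_getcl() and m_get2() are the
existing allocators.

  /* Hand-rolled: pick mbuf vs. mbuf + standard cluster by size. */
  struct mbuf *m;

  if (len > MHLEN)
          m = m_getcl(M_WAITOK, MT_DATA, M_PKTHDR); /* MCLBYTES cluster */
  else
          m = m_gethdr(M_WAITOK, MT_DATA);          /* fits in the mbuf */

  /* Replacement: one call.  The (proposed, not-in-tree) M_NOJUMBO flag
   * would guarantee that no jumbo cluster is ever handed back. */
  m = m_get2(len, M_WAITOK, MT_DATA, M_PKTHDR | M_NOJUMBO);

Under Andre's preferred default, the plain m_get2(len, ..., 0) call would
already be capped at MJUMPAGESIZE, making a flag unnecessary for the
common case.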
> A> However I think that m_get2() should never even attempt to allocate
> A> mbuf clusters larger than PAGE_SIZE. Not even with flags.
> A>
> A> All mbufs > PAGE_SIZE should exclusively and only ever be used by
> A> drivers for NICs with "challenged" DMA engines. Possibly only
> A> available through a dedicated API to prevent all other uses of it.
>
> Have you done any benchmarking that proves that scatter-gather at the
> busdma level is any worse than chaining at the mbuf level?

The problem is different. With our current jumbo mbufs > PAGE_SIZE there
isn't any scatter-gather at the busdma level because they are contiguous
at the physical *and* KVA level. Allocating such jumbo mbufs shifts the
burden of mbuf chains to the VM and pmap layer, which have to come up
with such contiguous stretches of physical memory. This fails quickly
once the machine has seen some activity and memory fragmentation, as
we've observed in recent days even with 96GB of RAM available. It gets
worse the more load the machine has. Which is exactly what we *don't*
want.

> Dealing with an mbuf that is contiguous in virtual memory is handy for
> protocols that look through the entire payload, pfsync for example. I
> guess NFS may also benefit from that.

Of course it is handy. However, that carries other tradeoffs, some
significant, in other parts of the system. And for incoming packets it
depends on the MTU size anyway. For NFS, as far as I've read through the
code today, the control messages tend to be rather small. The vast bulk
of the data is transported between the mbufs and the VFS/filesystem.

> P.S. Ok about the patch?

No. m_getm2() doesn't need the flag at all; PAGE_SIZE mbufs are always
good. Calling m_get2() without a flag should return at most a PAGE_SIZE
mbuf. And the (ab)use of the M_PROTO1|2 flags is icky and may conflict
with protocol-specific uses.

--
Andre
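For reference, the busdma-level scatter-gather Gleb asks about is what a
driver gets automatically when it maps an mbuf chain: each piece of the
chain (at most PAGE_SIZE with such clusters) becomes one or more DMA
segments, so physical contiguity across the whole packet is never
required. A minimal transmit-side sketch using the existing
bus_dmamap_load_mbuf_sg(9) and m_collapse(9) interfaces; the tag, map,
function name and segment limit are hypothetical driver state, not part
of the patch under discussion:

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/bus.h>
  #include <sys/mbuf.h>
  #include <machine/bus.h>

  #define XX_MAX_TX_SEGS  32      /* hypothetical NIC S/G segment limit */

  /*
   * Sketch: map an mbuf chain for transmit DMA.  Each mbuf contributes
   * one or more scatter/gather segments, so a chain of PAGE_SIZE
   * clusters needs no physically contiguous jumbo buffer.
   */
  static int
  xx_map_tx_mbuf(bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **mp)
  {
          bus_dma_segment_t segs[XX_MAX_TX_SEGS];
          int error, nsegs;

          error = bus_dmamap_load_mbuf_sg(tag, map, *mp, segs, &nsegs,
              BUS_DMA_NOWAIT);
          if (error == EFBIG) {
                  /* More segments than the NIC handles: compact chain. */
                  struct mbuf *m = m_collapse(*mp, M_NOWAIT,
                      XX_MAX_TX_SEGS);
                  if (m == NULL)
                          return (ENOBUFS);
                  *mp = m;
                  error = bus_dmamap_load_mbuf_sg(tag, map, *mp, segs,
                      &nsegs, BUS_DMA_NOWAIT);
          }
          return (error);
  }

Whether walking such a chain is slower than one contiguous buffer is
exactly the benchmarking question raised above; the point of the sketch
is only that the DMA engine, not the VM and pmap layer, absorbs the
fragmentation.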