Date: Tue, 18 Mar 2014 19:57:37 -0400 (EDT)
From: Rick Macklem
To: Alexander Motin
Cc: FreeBSD Filesystems
Message-ID: <2092082855.24699674.1395187057807.JavaMail.root@uoguelph.ca>
In-Reply-To: <5328065D.60201@FreeBSD.org>
Subject: Re: review/test: NFS patch to use pagesize mbuf clusters

Alexander Motin wrote:
> Hi.
>
> On 18.03.2014 03:26, Rick Macklem wrote:
> > Several of the TSO capable network interfaces have a limit of
> > 32 mbufs in the transmit mbuf chain (the drivers call these
> > transmit segments, which I admit I find confusing).
> >
> > For a 64K read/readdir reply or 64K write request, NFS passes a
> > list of 34 mbufs down to TCP. TCP will split the list, since it
> > is slightly more than 64K bytes, but that split will normally be
> > a copy by reference of the last mbuf cluster. As such, the
> > network interface will normally get a list of 34 mbufs.
> >
> > For TSO enabled interfaces that are limited to 32 mbufs in the
> > list, the usual workaround in the driver is to copy { real copy,
> > not copy by reference } the list to 32 mbuf clusters via
> > m_defrag(). (A few drivers use m_collapse(), which is less
> > likely to succeed.)
> >
> > As a workaround to this problem, the attached patch modifies NFS
> > to use larger pagesize clusters, so that the 64K RPC message is
> > in 18 mbufs (assuming a 4K pagesize).
> >
> > Testing on my slow hardware, which does not have TSO capability,
> > shows it to be performance neutral, but I believe avoiding the
> > overhead of copying via m_defrag() { and possible failures
> > resulting in the message never being transmitted } makes this
> > patch worth doing.
> >
> > As such, I'd like to request review and/or testing of this patch
> > by anyone who can do so.
>
> First, I tried to find a respective NIC to test: cxgb/cxgbe have a
> limit of 36, and so are probably unaffected; ixgb -- 100, igb -- 64;
> only on em did I find a limit of 32.
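For anyone following the thread who hasn't looked at a driver's
transmit path, the pattern we're talking about looks roughly like the
sketch below. This is only an illustration of the common bus_dma
fallback, not code from any particular driver, and the function name
xx_encap_sketch() is made up:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

/*
 * Sketch of the usual TSO transmit fallback: if the mbuf chain has
 * more segments than the hardware allows, the DMA load fails with
 * EFBIG and the driver copies the chain (a real copy of the data)
 * into fewer clusters via m_defrag(), then tries the load again.
 */
static int
xx_encap_sketch(bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **m_headp,
    bus_dma_segment_t *segs, int *nsegs)
{
        struct mbuf *m;
        int error;

        error = bus_dmamap_load_mbuf_sg(tag, map, *m_headp, segs, nsegs,
            BUS_DMA_NOWAIT);
        if (error == EFBIG) {
                /* Too many segments: defragment into fewer clusters. */
                m = m_defrag(*m_headp, M_NOWAIT);
                if (m == NULL) {
                        /* The failure case that ends up dropping the packet. */
                        m_freem(*m_headp);
                        *m_headp = NULL;
                        return (ENOBUFS);
                }
                *m_headp = m;
                error = bus_dmamap_load_mbuf_sg(tag, map, *m_headp, segs,
                    nsegs, BUS_DMA_NOWAIT);
        }
        return (error);
}

When m_defrag() returns NULL under memory pressure, the packet is
simply dropped, which is the failure mode I get into below.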
When I did a find/grep on sys/dev, I found a bunch of them, but I
didn't save the output. The case that came up was virtio, and the
author fixed that driver, since there was no hardware limitation.
The "ix" driver (in sys/dev/ixgbe) is an example for some chips; I
believe the 82599 chips have the 32 limit.

> I ran several profiles on the em NIC with and without the patch. I
> can confirm that without the patch m_defrag() is indeed called,
> while with the patch it is not called any more. But the profiler
> shows me that only a very small amount of time (a percent or even a
> fraction of one) is spent there. I can't measure the effect (my
> Core-i7 desktop test system has only about 5% CPU load while serving
> full 1Gbps NFS over the em), though I can't say for sure that the
> effect isn't there on some low-end system.
>
Well, since m_defrag() creates a new list and bcopy()s the data,
there is some overhead, although I'm not surprised it isn't that easy
to measure. (I thought your server built entirely of SSDs might show
a difference.)

I am more concerned with the possibility of m_defrag() failing and
the driver dropping the reply, forcing the client to establish a
fresh TCP connection and retry the RPC after a long timeout (1 minute
or more). This will show up as "terrible performance" for users.
Also, some drivers use m_collapse() instead of m_defrag(), and these
will probably be "train wrecks". I get cases where reports of serious
NFS problems get "fixed" by disabling TSO, and I was hoping this
patch would work around that.

> I am also not very sure about replacing M_WAITOK with M_NOWAIT.
> Instead of waiting a bit while the VM finds a cluster, NFSMCLGET()
> will return a single mbuf; as a result, instead of replacing the
> chain of 2K clusters with 4K ones, you could end up with a chain of
> 256-byte mbufs.
>
I hoped the comment in the patch would explain this. When I was
testing (on a small i386 system), I succeeded in getting threads
stuck sleeping on "btalloc" a couple of times when I used M_WAITOK
for m_getjcl(). As far as I could see, this indicated that it had run
out of kernel address space, but I'm not sure.
--> That is why I used M_NOWAIT for m_getjcl().

As for using MCLGET(..M_NOWAIT), the main reason for doing that was
that I noticed the code does a drain on zone_mcluster if the
allocation attempt for a cluster fails. For some reason, m_getcl()
and m_getjcl() do not do this drain of the zone? I thought the drain
might help in memory constrained cases. To be honest, I've never been
able to get a MCLGET(..M_NOWAIT) to fail during testing.

rick

> --
> Alexander Motin
>
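ps: Since the M_WAITOK vs. M_NOWAIT question keeps coming up, here is
roughly the shape of the allocation fallback I'm describing. This is
just a sketch of the idea, not the literal patch (the NFS code
actually goes through NFSMCLGET()), and the function name
nfsm_getpagecl_sketch() is made up:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>

/*
 * Sketch: try for a pagesize (4K) cluster without sleeping, so the
 * thread never blocks on "btalloc" waiting for kernel address space.
 * If that fails, fall back to a 2K cluster via MCLGET(..M_NOWAIT),
 * which (as noted above) drains the cluster zone when the allocation
 * fails.  If even that fails, the caller just gets a plain mbuf with
 * 256 bytes of data space.
 */
static struct mbuf *
nfsm_getpagecl_sketch(void)
{
        struct mbuf *m;

        m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
        if (m != NULL)
                return (m);

        m = m_get(M_WAITOK, MT_DATA);
        MCLGET(m, M_NOWAIT);
        /* If M_EXT is not set here, the caller gets the plain mbuf. */
        return (m);
}

In practice I've never seen the MCLGET(..M_NOWAIT) step fail during
testing, so the plain-mbuf fallback is mostly a safety net.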