From owner-freebsd-fs@FreeBSD.ORG Mon Mar 31 23:09:32 2014
Date: Mon, 31 Mar 2014 19:09:24 -0400 (EDT)
From: Rick Macklem
To: pyunyh@gmail.com
Cc: FreeBSD Filesystems, FreeBSD Net, Alexander Motin
Subject: Re: RFC: How to fix the NFS/iSCSI vs TSO problem
Message-ID: <779330717.3747229.1396307364595.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140331023253.GC3548@michelle.cdnetworks.com>

Yonghyeon Pyun wrote:
> On Wed, Mar 26, 2014 at 08:27:48PM -0400, Rick Macklem wrote:
> > pyunyh@gmail.com wrote:
> > > On Tue, Mar 25, 2014 at 07:10:35PM -0400, Rick Macklem wrote:
> > > > Hi,
> > > >
> > > > First off, I hope you don't mind that I cross-posted this, but I
> > > > wanted to make sure both the NFS/iSCSI and networking types see it.
> > > > If you look in this mailing list thread:
> > > > http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
> > > > you'll see that several people have been working hard at testing
> > > > and, thanks to them, I think I now know what is going on.
> > > >
> > > Thanks for your hard work on narrowing down that issue. I'm too busy
> > > with $work these days, so I couldn't find time to investigate the
> > > issue.
> > >
> > > > (This applies to network drivers that support TSO and are limited
> > > > to 32 transmit segments -> 32 mbufs in the chain.) Doing a quick
> > > > search I found the following drivers that appear to be affected
> > > > (I may have missed some):
> > > > jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re
> > > >
> > > The magic number 32 was chosen a long time ago, when I implemented
> > > TSO in the non-Intel drivers. I tried to find an optimal number to
> > > reduce kernel stack usage at that time. bus_dma(9) will coalesce a
> > > segment with the previous one when possible, so I thought the number
> > > 32 was not an issue. I'm not sure the current bus_dma(9) still has
> > > the same code, though.
> > > The number 32 is an arbitrary one, so you can increase it if you
> > > want.
> > >
> > Well, in the case of "ix" Jack Vogel says it is a hardware limitation.
> > I can't change drivers that I can't test and don't know anything about
> > the hardware. Maybe replacing m_collapse() with m_defrag() is an
> > exception, since I know what that is doing and it isn't hardware
> > related, but I would still prefer a review by the driver
> > author/maintainer before making such a change.
> >
> > If there are drivers that you know can be increased from 32->35,
> > please do so, since that will not only avoid the EFBIG failures but
> > also avoid a lot of calls to m_defrag().
> >
> > > > Further, of these drivers, the following use m_collapse() and not
> > > > m_defrag() to try and reduce the # of mbufs in the chain.
> > > > m_collapse() is not going to get the 35 mbufs down to 32 mbufs, as
> > > > far as I can see, so these ones are more badly broken:
> > > > jme, fxp, age, sge, alc, ale, nfe, re
> > > >
> > > I guess m_defrag(9) is more optimized for non-TSO packets. You don't
> > > want to waste CPU cycles copying a full frame just to reduce the
> > > number of mbufs in the chain. For TSO packets m_defrag(9) looks
> > > better, but if we always have to copy a full TSO packet to make TSO
> > > work, driver writers will have to invent a better scheme rather than
> > > blindly relying on m_defrag(9), I guess.
> > >
> > Yes, avoiding m_defrag() calls would be nice. For this issue,
> > increasing the transmit segment limit from 32->35 does that, if the
> > change can be done easily/safely.
> >
> > Otherwise, all I can think of is my suggestion to add something like
> > if_hw_tsomaxseg, which the driver can use to tell tcp_output() the
> > driver's limit on the # of mbufs in the chain.
> >
> > > > The long description is in the above thread, but the short version
> > > > is:
> > > > - NFS generates a chain with 35 mbufs in it (for read/readdir
> > > >   replies and write requests), made up of a tcpip header, an RPC
> > > >   header, the NFS args and 32 clusters of file data.
> > > > - tcp_output() usually trims the data size down to tp->t_tsomax
> > > >   (65535) and then some more, to make it an exact multiple of the
> > > >   TCP transmit data size.
> > > > - The net driver prepends an ethernet header, growing the length
> > > >   by 14 (or sometimes 18 for vlans), but in the first mbuf, not
> > > >   adding one to the chain.
> > > > - m_defrag() copies this to a chain of 32 mbuf clusters (because
> > > >   the total data length is <= 64K) and it gets sent.
> > > >
> > > > However, if the data length is a little less than 64K when passed
> > > > to tcp_output(), so that the length including headers is in the
> > > > range 65519->65535...
> > > > - tcp_output() doesn't reduce its size.
> > > > - The net driver adds an ethernet header, making the total data
> > > >   length slightly greater than 64K.
> > > > - m_defrag() copies it to a chain of 33 mbuf clusters, which fails
> > > >   with EFBIG
> > > > --> trainwrecks NFS performance, because the TSO segment is
> > > >   dropped instead of sent.
> > > >
> > > > A tester also stated that the problem could be reproduced using
> > > > iSCSI. Maybe Edward Napierala might know some details w.r.t. what
> > > > kind of mbuf chain iSCSI generates?
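To make the above sequence concrete for anyone who hasn't read a driver
transmit path: the fallback that generates these EFBIG drops usually looks
roughly like the sketch below. (A minimal illustration only; the xx_ names
and XX_MAXTXSEGS are made up and no real driver is being quoted. Kernel
context assumed: <sys/param.h>, <sys/mbuf.h>, <machine/bus.h>.)

	static int
	xx_encap(struct xx_softc *sc, bus_dmamap_t map, struct mbuf **m_head)
	{
		bus_dma_segment_t segs[XX_MAXTXSEGS];	/* XX_MAXTXSEGS == 32 */
		struct mbuf *m;
		int error, nsegs;

		error = bus_dmamap_load_mbuf_sg(sc->xx_txtag, map, *m_head,
		    segs, &nsegs, BUS_DMA_NOWAIT);
		if (error == EFBIG) {
			/*
			 * More mbufs in the chain than the tag allows.
			 * m_collapse() only squeezes data into adjacent
			 * mbufs, so it can't take 35 clusters down to 32;
			 * m_defrag() bcopy()s the chain into the minimum
			 * number of clusters, which only helps when the
			 * total length (now including the ethernet header)
			 * still fits in 32 clusters, i.e. is <= 64K.
			 */
			m = m_defrag(*m_head, M_NOWAIT);
			if (m == NULL) {
				m_freem(*m_head);
				*m_head = NULL;
				return (ENOBUFS);
			}
			*m_head = m;
			error = bus_dmamap_load_mbuf_sg(sc->xx_txtag, map,
			    *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
		}
		if (error != 0) {
			/* The >64K TSO frame case: the packet is dropped. */
			m_freem(*m_head);
			*m_head = NULL;
			return (error);
		}
		/* ... hand segs[0 .. nsegs-1] to the hardware ... */
		return (0);
	}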
> > > >
> > > > Also, one tester has reported that setting if_hw_tsomax in the
> > > > driver before the ether_ifattach() call didn't make the value of
> > > > tp->t_tsomax smaller. However, reducing IP_MAXPACKET (which is
> > > > what it is set to by default) did reduce it. I have no idea why
> > > > this happens or how to fix it, but it implies that setting
> > > > if_hw_tsomax in the driver isn't a solution until this is
> > > > resolved.
> > > >
> > > > So, what to do about this?
> > > > First, I'd like a simple fix/workaround that can go into 9.3
> > > > (which is code freeze in May). The best thing I can think of is
> > > > setting if_hw_tsomax to a smaller default value. (Line# 658 of
> > > > sys/net/if.c in head.)
> > > >
> > > > Version A:
> > > > replace
> > > >   ifp->if_hw_tsomax = IP_MAXPACKET;
> > > > with
> > > >   ifp->if_hw_tsomax = min(32 * MCLBYTES -
> > > >       (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN), IP_MAXPACKET);
> > > > plus
> > > > replace m_collapse() with m_defrag() in the drivers listed above.
> > > >
> > > > This would only reduce the default from 65535->65518, so it only
> > > > impacts the uncommon case where the output size (with tcpip
> > > > header) is within this range. (As such, I don't think it would
> > > > have a negative impact for drivers that handle more than 32
> > > > transmit segments.)
> > > > From the testers, it seems that this is sufficient to get rid of
> > > > the EFBIG errors. (The total data length including the ethernet
> > > > header doesn't exceed 64K, so m_defrag() fits it into 32 mbuf
> > > > clusters.)
> > > >
> > > > The main downside of this is that there will be a lot of
> > > > m_defrag() calls being done, and they do quite a bit of
> > > > bcopy()'ng.
> > > >
> > > > Version B:
> > > > replace
> > > >   ifp->if_hw_tsomax = IP_MAXPACKET;
> > > > with
> > > >   ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);
> > > >
> > > > This one would avoid the m_defrag() calls, but it might have a
> > > > negative impact on TSO performance for drivers that can handle 35
> > > > transmit segments, since the maximum TSO segment size is reduced
> > > > by about 6K. (Because of the second size reduction to an exact
> > > > multiple of the TCP transmit data size, the exact amount varies.)
> > > >
> > > > Possible longer term fixes:
> > > > One longer term fix might be to add something like
> > > > if_hw_tsomaxseg, so that a driver can set a limit on the number of
> > > > transmit segments (mbufs in the chain) and tcp_output() could use
> > > > that to limit the size of the TSO segment, as required. (I have a
> > > > first stab at such a patch, but no way to test it, so I can't see
> > > > that being done by May. Also, it would require changes to a lot of
> > > > drivers to make it work. I've attached this patch, in case anyone
> > > > wants to work on it.)
> > > >
> > > > Another might be to increase the size of MCLBYTES. (I don't see
> > > > this as practical for 9.3, although the actual change is simple.)
> > > > I do think that increasing MCLBYTES might be something to consider
> > > > doing in the future, for reasons beyond fixing this.
> > > >
> > > > So, what do others think should be done?
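To spell out the arithmetic behind the two versions (using the stock
FreeBSD constants: MCLBYTES = 2048, ETHER_HDR_LEN = 14,
ETHER_VLAN_ENCAP_LEN = 4, IP_MAXPACKET = 65535):

	Version A: 32 * MCLBYTES - (14 + 4) = 65536 - 18 = 65518
	           min(65518, 65535)        = 65518

so the largest frame handed to the driver is 65518 + 18 = 65536 bytes,
which m_defrag() can always copy into exactly 32 clusters.

	Version B: 29 * MCLBYTES     = 59392
	           min(59392, 65535) = 59392

caps the TSO data at roughly 29 clusters, so the 35-mbuf NFS chain becomes
about 3 header mbufs + 29 clusters = 32 mbufs, and the driver never has to
call m_defrag() at all.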
> > > > rick
> > > >
> > > AFAIK none of the TSO capable drivers you mentioned above has a
> > > hardware limit on the number of TX segments in the TSO path. Not
> > > sure about the Intel controllers, though. Increasing the number of
> > > segments will consume lots of kernel stack in those drivers. Given
> > > that ixgbe, which seems to use 100, didn't show any kernel stack
> > > shortage, I think bumping the number of segments would be a quick
> > > way to address the issue.
> > >
> > Well, bumping it from 32->35 is all it would take for NFS (I can't
> > comment w.r.t. iSCSI). ixgbe uses 100 for the 82598 chip and 32 for
> > the 82599 (just so others aren't confused by the above comment). I
> > understand your point was w.r.t. using 100 without blowing the kernel
> > stack, but since the testers have been using "ix" with the 82599
> > chip, which is limited to 32 transmit segments...
> >
> > However, please increase any you know can be safely done from 32->35,
> > rick
>
> Done in r263957.
>
Thanks, rick

ps: I've pinged the guys who seem to be maintaining bge, bce and bxe,
since they all have the problem, too.
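pps: for anyone who wants to experiment with the longer term
if_hw_tsomaxseg idea, the tcp_output() side of it would be a clamp along
these lines. (A rough sketch of the idea only, not the patch attached
above; t_tsomaxseg and hdr_mbufs are hypothetical names, and len/max_len
are tcp_output()'s usual length variables.)

	/*
	 * If the driver advertised a transmit segment limit, don't build
	 * a TSO send that needs more mbufs than the driver can map.
	 * Assume the worst case of one cluster (MCLBYTES) per data mbuf
	 * and leave room for the mbufs holding the protocol headers
	 * (3 in the NFS case above: tcpip header, RPC header, NFS args).
	 */
	if (tp->t_tsomaxseg != 0 && tp->t_tsomaxseg > hdr_mbufs) {
		max_len = (tp->t_tsomaxseg - hdr_mbufs) * MCLBYTES;
		if (len > max_len)
			len = max_len;
	}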