From owner-freebsd-fs@FreeBSD.ORG Tue Mar 25 23:10:36 2014
Date: Tue, 25 Mar 2014 19:10:35 -0400 (EDT)
From: Rick Macklem
To: FreeBSD Filesystems, FreeBSD Net
Cc: Alexander Motin
Message-ID: <1609686124.539328.1395789035334.JavaMail.root@uoguelph.ca>
Subject: RFC: How to fix the NFS/iSCSI vs TSO problem

Hi,

First off, I hope you don't mind that I cross-posted this, but I
wanted to make sure both the NFS/iSCSI and networking types see it.

If you look at this mailing list thread:
  http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
you'll see that several people have been working hard at testing this
and, thanks to them, I think I now know what is going on. (This applies
to network drivers that support TSO and are limited to 32 transmit
segments, i.e. 32 mbufs in the chain.)

Doing a quick search, I found the following drivers that appear to be
affected (I may have missed some):
  jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re

Further, of these drivers, the following use m_collapse() and not
m_defrag() to try to reduce the number of mbufs in the chain. As far
as I can see, m_collapse() is not going to get the 35 mbufs down to
32, so these ones are more badly broken:
  jme, fxp, age, sge, alc, ale, nfe, re

The long description is in the above thread, but the short version is:
- NFS generates a chain with 35 mbufs in it for read/readdir replies
  and write requests, made up of a tcpip header, an RPC header, the
  NFS args and 32 clusters of file data.
- tcp_output() usually trims the data size down to tp->t_tsomax
  (65535) and then some more, to make it an exact multiple of the TCP
  transmit data size.
- The net driver prepends an ethernet header, growing the length by
  14 (or sometimes 18 for vlans), but in the first mbuf, without
  adding another mbuf to the chain.
- m_defrag() copies this to a chain of 32 mbuf clusters (because the
  total data length is <= 64K) and it gets sent.

However, if the data length is a little less than 64K when passed to
tcp_output(), so that the length including headers is in the range
65519->65535:
- tcp_output() doesn't reduce its size.
- The net driver adds an ethernet header, making the total data
  length slightly greater than 64K.
- m_defrag() copies it to a chain of 33 mbuf clusters, which fails
  with EFBIG (the driver pattern is sketched below).
--> This trainwrecks NFS performance, because the TSO segment is
    dropped instead of sent.
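
To make the driver side of this concrete, here is a rough sketch of
the encapsulation pattern the affected drivers use. The names
foo_encap(), foo_softc and FOO_MAXTXSEGS are made up for the example
(they aren't taken from any particular driver); the point is just
where the EFBIG comes from and why m_defrag() helps where
m_collapse() doesn't:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

#define	FOO_MAXTXSEGS	32		/* hardware TX segment limit */

struct foo_softc {
	bus_dma_tag_t	foo_txtag;	/* tag created with nsegments = 32 */
	bus_dmamap_t	foo_txmap;
};

static int
foo_encap(struct foo_softc *sc, struct mbuf **m_head)
{
	bus_dma_segment_t segs[FOO_MAXTXSEGS];
	struct mbuf *m;
	int error, nsegs;

	error = bus_dmamap_load_mbuf_sg(sc->foo_txtag, sc->foo_txmap,
	    *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/*
		 * Too many mbufs in the chain for the hardware.
		 * m_collapse() only squeezes data into the existing
		 * mbufs/clusters, so it can't turn 35 cluster-backed
		 * mbufs into 32.  m_defrag() copies the whole chain
		 * into fresh clusters, so the retry below succeeds
		 * whenever the total length (ethernet header included)
		 * fits in 32 * MCLBYTES.
		 */
		m = m_defrag(*m_head, M_NOWAIT);
		if (m == NULL) {
			m_freem(*m_head);
			*m_head = NULL;
			return (ENOBUFS);
		}
		*m_head = m;
		error = bus_dmamap_load_mbuf_sg(sc->foo_txtag,
		    sc->foo_txmap, *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
	}
	if (error != 0) {
		/*
		 * Still more than 32 segments (i.e. the defragged chain
		 * is 33 clusters because the length exceeds 64K): the
		 * TSO segment gets dropped here.
		 */
		m_freem(*m_head);
		*m_head = NULL;
		return (error);
	}
	/* ... fill the TX descriptors from segs[0..nsegs-1] ... */
	return (0);
}

The drivers listed as using m_collapse() do the same dance but with
m_collapse(*m_head, M_NOWAIT, FOO_MAXTXSEGS), which can't get the 35
cluster-backed mbufs down to 32, so they hit the error path even when
the total length would fit.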

A tester also stated that the problem could be reproduced using iSCSI.
Maybe Edward Napierala knows some details w.r.t. what kind of mbuf
chain iSCSI generates?

Also, one tester has reported that setting if_hw_tsomax in the driver
before the ether_ifattach() call didn't make the value of tp->t_tsomax
smaller. However, reducing IP_MAXPACKET (which is what if_hw_tsomax is
set to by default) did reduce it. I have no idea why this happens or
how to fix it, but it implies that having drivers set if_hw_tsomax
isn't a solution until this is resolved.

So, what to do about this?

First, I'd like a simple fix/workaround that can go into 9.3 (which
hits code freeze in May). The best thing I can think of is setting
if_hw_tsomax to a smaller default value (line# 658 of sys/net/if.c in
head).

Version A: replace
    ifp->if_hw_tsomax = IP_MAXPACKET;
with
    ifp->if_hw_tsomax = min(32 * MCLBYTES -
        (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN), IP_MAXPACKET);
plus replace m_collapse() with m_defrag() in the drivers listed above.

This would only reduce the default from 65535 to 65518, so it only
impacts the uncommon case where the output size (with tcpip header) is
within this range. (As such, I don't think it would have a negative
impact for drivers that can handle more than 32 transmit segments.)
From the testers, it seems that this is sufficient to get rid of the
EFBIG errors. (With the smaller default, the total data length
including the ethernet header doesn't exceed 64K, so m_defrag() fits
it into 32 mbuf clusters; the arithmetic is spelled out at the end of
this message.) The main downside of this is that there will be a lot
of m_defrag() calls being done, and they do quite a bit of bcopy()'ng.

Version B: replace
    ifp->if_hw_tsomax = IP_MAXPACKET;
with
    ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);

This one would avoid the m_defrag() calls, but might have a negative
impact on TSO performance for drivers that can handle 35 transmit
segments, since the maximum TSO segment size is reduced by about 6K.
(Because of the second size reduction to an exact multiple of the TCP
transmit data size, the exact amount varies.)

Possible longer term fixes:

One longer term fix might be to add something like if_hw_tsomaxseg, so
that a driver can set a limit on the number of transmit segments
(mbufs in the chain) and tcp_output() could use that to limit the size
of the TSO segment as required. (I have a first stab at such a patch,
but no way to test it, so I can't see it being done by May. Also, it
would require changes to a lot of drivers to make it work. I've
attached this patch, in case anyone wants to work on it?)

Another might be to increase the size of MCLBYTES. (I don't see this
as practical for 9.3, although the actual change is simple.) I do
think that increasing MCLBYTES might be something to consider doing in
the future, for reasons beyond fixing this.

So, what do others think should be done?

rick
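
For anyone who wants to double check the 65518 value in Version A,
here is the arithmetic as a trivial standalone C program. (The
constants are hard-coded for illustration, it assumes the worst case
of a vlan-tagged frame, and it is not part of any patch.)

#include <stdio.h>

#define	MCLBYTES		2048	/* standard mbuf cluster size */
#define	ETHER_HDR_LEN		14
#define	ETHER_VLAN_ENCAP_LEN	4

/* # of clusters m_defrag() needs once the ethernet header is prepended */
static int
clusters_needed(int tso_len)
{
	int wire_len = tso_len + ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;

	return ((wire_len + MCLBYTES - 1) / MCLBYTES);
}

int
main(void)
{
	/* current default: if_hw_tsomax = IP_MAXPACKET = 65535 */
	printf("65535 byte TSO segment -> %d clusters\n",
	    clusters_needed(65535));
	/* proposed default: 32 * MCLBYTES - 18 = 65518 */
	printf("65518 byte TSO segment -> %d clusters\n",
	    clusters_needed(65518));
	return (0);
}

It prints 33 clusters for 65535 and 32 for 65518, i.e. the smaller
default keeps the defragged chain within the 32 segment limit.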