From owner-freebsd-fs@FreeBSD.ORG Tue Mar 25 23:10:36 2014
Date: Tue, 25 Mar 2014 19:10:35 -0400 (EDT)
From: Rick Macklem
To: FreeBSD Filesystems, FreeBSD Net
Cc: Alexander Motin
Message-ID: <1609686124.539328.1395789035334.JavaMail.root@uoguelph.ca>
Subject: RFC: How to fix the NFS/iSCSI vs TSO problem

Hi,

First off, I hope you don't mind that I cross-posted this, but I
wanted to make sure both the NFS/iSCSI and networking types see it.

If you look at this mailing list thread:
  http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
you'll see that several people have been working hard at testing this
and, thanks to them, I think I now know what is going on. (This applies
to network drivers that support TSO and are limited to 32 transmit
segments, i.e. 32 mbufs in the chain.)

Doing a quick search, I found the following drivers that appear to be
affected (I may have missed some):
  jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re

Further, of these drivers, the following use m_collapse() and not
m_defrag() to try to reduce the number of mbufs in the chain. As far
as I can see, m_collapse() is not going to get the 35 mbufs down to
32, so these ones are more badly broken:
  jme, fxp, age, sge, alc, ale, nfe, re

The long description is in the above thread, but the short version is:
- NFS generates a chain with 35 mbufs in it for read/readdir replies
  and write requests, made up of a tcpip header, an RPC header, the
  NFS args and 32 clusters of file data.
- tcp_output() usually trims the data size down to tp->t_tsomax
  (65535) and then some more, to make it an exact multiple of the TCP
  transmit data size.
- The net driver prepends an ethernet header, growing the length by
  14 (or sometimes 18 for vlans), but in the first mbuf, without
  adding another mbuf to the chain.
- m_defrag() copies this to a chain of 32 mbuf clusters (because the
  total data length is <= 64K) and it gets sent.

However, if the data length is a little less than 64K when passed to
tcp_output(), so that the length including headers is in the range
65519->65535:
- tcp_output() doesn't reduce its size.
- The net driver adds an ethernet header, making the total data
  length slightly greater than 64K.
- m_defrag() copies it to a chain of 33 mbuf clusters, which fails
  with EFBIG (the driver pattern is sketched below).
--> This trainwrecks NFS performance, because the TSO segment is
    dropped instead of sent.
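
To make the driver side of this concrete, here is a rough sketch of
the encapsulation pattern the affected drivers use. The names
foo_encap(), foo_softc and FOO_MAXTXSEGS are made up for the example
(they aren't taken from any particular driver); the point is just
where the EFBIG comes from and why m_defrag() helps where
m_collapse() doesn't:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

#define	FOO_MAXTXSEGS	32		/* hardware TX segment limit */

struct foo_softc {
	bus_dma_tag_t	foo_txtag;	/* tag created with nsegments = 32 */
	bus_dmamap_t	foo_txmap;
};

static int
foo_encap(struct foo_softc *sc, struct mbuf **m_head)
{
	bus_dma_segment_t segs[FOO_MAXTXSEGS];
	struct mbuf *m;
	int error, nsegs;

	error = bus_dmamap_load_mbuf_sg(sc->foo_txtag, sc->foo_txmap,
	    *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/*
		 * Too many mbufs in the chain for the hardware.
		 * m_collapse() only squeezes data into the existing
		 * mbufs/clusters, so it can't turn 35 cluster-backed
		 * mbufs into 32.  m_defrag() copies the whole chain
		 * into fresh clusters, so the retry below succeeds
		 * whenever the total length (ethernet header included)
		 * fits in 32 * MCLBYTES.
		 */
		m = m_defrag(*m_head, M_NOWAIT);
		if (m == NULL) {
			m_freem(*m_head);
			*m_head = NULL;
			return (ENOBUFS);
		}
		*m_head = m;
		error = bus_dmamap_load_mbuf_sg(sc->foo_txtag,
		    sc->foo_txmap, *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
	}
	if (error != 0) {
		/*
		 * Still more than 32 segments (i.e. the defragged chain
		 * is 33 clusters because the length exceeds 64K): the
		 * TSO segment gets dropped here.
		 */
		m_freem(*m_head);
		*m_head = NULL;
		return (error);
	}
	/* ... fill the TX descriptors from segs[0..nsegs-1] ... */
	return (0);
}

The drivers listed as using m_collapse() do the same dance but with
m_collapse(*m_head, M_NOWAIT, FOO_MAXTXSEGS), which can't get the 35
cluster-backed mbufs down to 32, so they hit the error path even when
the total length would fit.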

A tester also stated that the problem could be reproduced using iSCSI.
Maybe Edward Napierala knows some details w.r.t. what kind of mbuf
chain iSCSI generates?

Also, one tester has reported that setting if_hw_tsomax in the driver
before the ether_ifattach() call didn't make the value of tp->t_tsomax
smaller. However, reducing IP_MAXPACKET (which is what if_hw_tsomax is
set to by default) did reduce it. I have no idea why this happens or
how to fix it, but it implies that having drivers set if_hw_tsomax
isn't a solution until this is resolved.

So, what to do about this?

First, I'd like a simple fix/workaround that can go into 9.3 (which
hits code freeze in May). The best thing I can think of is setting
if_hw_tsomax to a smaller default value (line# 658 of sys/net/if.c in
head).

Version A: replace
    ifp->if_hw_tsomax = IP_MAXPACKET;
with
    ifp->if_hw_tsomax = min(32 * MCLBYTES -
        (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN), IP_MAXPACKET);
plus replace m_collapse() with m_defrag() in the drivers listed above.

This would only reduce the default from 65535 to 65518, so it only
impacts the uncommon case where the output size (with tcpip header) is
within this range. (As such, I don't think it would have a negative
impact for drivers that can handle more than 32 transmit segments.)
From the testers, it seems that this is sufficient to get rid of the
EFBIG errors. (With the smaller default, the total data length
including the ethernet header doesn't exceed 64K, so m_defrag() fits
it into 32 mbuf clusters; the arithmetic is spelled out at the end of
this message.) The main downside of this is that there will be a lot
of m_defrag() calls being done, and they do quite a bit of bcopy()'ng.

Version B: replace
    ifp->if_hw_tsomax = IP_MAXPACKET;
with
    ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);

This one would avoid the m_defrag() calls, but might have a negative
impact on TSO performance for drivers that can handle 35 transmit
segments, since the maximum TSO segment size is reduced by about 6K.
(Because of the second size reduction to an exact multiple of the TCP
transmit data size, the exact amount varies.)

Possible longer term fixes:

One longer term fix might be to add something like if_hw_tsomaxseg, so
that a driver can set a limit on the number of transmit segments
(mbufs in the chain) and tcp_output() could use that to limit the size
of the TSO segment as required. (I have a first stab at such a patch,
but no way to test it, so I can't see it being done by May. Also, it
would require changes to a lot of drivers to make it work. I've
attached this patch, in case anyone wants to work on it?)

Another might be to increase the size of MCLBYTES. (I don't see this
as practical for 9.3, although the actual change is simple.) I do
think that increasing MCLBYTES might be something to consider doing in
the future, for reasons beyond fixing this.

So, what do others think should be done?

rick
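
For anyone who wants to double check the 65518 value in Version A,
here is the arithmetic as a trivial standalone C program. (The
constants are hard-coded for illustration, it assumes the worst case
of a vlan-tagged frame, and it is not part of any patch.)

#include <stdio.h>

#define	MCLBYTES		2048	/* standard mbuf cluster size */
#define	ETHER_HDR_LEN		14
#define	ETHER_VLAN_ENCAP_LEN	4

/* # of clusters m_defrag() needs once the ethernet header is prepended */
static int
clusters_needed(int tso_len)
{
	int wire_len = tso_len + ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;

	return ((wire_len + MCLBYTES - 1) / MCLBYTES);
}

int
main(void)
{
	/* current default: if_hw_tsomax = IP_MAXPACKET = 65535 */
	printf("65535 byte TSO segment -> %d clusters\n",
	    clusters_needed(65535));
	/* proposed default: 32 * MCLBYTES - 18 = 65518 */
	printf("65518 byte TSO segment -> %d clusters\n",
	    clusters_needed(65518));
	return (0);
}

It prints 33 clusters for 65535 and 32 for 65518, i.e. the smaller
default keeps the defragged chain within the 32 segment limit.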