From owner-freebsd-fs@FreeBSD.ORG Mon Mar 31 23:09:32 2014
Date: Mon, 31 Mar 2014 19:09:24 -0400 (EDT)
From: Rick Macklem
To: pyunyh@gmail.com
Cc: FreeBSD Filesystems, FreeBSD Net, Alexander Motin
Subject: Re: RFC: How to fix the NFS/iSCSI vs TSO problem
Message-ID: <779330717.3747229.1396307364595.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140331023253.GC3548@michelle.cdnetworks.com>

Yonghyeon Pyun wrote:
> On Wed, Mar 26, 2014 at 08:27:48PM -0400, Rick Macklem wrote:
> > pyunyh@gmail.com wrote:
> > > On Tue, Mar 25, 2014 at 07:10:35PM -0400, Rick Macklem wrote:
> > > > Hi,
> > > >
> > > > First off, I hope you don't mind that I cross-posted this, but I
> > > > wanted to make sure both the NFS/iSCSI and networking types see it.
> > > > If you look in this mailing list thread:
> > > > http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
> > > > you'll see that several people have been working hard at testing
> > > > and, thanks to them, I think I now know what is going on.
> > > >
> > > Thanks for your hard work on narrowing down that issue. I'm too busy
> > > with $work these days, so I couldn't find time to investigate the
> > > issue.
> > >
> > > > (This applies to network drivers that support TSO and are limited
> > > > to 32 transmit segments -> 32 mbufs in the chain.) Doing a quick
> > > > search I found the following drivers that appear to be affected
> > > > (I may have missed some):
> > > > jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re
> > > >
> > > The magic number 32 was chosen a long time ago, when I implemented
> > > TSO in the non-Intel drivers. I tried to find an optimal number to
> > > reduce kernel stack usage at that time. bus_dma(9) will coalesce a
> > > segment with the previous one when possible, so I thought the number
> > > 32 was not an issue. I'm not sure the current bus_dma(9) still has
> > > the same code, though.
> > > The number 32 is an arbitrary one, so you can increase it if you
> > > want.
> > >
> > Well, in the case of "ix" Jack Vogel says it is a hardware limitation.
> > I can't change drivers that I can't test and don't know anything about
> > the hardware. Maybe replacing m_collapse() with m_defrag() is an
> > exception, since I know what that is doing and it isn't hardware
> > related, but I would still prefer a review by the driver
> > author/maintainer before making such a change.
> >
> > If there are drivers that you know can be increased from 32->35,
> > please do so, since that will not only avoid the EFBIG failures but
> > also avoid a lot of calls to m_defrag().
> >
> > > > Further, of these drivers, the following use m_collapse() and not
> > > > m_defrag() to try and reduce the # of mbufs in the chain.
> > > > m_collapse() is not going to get the 35 mbufs down to 32 mbufs, as
> > > > far as I can see, so these ones are more badly broken:
> > > > jme, fxp, age, sge, alc, ale, nfe, re
> > > >
> > > I guess m_defrag(9) is more optimized for non-TSO packets. You don't
> > > want to waste CPU cycles copying a full frame just to reduce the
> > > number of mbufs in the chain. For TSO packets m_defrag(9) looks
> > > better, but if we always have to copy a full TSO packet to make TSO
> > > work, driver writers will have to invent a better scheme rather than
> > > blindly relying on m_defrag(9), I guess.
> > >
> > Yes, avoiding m_defrag() calls would be nice. For this issue,
> > increasing the transmit segment limit from 32->35 does that, if the
> > change can be done easily/safely.
> >
> > Otherwise, all I can think of is my suggestion to add something like
> > if_hw_tsomaxseg, which the driver can use to tell tcp_output() the
> > driver's limit on the # of mbufs in the chain.
> >
> > > > The long description is in the above thread, but the short version
> > > > is:
> > > > - NFS generates a chain with 35 mbufs in it (for read/readdir
> > > >   replies and write requests), made up of a tcpip header, an RPC
> > > >   header, the NFS args and 32 clusters of file data.
> > > > - tcp_output() usually trims the data size down to tp->t_tsomax
> > > >   (65535) and then some more, to make it an exact multiple of the
> > > >   TCP transmit data size.
> > > > - The net driver prepends an ethernet header, growing the length
> > > >   by 14 (or sometimes 18 for vlans), but in the first mbuf, not
> > > >   adding one to the chain.
> > > > - m_defrag() copies this to a chain of 32 mbuf clusters (because
> > > >   the total data length is <= 64K) and it gets sent.
> > > >
> > > > However, if the data length is a little less than 64K when passed
> > > > to tcp_output(), so that the length including headers is in the
> > > > range 65519->65535...
> > > > - tcp_output() doesn't reduce its size.
> > > > - The net driver adds an ethernet header, making the total data
> > > >   length slightly greater than 64K.
> > > > - m_defrag() copies it to a chain of 33 mbuf clusters, which fails
> > > >   with EFBIG
> > > > --> trainwrecks NFS performance, because the TSO segment is
> > > >   dropped instead of sent.
> > > >
> > > > A tester also stated that the problem could be reproduced using
> > > > iSCSI. Maybe Edward Napierala might know some details w.r.t. what
> > > > kind of mbuf chain iSCSI generates?
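To make the above sequence concrete for anyone who hasn't read a driver
transmit path: the fallback that generates these EFBIG drops usually looks
roughly like the sketch below. (A minimal illustration only; the xx_ names
and XX_MAXTXSEGS are made up and no real driver is being quoted. Kernel
context assumed: <sys/param.h>, <sys/mbuf.h>, <machine/bus.h>.)

	static int
	xx_encap(struct xx_softc *sc, bus_dmamap_t map, struct mbuf **m_head)
	{
		bus_dma_segment_t segs[XX_MAXTXSEGS];	/* XX_MAXTXSEGS == 32 */
		struct mbuf *m;
		int error, nsegs;

		error = bus_dmamap_load_mbuf_sg(sc->xx_txtag, map, *m_head,
		    segs, &nsegs, BUS_DMA_NOWAIT);
		if (error == EFBIG) {
			/*
			 * More mbufs in the chain than the tag allows.
			 * m_collapse() only squeezes data into adjacent
			 * mbufs, so it can't take 35 clusters down to 32;
			 * m_defrag() bcopy()s the chain into the minimum
			 * number of clusters, which only helps when the
			 * total length (now including the ethernet header)
			 * still fits in 32 clusters, i.e. is <= 64K.
			 */
			m = m_defrag(*m_head, M_NOWAIT);
			if (m == NULL) {
				m_freem(*m_head);
				*m_head = NULL;
				return (ENOBUFS);
			}
			*m_head = m;
			error = bus_dmamap_load_mbuf_sg(sc->xx_txtag, map,
			    *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
		}
		if (error != 0) {
			/* The >64K TSO frame case: the packet is dropped. */
			m_freem(*m_head);
			*m_head = NULL;
			return (error);
		}
		/* ... hand segs[0 .. nsegs-1] to the hardware ... */
		return (0);
	}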
> > > >
> > > > Also, one tester has reported that setting if_hw_tsomax in the
> > > > driver before the ether_ifattach() call didn't make the value of
> > > > tp->t_tsomax smaller. However, reducing IP_MAXPACKET (which is
> > > > what it is set to by default) did reduce it. I have no idea why
> > > > this happens or how to fix it, but it implies that setting
> > > > if_hw_tsomax in the driver isn't a solution until this is
> > > > resolved.
> > > >
> > > > So, what to do about this?
> > > > First, I'd like a simple fix/workaround that can go into 9.3
> > > > (which is code freeze in May). The best thing I can think of is
> > > > setting if_hw_tsomax to a smaller default value. (Line# 658 of
> > > > sys/net/if.c in head.)
> > > >
> > > > Version A:
> > > > replace
> > > >   ifp->if_hw_tsomax = IP_MAXPACKET;
> > > > with
> > > >   ifp->if_hw_tsomax = min(32 * MCLBYTES -
> > > >       (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN), IP_MAXPACKET);
> > > > plus
> > > > replace m_collapse() with m_defrag() in the drivers listed above.
> > > >
> > > > This would only reduce the default from 65535->65518, so it only
> > > > impacts the uncommon case where the output size (with tcpip
> > > > header) is within this range. (As such, I don't think it would
> > > > have a negative impact for drivers that handle more than 32
> > > > transmit segments.)
> > > > From the testers, it seems that this is sufficient to get rid of
> > > > the EFBIG errors. (The total data length including the ethernet
> > > > header doesn't exceed 64K, so m_defrag() fits it into 32 mbuf
> > > > clusters.)
> > > >
> > > > The main downside of this is that there will be a lot of
> > > > m_defrag() calls being done, and they do quite a bit of
> > > > bcopy()'ng.
> > > >
> > > > Version B:
> > > > replace
> > > >   ifp->if_hw_tsomax = IP_MAXPACKET;
> > > > with
> > > >   ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);
> > > >
> > > > This one would avoid the m_defrag() calls, but it might have a
> > > > negative impact on TSO performance for drivers that can handle 35
> > > > transmit segments, since the maximum TSO segment size is reduced
> > > > by about 6K. (Because of the second size reduction to an exact
> > > > multiple of the TCP transmit data size, the exact amount varies.)
> > > >
> > > > Possible longer term fixes:
> > > > One longer term fix might be to add something like
> > > > if_hw_tsomaxseg, so that a driver can set a limit on the number of
> > > > transmit segments (mbufs in the chain) and tcp_output() could use
> > > > that to limit the size of the TSO segment, as required. (I have a
> > > > first stab at such a patch, but no way to test it, so I can't see
> > > > that being done by May. Also, it would require changes to a lot of
> > > > drivers to make it work. I've attached this patch, in case anyone
> > > > wants to work on it.)
> > > >
> > > > Another might be to increase the size of MCLBYTES. (I don't see
> > > > this as practical for 9.3, although the actual change is simple.)
> > > > I do think that increasing MCLBYTES might be something to consider
> > > > doing in the future, for reasons beyond fixing this.
> > > >
> > > > So, what do others think should be done?
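To spell out the arithmetic behind the two versions (using the stock
FreeBSD constants: MCLBYTES = 2048, ETHER_HDR_LEN = 14,
ETHER_VLAN_ENCAP_LEN = 4, IP_MAXPACKET = 65535):

	Version A: 32 * MCLBYTES - (14 + 4) = 65536 - 18 = 65518
	           min(65518, 65535)        = 65518

so the largest frame handed to the driver is 65518 + 18 = 65536 bytes,
which m_defrag() can always copy into exactly 32 clusters.

	Version B: 29 * MCLBYTES     = 59392
	           min(59392, 65535) = 59392

caps the TSO data at roughly 29 clusters, so the 35-mbuf NFS chain becomes
about 3 header mbufs + 29 clusters = 32 mbufs, and the driver never has to
call m_defrag() at all.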
> > > > rick
> > > >
> > > AFAIK none of the TSO capable drivers you mentioned above has a
> > > hardware limit on the number of TX segments in the TSO path. Not
> > > sure about the Intel controllers, though. Increasing the number of
> > > segments will consume lots of kernel stack in those drivers. Given
> > > that ixgbe, which seems to use 100, didn't show any kernel stack
> > > shortage, I think bumping the number of segments would be a quick
> > > way to address the issue.
> > >
> > Well, bumping it from 32->35 is all it would take for NFS (I can't
> > comment w.r.t. iSCSI). ixgbe uses 100 for the 82598 chip and 32 for
> > the 82599 (just so others aren't confused by the above comment). I
> > understand your point was w.r.t. using 100 without blowing the kernel
> > stack, but since the testers have been using "ix" with the 82599
> > chip, which is limited to 32 transmit segments...
> >
> > However, please increase any you know can be safely done from 32->35,
> > rick
>
> Done in r263957.
>
Thanks, rick

ps: I've pinged the guys who seem to be maintaining bge, bce and bxe,
since they all have the problem, too.
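pps: for anyone who wants to experiment with the longer term
if_hw_tsomaxseg idea, the tcp_output() side of it would be a clamp along
these lines. (A rough sketch of the idea only, not the patch attached
above; t_tsomaxseg and hdr_mbufs are hypothetical names, and len/max_len
are tcp_output()'s usual length variables.)

	/*
	 * If the driver advertised a transmit segment limit, don't build
	 * a TSO send that needs more mbufs than the driver can map.
	 * Assume the worst case of one cluster (MCLBYTES) per data mbuf
	 * and leave room for the mbufs holding the protocol headers
	 * (3 in the NFS case above: tcpip header, RPC header, NFS args).
	 */
	if (tp->t_tsomaxseg != 0 && tp->t_tsomaxseg > hdr_mbufs) {
		max_len = (tp->t_tsomaxseg - hdr_mbufs) * MCLBYTES;
		if (len > max_len)
			len = max_len;
	}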