From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 00:58:27 2014
From: Yonghyeon PYUN
Date: Tue, 28 Jan 2014 09:58:18 +0900
To: Rick Macklem
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <20140128005818.GB2722@michelle.cdnetworks.com>
References: <20140127055047.GA1368@michelle.cdnetworks.com>
 <1168237133.17228249.1390865239175.JavaMail.root@uoguelph.ca>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="CE+1k2dSO48ffgeK"
Content-Disposition: inline
In-Reply-To: <1168237133.17228249.1390865239175.JavaMail.root@uoguelph.ca>
User-Agent: Mutt/1.4.2.3i
Cc: Daniel Braniss, freebsd-net@freebsd.org, Adam McDougall
Reply-To: pyunyh@gmail.com

--CE+1k2dSO48ffgeK
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> pyunyh@gmail.com wrote:
> > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > > Adam McDougall wrote:
> > > > Also try rsize=32768,wsize=32768 in your mount options, made a
> > > > huge difference for me.  I've noticed slow file transfers on
> > > > NFS in 9 and finally did some searching a couple months ago;
> > > > someone suggested it and they were on to something.
> > > >
> > > I have a "hunch" that might explain why 64K NFS reads/writes
> > > perform poorly for some network environments.
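
Just to put rough numbers behind that hunch (an untested
back-of-the-envelope sketch only; the two header mbufs are just my
guess to make the arithmetic line up with the 34/18 counts you
mention below):

#include <stdio.h>

/*
 * Approximate mbuf chain length for a 64KB NFS read reply/write
 * request.  64KB of data in 2KB (MCLBYTES) clusters needs 32
 * clusters; a couple of extra mbufs carry the RPC/NFS headers, which
 * is roughly how you get to ~34 mbufs and ~65680 bytes.  With 4KB
 * (MJUMPAGESIZE) clusters the same data fits in 16 clusters, hence
 * ~18 mbufs.
 */
int
main(void)
{
	const int data = 64 * 1024;	/* NFS rsize/wsize */
	const int mcl = 2 * 1024;	/* MCLBYTES */
	const int mjumpage = 4 * 1024;	/* MJUMPAGESIZE (4K pages) */
	const int hdr_mbufs = 2;	/* assumed header mbufs */

	printf("2K clusters: %d mbufs\n", data / mcl + hdr_mbufs);
	printf("4K clusters: %d mbufs\n", data / mjumpage + hdr_mbufs);
	return (0);
}
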
> > > A 64K NFS read reply/write request consists of a list of 34
> > > mbufs when passed to TCP via sosend() and a total data length of
> > > around 65680 bytes.
> > > Looking at a couple of drivers (virtio and ixgbe), they seem to
> > > expect no more than 32-33 mbufs in a list for a 65535 byte TSO
> > > xmit.  I think (I don't have anything that does TSO to confirm
> > > this) that NFS will pass a list that is longer (34 plus a TCP/IP
> > > header).
> > > At a glance, it appears that the drivers call m_defrag() or
> > > m_collapse() when the mbuf list won't fit in their scatter table
> > > (32 or 33 elements) and, if this fails, just silently drop the
> > > data without sending it.
> > > If I'm right, there would be considerable overhead from
> > > m_defrag()/m_collapse() and near disaster if they fail to fix the
> > > problem and the data is silently dropped instead of xmited.
> > >
> > I think the actual number of DMA segments allocated for the mbuf
> > chain is determined by bus_dma(9).  bus_dma(9) will coalesce the
> > current segment with the previous segment if possible.
> >
> Ok, I'll have to take a look, but I thought that an array sized by
> "num_segs" is passed in as an argument.  (And num_segs is set to
> either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
> It looked to me that the ixgbe driver called itself ix, so it isn't
> obvious to me which we are talking about.  (I know that Daniel
> Braniss had an ix0 and ix1, which were fixed for NFS by disabling
> TSO.)
>

It's ix(4).  ixgbe(4) is a different driver.

> I'll admit I mostly looked at virtio's network driver, since that
> was the one being used by J David.
>
> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been
> cropping up for quite a while, and I am just trying to find out why.
> (I have no hardware/software that exhibits the problem, so I can
> only look at the sources and ask others to try testing stuff.)
>
> > I'm not sure whether you're referring to ixgbe(4) or ix(4), but I
> > see that the total length of all segments in ix(4) is limited to
> > 65535, so it has no room for the ethernet/VLAN header of the mbuf
> > chain.  The driver should be fixed to transmit a 64KB datagram.
>
> Well, if_hw_tsomax is set to 65535 by the generic code (the driver
> doesn't set it) and the code in tcp_output() seems to subtract the
> size of a tcp/ip header from that before passing data to the driver,
> so I think the mbuf chain passed to the driver will fit in one
> ip datagram.  (I'd assume all sorts of stuff would break for TSO
> enabled drivers if that wasn't the case?)

I believe the generic code is doing the right thing.  I'm under the
impression that the non-working TSO indicates a bug in the driver.
Some drivers didn't account for the additional ethernet/VLAN header,
so the total size of the DMA segments exceeded 65535.  I've attached
a diff for ix(4).  It wasn't tested at all, as I don't have the
hardware to test it.

> > I think the use of m_defrag(9) in TSO is suboptimal.  All TSO
> > capable controllers are able to handle multiple TX buffers, so it
> > should have used m_collapse(9) rather than copying the entire
> > chain with m_defrag(9).
> >
> I haven't looked at these closely yet (plan on doing so to-day), but
> even m_collapse() looked like it copied data between mbufs and that
> is certainly suboptimal, imho.
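
To make that failure path concrete, the pattern in these drivers
looks roughly like the sketch below.  It is not lifted from ix(4) or
virtio(4); the xxx_* names and XXX_MAX_SEGS are placeholders I made
up, only the bus_dma(9) and mbuf(9) calls are real:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

#define	XXX_MAX_SEGS	32		/* placeholder scatter limit */

struct xxx_softc {			/* hypothetical minimal softc */
	bus_dma_tag_t	xxx_tx_tag;
	bus_dmamap_t	xxx_tx_map;
};

static int
xxx_encap(struct xxx_softc *sc, struct mbuf **m_head)
{
	bus_dma_segment_t segs[XXX_MAX_SEGS];
	struct mbuf *m;
	int error, nsegs;

	/*
	 * bus_dma(9) may coalesce physically contiguous mbufs, so
	 * nsegs can come back smaller than the number of mbufs in
	 * the chain.
	 */
	error = bus_dmamap_load_mbuf_sg(sc->xxx_tx_tag, sc->xxx_tx_map,
	    *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/*
		 * The chain has too many segments for the scatter
		 * table.  m_defrag(9) copies the whole chain into new
		 * mbufs; m_collapse(9) tries to squeeze it down to the
		 * requested number of fragments (still shuffling data
		 * between mbufs).  Either way it is extra work in the
		 * hot path.
		 */
		m = m_collapse(*m_head, M_NOWAIT, XXX_MAX_SEGS);
		if (m == NULL) {
			/* This is the silent-drop case. */
			m_freem(*m_head);
			*m_head = NULL;
			return (ENOBUFS);
		}
		*m_head = m;
		error = bus_dmamap_load_mbuf_sg(sc->xxx_tx_tag,
		    sc->xxx_tx_map, *m_head, segs, &nsegs,
		    BUS_DMA_NOWAIT);
	}
	return (error);
}

A failure at that point is where the silent drop you describe above
comes from.
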
> I don't see why a driver can't split the mbuf list, if there are too
> many entries for the scatter/gather, and do it in two iterations
> (much like tcp_output() does already, since the data length exceeds
> 65535 - tcp/ip header size).
>

It can split the mbuf list if the controller supports an increased
number of TX buffers.  Because the controller will consume the same
number of DMA descriptors for the mbuf list, drivers tend to impose a
limit on the number of TX buffers to save resources.

> However, at this point, I just want to find out if the long chain
> of mbufs is why TSO is problematic for these drivers, since I'll
> admit I'm getting tired of telling people to disable TSO (and I
> suspect some don't believe me and never try it).
>

TSO capable controllers tend to have various limitations (the first
TX buffer should contain the complete ethernet/IP/TCP header, the
ip_len of the IP header should be reset to 0, the TCP pseudo checksum
should be recomputed, etc.), and cheap controllers need more
assistance from the driver to let their firmware know the various
IP/TCP header offset locations in the mbuf.  Because this requires
IP/TCP header parsing, it's error prone and very complex.

> > > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE
> > > clusters, so the mbuf count drops from 34 to 18.
> > >
> > Could we make it conditional on size?
> >
> Not sure what you mean?  If you mean "the size of the read/write",
> that would be possible for NFSv3, but less so for NFSv4.  (The
> read/write is just one Op. in the compound for NFSv4 and there is no
> way to predict how much more data is going to be generated by
> subsequent Ops.)
>

Sorry, I should have been clearer.  You already answered my question.
Thanks.

> If by "size" you mean the amount of memory in the machine then, yes,
> it certainly could be conditional on that.  (I plan to try and look
> at the allocator to-day as well, but if others know of disadvantages
> with using MJUMPAGESIZE instead of MCLBYTES, please speak up.)
>
> Garrett Wollman already alluded to the MCLBYTES case being
> pre-allocated, but I'll admit I have no idea what the implications
> of that are at this time.
>
> > > If anyone has a TSO scatter/gather enabled net interface and can
> > > test this patch on it with NFS I/O (default of 64K rsize/wsize)
> > > when TSO is enabled and see what effect it has, that would be
> > > appreciated.
> > >
> > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > MJUMPAGESIZE clusters.
> > >
> > > rick
> > > ps: If the attachment doesn't make it through and you want the
> > > patch, just email me and I'll send you a copy.

--CE+1k2dSO48ffgeK
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="ix.TSO.diff"

Index: sys/dev/ixgbe/ixv.h
===================================================================
--- sys/dev/ixgbe/ixv.h	(revision 260903)
+++ sys/dev/ixgbe/ixv.h	(working copy)
@@ -172,7 +172,7 @@
 #define IXV_SCATTER		32
 #define IXV_RX_HDR		128
 #define MSIX_BAR		3
-#define IXV_TSO_SIZE		65535
+#define IXV_TSO_SIZE		(65535 + sizeof(struct ether_vlan_header))
 #define IXV_BR_SIZE		4096
 #define IXV_LINK_ITR		2000
 #define TX_BUFFER_SIZE		((u32) 1514)

--CE+1k2dSO48ffgeK--