Date: Tue, 18 Mar 2014 19:57:37 -0400 (EDT) From: Rick Macklem <rmacklem@uoguelph.ca> To: Alexander Motin <mav@FreeBSD.org> Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org> Subject: Re: review/test: NFS patch to use pagesize mbuf clusters Message-ID: <2092082855.24699674.1395187057807.JavaMail.root@uoguelph.ca> In-Reply-To: <5328065D.60201@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
Alexander Motin wrote: > Hi. > > On 18.03.2014 03:26, Rick Macklem wrote: > > Several of the TSO capable network interfaces have a limit of > > 32 mbufs in the transmit mbuf chain (the drivers call these > > transmit > > segments, which I admit I find confusing). > > > > For a 64K read/readdir reply or 64K write request, NFS passes > > a list of 34 mbufs down to TCP. TCP will split the list, since > > it is slightly more than 64K bytes, but that split will normally > > be a copy by reference of the last mbuf cluster. As such, normally > > the network interface will get a list of 34 mbufs. > > > > For TSO enabled interfaces that are limited to 32 mbufs in the > > list, the usual workaround in the driver is to copy { real copy, > > not copy by reference } the list to 32 mbuf clusters via > > m_defrag(). > > (A few drivers use m_collapse() which is less likely to succeed.) > > > > As a workaround to this problem, the attached patch modifies NFS > > to use larger pagesize clusters, so that the 64K RPC message is > > in 18 mbufs (assuming a 4K pagesize). > > > > Testing on my slow hardware which does not have TSO capability > > shows it to be performance neutral, but I believe avoiding the > > overhead of copying via m_defrag() { and possible failures > > resulting in the message never being transmitted } makes this > > patch worth doing. > > > > As such, I'd like to request review and/or testing of this patch > > by anyone who can do so. > > First, I've tried to find respective NIC to test: cxgb/cxgbe have > limit > of 36, and so probably unaffected, ixgb -- 100, igb -- 64, only on em > I've found limit of 32. > When I did a find/grep on sys/dev, I found a bunch of them, but I didn't save the output. The case that came up was virtio and the author fixed that driver, since there was no hardware limitation. The "ix" driver (in sys/dev/ixgbe) is an example for some chips. I believe the 82599 chips have the 32 limit. > I run several profiles on em NIC with and without the patch. I can > confirm that without the patch m_defrag() is indeed called, while > with > patch it is not any more. But profiler shows to me that very small > amount of time (percents or even fractions) is spent there. I can't > measure the effect (my Core-i7 desktop test system has only about 5% > CPU > load while serving full 1Gbps NFS over the em), though I can't say > for > sure that effect can't be there on some low-end system. > Well, since m_defrag() creates a new list and bcopy()s the data, there is some overhead, although I'm not surprised it isn't that easy to measure. (I thought your server built entirely of SSDs might show a difference.) I am more concerned with the possibility of m_defrag() failing and the driver dropping the reply, forcing the client to do a fresh TCP connection and retry of the RPC after a long timeout (1minute or more). This will show up as "terrible performance" for users. Also, some drivers use m_collapse() instead of m_defrag() and these will probably be "train wrecks". I get cases where reports of serious NFS problems get "fixed" by disabling TSO and I was hoping this would work around that. > I am also not very sure about replacing M_WAITOK with M_NOWAIT. > Instead > of waiting a bit while VM find a cluster, NFSMCLGET() will return > single > mbuf, as result, replacing chain of 2K clusters instead of 4K ones > with > chain of 256b mbufs. > I hoped the comment in the patch would explain this. When I was testing (on a small i386 system), I succeeded in getting threads stuck sleeping on "btalloc" a couple of times when I used M_WAITOK for m_getjcl(). As far as I could see, this indicated that it hasd run out of kernel address space, but I'm not sure. --> That is why I used M_NOWAIT for m_getjcl(). As for using MCLGET(..M_NOWAIT), the main reason for doing that was I noticed that the code does a drain on zone_mcluster if this allocation attempt for a cluster fails. For some reason, m_getcl() and m_getjcl() do not do this drain of the zone? I thought the drain might help memory constrained cases. To be honest, I've never been able to get a MCLGET(..M_NOWAIT) to fail during testing. rick > -- > Alexander Motin >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2092082855.24699674.1395187057807.JavaMail.root>