Date:      Thu, 30 Jan 2014 17:44:03 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        J David <j.david.lists@gmail.com>
Cc:        Bryan Venteicher <bryanv@freebsd.org>, Garrett Wollman <wollman@freebsd.org>, freebsd-net@freebsd.org
Subject:   Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID:  <87942875.478893.1391121843834.JavaMail.root@uoguelph.ca>
In-Reply-To: <CABXB=RR1eDvdUAaZd73Vv99EJR=DFzwRvMTw3WFER3aQ+2+2zQ@mail.gmail.com>

J David wrote:
> On Wed, Jan 29, 2014 at 10:31 PM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
> >> I've been busy the last few days, and won't be able to get to any
> >> code
> >> until the weekend.
> 
> Is there likely to be more to it than just cranking the MAX_TX_SEGS
> value and recompiling?  If so, is it something I could take on?
> 
> > Well, NFS hands TCP a list of 34 mbufs. If TCP only adds one, then
> > increasing it from 34 to 35 would be all it takes. However, see
> > below.
> 
> One thing I don't want to miss here is that an NFS block size of
> 65,536 is really suboptimal.  The largest size of an IP datagram is
> 65535.  So by the time NFS adds the overhead on and the total amount
> of data to be sent winds up in that ~65k range, it guarantees that
> the
> operation has to be split into at least two TCP packets, one
> max-size and one tiny one.  This doubles a lot of the network stack
> overhead, regardless of whether the packet ends up being segmented
> into tiny bits down the road or not.
> 
For your virtual network, yes. For the underlying file system on the
server (which would not normally be in memory), a large block size
will normally be good. (No one size fits all, which is why there are
the rsize/wsize mount options.) To be honest, the limit is MAXBSIZE,
which just happens to be 64K at this time. I'd like to see MAXBSIZE
increased to at least 128K, since I've been told that is the default
recordsize for ZFS.

Also, for real networks, the NFS RPC message will be broken into
quite a few packets to go on the wire, as far as I know. (I don't
think there are real networks using a 64K jumbo packet, are there?)
For my hardware, the packets will be 1500 bytes each on the wire,
since nothing I have does jumbo packets.

Unfortunately, NFS adds a little bit to the front of the data, so
an NFS RPC will always be a little bit more than a power of 2 in
size for reads/writes of a power of 2. Also, most NFS RPC messages
are small, so NFS traffic is always going to have a lot of small
TCP segments interspersed with a few large ones (and going in both
directions on the TCP connection concurrently).
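To put rough numbers on the last two paragraphs, here is a small,
illustrative calculation (the 160 byte RPC/NFS header figure is an
assumption for illustration only; the real overhead varies with the
RPC, and the MSS here ignores TCP options):

/*
 * Illustrative only: how big a 64K NFS write RPC really is, and how
 * many 1500 byte MTU frames it becomes.  The 160 byte header figure
 * is an assumed value.
 */
#include <stdio.h>

int
main(void)
{
    int wsize = 65536;          /* NFS write size */
    int rpchdr = 160;           /* assumed RPC + NFS header overhead */
    int mss = 1500 - 40;        /* MTU minus IP and TCP headers */
    int rpclen = wsize + rpchdr;

    printf("write RPC length: %d bytes (IP datagram limit 65535)\n",
        rpclen);
    printf("wire frames at MSS %d: %d\n", mss,
        (rpclen + mss - 1) / mss);
    return (0);
}

So a 64K write RPC is always a little over the 65535 byte limit and
turns into roughly 45 frames on a 1500 byte MTU, whatever TSO does
above the driver.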

Now, I am not sure why 65535 (the largest IP datagram) was chosen as
the default limit for TSO segments. (From my point of view, it would
be nice if the limit were larger, assuming there is also a limit on
the number of mbufs in the list, so that calls to m_collapse()/m_defrag()
are avoided. I am hoping the networking types consider my recent post
and maybe the suggestion of having an if_hw_tsomaxseg limit along with
if_hw_tsomax.)
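As a sketch of what I have in mind (if_hw_tsomax is the existing
struct ifnet field; if_hw_tsomaxseg is only the proposal, and
DRIVER_MAX_TX_SEGS is a made-up placeholder, not real code):

/*
 * Illustration only.  if_hw_tsomax already exists; if_hw_tsomaxseg is
 * the proposed field and DRIVER_MAX_TX_SEGS is a placeholder for
 * whatever segment count the hardware can handle.
 */
ifp->if_hw_tsomax = IP_MAXPACKET;               /* today's 65535 byte cap */
ifp->if_hw_tsomaxseg = DRIVER_MAX_TX_SEGS - 1;  /* proposed mbuf count cap */

The idea would be for tcp_output() to count the mbufs in the chain
against if_hw_tsomaxseg before handing it down, instead of the driver
finding out the hard way and calling m_collapse()/m_defrag().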

> If NFS could be modified to respect the actual size of a TCP packet,
> generating a steady stream of 63.9k (or thereabout) writes instead of
> the current 64k-1k-64k-1k, performance would likely see another
> significant boost.  This would nearly double the average throughput
> per packet, which would help with network latency and CPU load.
> 
> It's also not 100% clear but it seems like in some cases the existing
> behavior also causes the TCP stack to park on the "leftover" bit and
> wait for more data, which comes in another >64k chunk, and from there
> on out there's no more correlation between TCP packets and NFS
> operations, so an operation doesn't begin on a packet boundary.  That
> continues as long as load keeps up.  That's probably not good for
> performance either.  And it certainly confuses the heck out of
> tcpdump.
> 
Well, since NFS sets the TCP_NODELAY socket option, that shouldn't
occur in the TCP layer. If some network device driver is delaying,
waiting for more to send, then I'd say that device driver is broken.
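(For reference, the kernel NFS code sets the option on its socket
internally; the userland equivalent is simply this, shown only as an
illustration:)

/* Userland illustration of the TCP_NODELAY option the NFS code sets. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int
set_nodelay(int s)
{
    int one = 1;

    return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)));
}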

> Probably 60k would be the next most reasonable size, since it's the
> largest page size multiple that will fit into a TCP packet while still
> leaving room for overhead.
> 
> Since the max size of TCP packets is not an area where there's really
> any flexibility, what would have to happen to NFS to make that (or
> arbitrary values) perform at its best within that constraint?
> 
For real NFS environments, the performance of the file system and
underlying disk subsystem is generally more important than the network.
(Your benchmark has artificially taken the file system on disk out of
 the mix, so you will see an exaggerated effect from network performance.
 This is fine if you are looking for network bottlenecks, but not if
 you want to relate this to performance of a real NFS environment.)
I already mentioned that the Linux client doing file_sync 8K writes
will result in poor performance of a server's disk file system. (Some
NAS vendors avoid this by using non-volatile RAM in the server as stable
storage, but a FreeBSD server can't expect such hardware to be available.)

> It's apparent from even trivial testing that performance is
> dramatically affected if the "use a power of two for NFS rsize/wsize"
> recommendation isn't followed, but what is the origin of that?  Is it
> something that could be changed?
> 
Because disk file systems on file servers always use block sizes that
are a power of 2. A non-power-of-2 wsize generates writes that do not
line up with file system block boundaries, forcing the server into
read/modify/write cycles.
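To illustrate, assuming a 32K file system block size purely for the
example:

/*
 * Illustration only: a power of 2 wsize maps onto whole file system
 * blocks, while a non-power-of-2 wsize leaves a partial block that
 * the server must read/modify/write.  32K blocks are assumed.
 */
#include <stdio.h>

int
main(void)
{
    int fsbsize = 32768;
    int sizes[2] = { 65536, 61440 };    /* 64K vs 60K writes */
    int i;

    for (i = 0; i < 2; i++)
        printf("wsize %d: %d full blocks + %d byte partial block\n",
            sizes[i], sizes[i] / fsbsize, sizes[i] % fsbsize);
    return (0);
}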

> > I don't think that m_collapse() is more likely to fail, since it
> > only copies data to the previous mbuf when the entire mbuf that
> > follows will fit and it's allowed. I'd assume that a ref count
> > copied mbuf cluster doesn't allow this copy or things would be
> > badly broken.)
> 
> m_collapse checks M_WRITABLE which appears to cover the ref count
> case.  (It's a dense macro, but it seems to require a ref count of 1
> if a cluster is used.)
> 
> The cases where m_collapse can succeed are pretty slim.  It pretty
> much requires two consecutive underutilized buffers, which probably
> explains why it fails so often in this code path.  Since one of its
> two methods outright skips the packet header mbuf (to avoid risk of
> moving it), possibly the only case where it succeeds is when the last
> data mbuf is short enough that whatever NFS trailers are being
> appended can fit with it.
> 
Yes, I would agree with this. (I think I somehow mistyped what I
meant to say. I didn't mean to imply that m_collapse() will usually
succeed for these long NFS mbuf list RPC messages.)
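For what it's worth, the place this usually shows up is the driver's
transmit path, roughly like the sketch below (txq, txb and
MAX_TX_SEGS are placeholders here, not any particular driver's code):

/* Sketch of the common transmit idiom; names are placeholders. */
error = bus_dmamap_load_mbuf_sg(txq->tag, txb->map, m, segs, &nsegs,
    BUS_DMA_NOWAIT);
if (error == EFBIG) {
    /* Chain has too many fragments for the hardware; last resort. */
    struct mbuf *m2 = m_collapse(m, M_NOWAIT, MAX_TX_SEGS);

    if (m2 == NULL) {
        m_freem(m);
        return (ENOBUFS);
    }
    m = m2;
    error = bus_dmamap_load_mbuf_sg(txq->tag, txb->map, m, segs,
        &nsegs, BUS_DMA_NOWAIT);
}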

> > Bottom line, I think calling either m_collapse() or m_defrag()
> > should be considered a "last resort".
> 
> It definitely seems more designed for a case where 8 different stack
> layers each put their own little header/trailer fingerprint on the
> packet, and that's not what's happening here.
> 
> > Maybe the driver could reduce the size of if_hw_tsomax whenever
> > it finds it needs to call one of these functions, to try and avoid
> > a re-occurrence?
> 
> Since the issue is one of segment length rather than packet length,
> this seems risky.  If one of those touched-by-everybody packets goes
> by, it may not be that large, but it would risk permanently (until
> reboot) dropping the throughput of that interface.
> 
Agreed. I think adding an if_hw_tsomaxseg that TCP can use is preferable.
I didn't think of that until after sending the first post.
Also, I think adding it implies a driver KPI change, which means it can't
be done for 9.n or 10.n.

rick

> Thanks!
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscribe@freebsd.org"
> 


