Date: Thu, 30 Jan 2014 23:36:18 -0500
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: Rick Macklem
Cc: Bryan Venteicher, Garrett Wollman, freebsd-net@freebsd.org

On Thu, Jan 30, 2014 at 5:44 PM, Rick Macklem wrote:
> I'd like to see MAXBSIZE
> increased to at least 128K, since that is the default block size for
> ZFS, I've been told.

Regrettably, that is incomplete. The ZFS record size is variable *up to*
128kiB by default; it's more of an upper limit than a hard and fast rule.
Also, it is configurable at runtime on a per-filesystem basis. Although
any file >128kiB probably does use 128kiB blocks, ZFS has ARC and L2ARC
and manages its own prefetch. As long as NFS treats rsize/wsize as a
fixed-size block, the number of workloads that benefit from pushing it to
128kiB is probably very limited.

> Also, for real networks, the NFS RPC message will be broken into
> quite a few packets to go on the wire, as far as I know. (I don't
> think there are real networks using a 64K jumbo packet, is there?)
> For my hardware, the packets will be 1500bytes each on the wire,
> since nothing I have does jumbo packets.

Real environments for NFS in 2014 are 10gig LANs with hardware TSO that
makes the overhead of segmentation negligible. As someone else on this
thread has already pointed out, efficiently utilizing TSO is essentially
mandatory to make good use of 10gig hardware.
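To put rough numbers on the packet counts involved (a back-of-the-envelope
sketch only: it assumes a 1500-byte MTU with TCP timestamps, so an MSS of
about 1448, the current 65535-byte TSO ceiling, and each payload sent by
itself on an idle connection; it is not taken from the NFS or TCP code):

#include <stdio.h>

/*
 * Illustrative only: wire segments and TSO bursts needed to move one
 * read/write payload of a given size, under the assumptions above.
 */
int
main(void)
{
	const int mss = 1448;          /* 1500 - 40 (IP+TCP) - 12 (timestamps) */
	const int tso_limit = 65535;   /* largest IP datagram, current TSO cap */
	const int sizes[] = { 61440, 65536, 131072 };   /* 60k, 64k, 128k */

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		int payload = sizes[i];
		int wire_segs = (payload + mss - 1) / mss;
		int tso_bursts = (payload + tso_limit - 1) / tso_limit;
		printf("%6d byte payload: ~%2d wire segments, %d TSO burst(s)\n",
		    payload, wire_segs, tso_bursts);
	}
	return (0);
}

Note that a 64k payload is already one byte past the 65535-byte ceiling
before any RPC header goes in front of it, which ties into the "a little
bit more than a power of 2" point below.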
So as far as FreeBSD is concerned, yes, many networks effectively have a
64k MTU (for TCP only, since FreeBSD does not implement GSO at this time)
and it should act accordingly when dealing with them.

This NFS buffer size limit nearly doubles the number of TCP packets it
takes to move the same amount of data. Regardless of how those packets are
eventually segmented -- which can be effectively ignored in the real world
of hardware TSO -- the overhead of TCP and IP is not nil, cannot be
offloaded, and doubling it is not a good thing. It doubles every step down
to the very bottom, including optional stuff like PF if it is hanging
around in there.

> Unfortunately, NFS adds a little bit to the front of the data, so
> an NFS RPC will always be a little bit more than a power of 2 in
> size for reads/writes of a power of 2.

That's why NFS should be able to operate on page-sized multiples rather
than powers of 2. Then it can operate on the filesystem using the best
size for that, operate on the network using the best size for that, and
mediate the two using page-sized jumbo clusters. If you know the
underlying filesystem block size, by all means, read or write based on it
where appropriate.

> Now, I am not sure why 65535 (largest ip datagram) has been chosen
> as the default limit for TSO segments?

The process of TCP segmentation, whether offloaded or not, is performed on
a single TCP packet. It operates by reusing that packet's header over and
over for each segment, with slight modifications. Consequently, the
maximum size that can be offloaded is the maximum size that can be
segmented: one packet.

> Well, since NFS sets the TCP_NODELAY socket option, that shouldn't
> occur in the TCP layer. If some network device driver is delaying,
> waiting for more to send, then I'd say that device driver is broken.

This is not a driver issue. TCP_NODELAY means "don't wait for more data."
It doesn't mean "don't send more data that is ready to be sent." If
there's more data already present on the stream by the time the TCP stack
gets to it, which is possible in an SMP environment, TCP_NODELAY won't, as
far as I know, prevent it from being sent in the next available packet.
This isn't necessarily something that happens every time, or even
consistently, but when you're sending a hundred thousand packets per
second, it looks like the chain can indeed come off the bicycle. NFS is
not sending packets to the TCP stack; it is sending stream data. With
TCP_NODELAY it should be possible to engineer a one send = one packet
correspondence, but that's true if and only if that send is no larger than
the max packet size.
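For what it's worth, here is a minimal sketch of what I mean (generic
socket code assuming an already-connected TCP socket "s"; it is not lifted
from the NFS client or server):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

/*
 * TCP_NODELAY disables Nagle: "don't hold a small segment back waiting
 * for an ACK."  It does not promise that one write() becomes one packet.
 * If both writes below are already sitting in the socket buffer by the
 * time the stack builds the next segment, they can legitimately share it.
 */
void
send_two(int s)
{
	int one = 1;
	char a[8192], b[8192];

	memset(a, 'a', sizeof(a));
	memset(b, 'b', sizeof(b));

	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) == -1)
		err(1, "setsockopt(TCP_NODELAY)");

	/* Two back-to-back sends of stream data, not two "packets". */
	if (write(s, a, sizeof(a)) == -1 || write(s, b, sizeof(b)) == -1)
		err(1, "write");
}

Whether those 16k leave the host as a dozen segments, one TSO burst, or
something in between is up to the stack and the NIC at that moment, not to
the caller.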
> For real NFS environments, the performance of the file system and
> underlying disk subsystem is generally more important than the network.

Maybe this is the case if NFS is serving from one spinning disk. It's
definitely not the case for ZFS installs with 128GiB of RAM, shelves of
SAS drives, terabytes of SSD L2ARC, and STEC slog devices. The performance
of the virtual environment we're using as a test platform is remarkably
close to that. It just has the benefit of being two orders of magnitude
cheaper and therefore something that can be set aside for testing stuff
like this.

> (Some
> NAS vendors avoid this by using non-volatile ram in the server as stable
> storage, but a FreeBSD server can't expect such hardware to be available.)

Nonvolatile slogs are all but mandatory in any ZFS-backed NFS fileserver
deployment. Like TSO, this is not hypothetical; it is standard for
production deployments.

>> but what is the origin of that? Is it
>> something that could be changed?
>>
> Because disk file systems on file servers always use block sizes that
> are a power of 2.

Maybe my question wasn't phrased well. What is the origin of the huge
performance drop when a non-power-of-two size is used? This is visible
under small random ops, where the extra data in a 64k read versus a 60k
read is never used and the next block is almost certainly not going to be
read next. So it's very weird (to me) that performance drops as much as it
does.

> Agreed. I think adding a if_hw_tsomaxseg that TCP can use is preferable.

It may be valuable for other workloads, to prevent drops on some kind of
pathologically sliced-up packets, but jumbo cluster support in NFS should
pretty much guarantee that it is not going to have a problem in this area
with any interface in common use.

Thanks!