From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 20:30:18 2014
Date: Thu, 30 Jan 2014 15:30:16 -0500
From: J David
To: Rick Macklem
Cc: Bryan Venteicher, Garrett Wollman, freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
In-Reply-To: <1879662319.18746958.1391052668182.JavaMail.root@uoguelph.ca>
List-Id: Networking and TCP/IP with FreeBSD

On Wed, Jan 29, 2014 at 10:31 PM, Rick Macklem wrote:
>> I've been busy the last few days, and won't be able to get to any
>> code
>> until the weekend.

Is there likely to be more to it than just cranking the MAX_TX_SEGS
value and recompiling?  If so, is it something I could take on?

> Well, NFS hands TCP a list of 34 mbufs. If TCP only adds one, then
> increasing it from 34 to 35 would be all it takes. However, see below.

One thing I don't want to miss here is that an NFS block size of
65,536 is really suboptimal.  The largest possible TCP/IP packet is
65,535 bytes.  So by the time NFS adds its overhead and the total
amount of data to send lands in that ~65k range, the operation is
guaranteed to be split across at least two TCP packets: one
maximum-size packet and one tiny one.  That doubles a lot of the
network stack overhead, whether or not the packet ends up being
segmented into smaller pieces further down the road.

If NFS could be modified to respect the actual maximum size of a TCP
packet, generating a steady stream of 63.9k (or thereabouts) writes
instead of the current 64k-1k-64k-1k pattern, performance would
likely see another significant boost.  That would nearly double the
average throughput per packet, which would help with both network
latency and CPU load.
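To put rough numbers on that, here is the back-of-the-envelope
arithmetic as a throwaway C program.  The 200-byte RPC_OVERHEAD
figure is just a placeholder; the real overhead depends on the mount
and on the particular write RPC:

    /*
     * Sketch only: RPC_OVERHEAD is a made-up round number standing in
     * for the RPC record mark, call header, and NFS write arguments.
     */
    #include <stdio.h>

    #define IP_MAXPACKET    65535           /* 16-bit IP total-length limit */
    #define TCPIP_HEADERS   (20 + 20)       /* minimal IPv4 + TCP headers */
    #define RPC_OVERHEAD    200             /* placeholder, see above */

    int
    main(void)
    {
            int wsize = 65536;                              /* 64k NFS write */
            int max_payload = IP_MAXPACKET - TCPIP_HEADERS; /* per packet */
            int total = wsize + RPC_OVERHEAD;

            printf("bytes per write RPC:  %d\n", total);
            printf("payload per packet:   %d\n", max_payload);
            printf("packets needed:       %d\n",
                (total + max_payload - 1) / max_payload);   /* prints 2 */
            return (0);
    }

Pick a wsize that leaves enough headroom under that 65,535-byte limit
and the same arithmetic comes out to one packet per write, which is
the point above.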
It's also not 100% clear, but it seems like in some cases the existing
behavior also causes the TCP stack to park on the "leftover" bit and
wait for more data, which arrives as another 64k-plus chunk, and from
there on out there's no more correlation between TCP packets and NFS
operations, so an operation doesn't begin on a packet boundary.  That
continues as long as the load keeps up.  It's probably not good for
performance either, and it certainly confuses the heck out of tcpdump.

Probably 60k would be the next most reasonable size, since it's the
largest page-size multiple that fits into a TCP packet while still
leaving room for overhead.

Since the maximum size of a TCP packet is not an area where there's
really any flexibility, what would have to happen to NFS to make that
size (or arbitrary sizes) perform at its best within that constraint?
It's apparent from even trivial testing that performance drops
dramatically when the "use a power of two for NFS rsize/wsize"
recommendation isn't followed, but what is the origin of that
recommendation?  Is it something that could be changed?

> I don't think that m_collapse() is more likely to fail, since it
> only copies data to the previous mbuf when the entire mbuf that
> follows will fit and it's allowed. I'd assume that a ref count
> copied mbuf cluster doesn't allow this copy or things would be
> badly broken.)

m_collapse() checks M_WRITEABLE, which appears to cover the ref count
case.  (It's a dense macro, but it seems to require a ref count of 1
if a cluster is used.)

The cases where m_collapse() can succeed are pretty slim.  It pretty
much requires two consecutive underutilized buffers, which probably
explains why it fails so often in this code path.  Since one of its
two methods outright skips the packet header mbuf (to avoid the risk
of moving it), possibly the only case where it succeeds is when the
last data mbuf is short enough that whatever NFS trailers are being
appended can fit alongside it.

> Bottom line, I think calling either m_collapse() or m_defrag()
> should be considered a "last resort".

It definitely seems designed more for a case where eight different
stack layers each tack their own little header/trailer fingerprint
onto the packet, and that's not what's happening here.  (A rough
sketch of the driver fallback path in question is in the P.S. below.)

> Maybe the driver could reduce the size of if_hw_tsomax whenever
> it finds it needs to call one of these functions, to try and avoid
> a re-occurrence?

Since the issue is the number of segments rather than the packet
length, this seems risky.  If one of those touched-by-everybody
packets goes by, it may not be that large, but it would risk
permanently (until reboot) reducing the throughput of that interface.

Thanks!
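P.S. For concreteness, here is roughly the driver-side pattern I have
in mind when talking about m_collapse() as a last resort.  The xxx_*
names and the segment limit are made up for illustration; this is not
vtnet's actual code, just the usual shape of the fallback:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bus.h>
    #include <sys/malloc.h>
    #include <sys/mbuf.h>
    #include <machine/bus.h>

    #define XXX_MAX_TX_SEGS 34              /* hypothetical hardware limit */

    struct xxx_softc {                      /* hypothetical, trimmed down */
            bus_dma_tag_t   xxx_tx_tag;
            bus_dmamap_t    xxx_tx_map;
    };

    /*
     * Try to DMA-map the chain as-is; only if it has more segments than
     * the hardware can take, fall back to m_collapse() and retry once.
     */
    static int
    xxx_encap(struct xxx_softc *sc, struct mbuf **m_head)
    {
            bus_dma_segment_t segs[XXX_MAX_TX_SEGS];
            struct mbuf *m;
            int error, nsegs;

            error = bus_dmamap_load_mbuf_sg(sc->xxx_tx_tag, sc->xxx_tx_map,
                *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
            if (error == EFBIG) {
                    /* Last resort: try to shorten the mbuf chain. */
                    m = m_collapse(*m_head, M_NOWAIT, XXX_MAX_TX_SEGS);
                    if (m == NULL) {
                            m_freem(*m_head);
                            *m_head = NULL;
                            return (ENOBUFS);
                    }
                    *m_head = m;
                    error = bus_dmamap_load_mbuf_sg(sc->xxx_tx_tag,
                        sc->xxx_tx_map, *m_head, segs, &nsegs,
                        BUS_DMA_NOWAIT);
            }
            return (error);
    }

When m_collapse() can't find adjacent writable, underfilled mbufs to
merge, the retry fails too and the packet is dropped (or copied
wholesale by m_defrag() in drivers that go that far), which is why
hitting this path on every 64k write is so expensive.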