From owner-freebsd-net@freebsd.org Sun Jul 29 21:38:24 2018
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Adrian Chadd, "ryan@ixsystems.com", FreeBSD Net
Subject: Re: 9k jumbo clusters
Date: Sun, 29 Jul 2018 21:38:20 +0000

Adrian Chadd wrote:
>John-Mark Gurney wrote:
[stuff snipped]
>>
>> Drivers need to be fixed to use 4k pages instead of cluster. I really hope
>> no one is using a card that can't do 4k pages, or if they are, then they
>> should get a real card that can do scatter/gather on 4k pages for jumbo
>> frames..
>
>Yeah but it's 2018 and your server has like minimum a dozen million 4k
>pages.
>
>So if you're doing stuff like lots of network packet kerchunking why not
>have specialised allocator paths that can do things like "hey, always give
>me 64k physical contig pages for storage/mbufs because you know what?
>they're going to be allocated/freed together always."
>
>There was always a race between bus bandwidth, memory bandwidth and
>bus/memory latencies. I'm not currently on the disk/packet pushing side of
>things, but the last couple times I were it was at different points in that
>4d space and almost every single time there was a benefit from having a
>couple of specialised allocators so you didn't have to try and manage a few
>dozen million 4k pages based on your changing workload.
>
>I enjoy the 4k page size management stuff for my 128MB routers. Your 128G
>server has a lot of 4k pages. It's a bit silly.

Here's my NFS guy perspective.

I do think 9K mbuf clusters should go away. I'll note that I once coded NFS
so it would use 4K mbuf clusters for the big RPCs (write requests and read
replies), and I actually could get the mbuf cluster pool fragmented to the
point it stopped working on a small machine, so it is possible (although not
likely) to fragment even a 2K/4K mix.

For me, send and receive are two very different cases:
- For sending a large NFS RPC (let's say a reply to a 64K read), the NFS code
  will generate a list of 33 2K mbuf clusters. If the net interface doesn't
  do TSO, this is probably fine, since tcp_output() will end up busting this
  up into a bunch of TCP segments using the list of mbuf clusters, with
  TCP/IP headers added for each segment, etc...
- If the net interface does TSO, this long list goes down to the net driver
  and uses 34-35 ring entries to send it (the driver typically adds at least
  one segment for the MAC header). If the driver isn't buggy and the net chip
  supports lots of transmit ring entries, this works ok, but...
- If there was a 64K supercluster, the NFS code could easily use that for the
  64K of data, and the TSO-enabled net interface would use 2 transmit ring
  entries (one for the MAC/TCP/NFS header and one for the 64K of data). If
  the net interface can't handle a TSO segment over 65535 bytes, it will end
  up getting 2 TSO segments from tcp_output(), but that is still a lot less
  than 35.

I don't know enough about net hardware to know when/if this will help
performance, but it seems that it might, at least for some chipsets?

For receive, it seems that a 64K mbuf cluster is overkill for jumbo packets,
but as others have noted, they won't be allocated for long unless packets
arrive out of order, at least for NFS. (Other apps might not read the socket
for a while to get the data, so the clusters might sit in the socket receive
queue for a while.)

I chose 64K since that is what most net interfaces can handle for TSO these
days. (If it will soon be larger, I think this should be even larger, but all
of them the same size to avoid fragmentation.) For the send case for NFS, it
wouldn't even need to be a very large pool, since the clusters get freed as
soon as the net interface transmits the TSO segment.

For NFS, it could easily call mget_supercl() and then fall back on the
current code using 2K mbuf clusters if mget_supercl() failed, so a small pool
would be fine for the NFS send side. (See the rough sketch in the ps below.)

I'd like to see a pool for 64K or larger mbuf clusters for the send side.

For the receive side, I'll let others figure out the best solution (4K or
larger for jumbo clusters). I do think anything larger than 4K needs a
separate allocation pool to avoid fragmentation. (I don't know, but I'd guess
iSCSI could use them as well?)

rick
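
ps: Just to illustrate the send-side fallback idea, here's a rough, untested
sketch. mget_supercl() doesn't exist yet (it's the hypothetical 64K
supercluster allocator) and nfsm_alloc_data() is a made-up name; m_getm2()
is only used here as a stand-in for the current code that chains 2K clusters.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * Hypothetical allocator: would return an mbuf backed by a single 64K
 * supercluster, or NULL if that (small) pool is empty.
 */
struct mbuf *mget_supercl(int how);

/*
 * Try one 64K cluster for the RPC data; if the supercluster pool is
 * exhausted, fall back to chaining ordinary 2K clusters, which is what
 * the code does today.
 */
static struct mbuf *
nfsm_alloc_data(int len)
{
	struct mbuf *m;

	if (len <= 64 * 1024) {
		m = mget_supercl(M_NOWAIT);
		if (m != NULL)
			return (m);	/* 1-2 TSO segments instead of ~35 */
	}
	/* m_getm2() chains enough mbufs/2K clusters to hold len bytes. */
	return (m_getm2(NULL, len, M_WAITOK, MT_DATA, 0));
}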