From owner-freebsd-net@freebsd.org Sun Jul 29 21:38:24 2018
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Adrian Chadd, "ryan@ixsystems.com", FreeBSD Net
Subject: Re: 9k jumbo clusters
Date: Sun, 29 Jul 2018 21:38:20 +0000

Adrian Chadd wrote:
>John-Mark Gurney wrote:
[stuff snipped]
>>
>> Drivers need to be fixed to use 4k pages instead of cluster. I really hope
>> no one is using a card that can't do 4k pages, or if they are, then they
>> should get a real card that can do scatter/gather on 4k pages for jumbo
>> frames..
>
>Yeah but it's 2018 and your server has like minimum a dozen million 4k
>pages.
>
>So if you're doing stuff like lots of network packet kerchunking why not
>have specialised allocator paths that can do things like "hey, always give
>me 64k physical contig pages for storage/mbufs because you know what?
>they're going to be allocated/freed together always."
>
>There was always a race between bus bandwidth, memory bandwidth and
>bus/memory latencies. I'm not currently on the disk/packet pushing side of
>things, but the last couple times I were it was at different points in that
>4d space and almost every single time there was a benefit from having a
>couple of specialised allocators so you didn't have to try and manage a few
>dozen million 4k pages based on your changing workload.
>
>I enjoy the 4k page size management stuff for my 128MB routers. Your 128G
>server has a lot of 4k pages. It's a bit silly.

Here's my NFS guy perspective.

I do think 9K mbuf clusters should go away. I'll note that I once coded NFS
so it would use 4K mbuf clusters for the big RPCs (write requests and read
replies), and I actually could get the mbuf cluster pool fragmented to the
point it stopped working on a small machine, so it is possible (although not
likely) to fragment even a 2K/4K mix.

For me, send and receive are two very different cases:
- For sending a large NFS RPC (let's say a reply to a 64K read), the NFS code
  will generate a list of 33 2K mbuf clusters. If the net interface doesn't
  do TSO, this is probably fine, since tcp_output() will end up busting this
  up into a bunch of TCP segments using the list of mbuf clusters, with
  TCP/IP headers added for each segment, etc...
- If the net interface does TSO, this long list goes down to the net driver
  and uses 34-35 ring entries to send it (the driver typically adds at least
  one segment for the MAC header). If the driver isn't buggy and the net chip
  supports lots of transmit ring entries, this works ok, but...
- If there was a 64K supercluster, the NFS code could easily use that for the
  64K of data, and the TSO-enabled net interface would use 2 transmit ring
  entries (one for the MAC/TCP/NFS header and one for the 64K of data). If
  the net interface can't handle a TSO segment over 65535 bytes, it will end
  up getting 2 TSO segments from tcp_output(), but that is still a lot less
  than 35.

I don't know enough about net hardware to know when/if this will help
performance, but it seems that it might, at least for some chipsets?

For receive, it seems that a 64K mbuf cluster is overkill for jumbo packets,
but as others have noted, they won't be allocated for long unless packets
arrive out of order, at least for NFS. (Other apps might not read the socket
for a while to get the data, so the clusters might sit in the socket receive
queue for a while.)

I chose 64K since that is what most net interfaces can handle for TSO these
days. (If it will soon be larger, I think this should be even larger, but all
of them the same size to avoid fragmentation.) For the send case for NFS, it
wouldn't even need to be a very large pool, since the clusters get freed as
soon as the net interface transmits the TSO segment.

For NFS, it could easily call mget_supercl() and then fall back on the
current code using 2K mbuf clusters if mget_supercl() failed, so a small pool
would be fine for the NFS send side. (See the rough sketch in the ps below.)

I'd like to see a pool for 64K or larger mbuf clusters for the send side.

For the receive side, I'll let others figure out the best solution (4K or
larger for jumbo clusters). I do think anything larger than 4K needs a
separate allocation pool to avoid fragmentation. (I don't know, but I'd guess
iSCSI could use them as well?)

rick
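
ps: Just to illustrate the send-side fallback idea, here's a rough, untested
sketch. mget_supercl() doesn't exist yet (it's the hypothetical 64K
supercluster allocator) and nfsm_alloc_data() is a made-up name; m_getm2()
is only used here as a stand-in for the current code that chains 2K clusters.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * Hypothetical allocator: would return an mbuf backed by a single 64K
 * supercluster, or NULL if that (small) pool is empty.
 */
struct mbuf *mget_supercl(int how);

/*
 * Try one 64K cluster for the RPC data; if the supercluster pool is
 * exhausted, fall back to chaining ordinary 2K clusters, which is what
 * the code does today.
 */
static struct mbuf *
nfsm_alloc_data(int len)
{
	struct mbuf *m;

	if (len <= 64 * 1024) {
		m = mget_supercl(M_NOWAIT);
		if (m != NULL)
			return (m);	/* 1-2 TSO segments instead of ~35 */
	}
	/* m_getm2() chains enough mbufs/2K clusters to hold len bytes. */
	return (m_getm2(NULL, len, M_WAITOK, MT_DATA, 0));
}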