From: Rick Macklem
Date: Sat, 1 Nov 2025 13:11:02 -0700
Subject: RFC: NFS over RDMA
To: FreeBSD CURRENT
Cc: Konstantin Belousov, Navdeep Parhar, "erj@freebsd.org", "aehrenberg@nvidia.com", "sreekanth.reddy@broadcom.com", John Baldwin
List-Archive: https://lists.freebsd.org/archives/freebsd-current

Hi,

I've had NFS over RDMA on my todo list for a very loonnnggg time.
I've avoided it because I haven't had a way to test it, but I'm now
going to start working on it. (A bunch of this work is already done
for NFS-over-TLS, which added code for handling M_EXTPG mbufs.)

From RFC-8166, there appear to be 4 operations the krpc needs to do:
send-rdma - Send on the payload stream (sending messages that are
            kept in order).
recv-rdma - Receive the above.
ddp-write - Do a write of DDP data.
ddp-read  - Do a read of DDP data.

So, here is how I see the krpc doing this.
An NFS write RPC for example:
- The NFS client code packages the Write RPC XDR as follows:
  - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
    that precede the write data.
  - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
  - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
    written.
  - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.

This would be passed to the krpc, which would...
- send the mbufs up to "start of ddp" in the payload stream.
- specify a ddp-read for the pages from the M_EXTPG mbufs
  and send that in the payload stream.
- send the remaining mbufs/mbuf_clusters in the payload stream.

The NFS server end would process the received payload stream,
putting the non-ddp stuff in mbufs/mbuf_clusters.
It would do the ddp-read of the data into anonymous pages it allocates
and would associate these with M_EXTPG mbufs.
It would put any remaining payload stream stuff for the RPC message in
additional mbufs/mbuf_clusters.
--> Call the NFS server with the mbuf list for processing.
- When the NFS server gets to the write data (in M_EXTPG mbufs)
  it would set up a uio/iovec for the pages and call VOP_WRITE().

Now, the above is straightforward for me, since I know the NFS and
krpc code fairly well.
But that is where my expertise ends.

So, what kind of calls do the drivers provide to send and receive
what RFC-8166 calls the payload stream?

And what kind of calls do the drivers provide to write and read DDP
chunks?

Also, if the above sounds way off the mark, please let me know.

As for testing, I am planning on hacking away at one of the
RDMA-in-software drivers in Linux to get it working well enough to
use for testing. Whatever seems to be easiest to get kinda working.

Anyhow, any comments would be appreciated, rick
ps: I did a bunch of cc's trying to get to the people that might know
    how the RDMA drivers work and what calls would do the above for them.
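
To make the client-side split described above a bit more concrete, here is a minimal userspace sketch of how the krpc might walk the list handed down by the NFS client and sort it into "inline before the marker", "DDP chunk" and "inline after the chunk". Everything in it is an assumption for illustration only: struct sk_buf is a stripped-down stand-in for struct mbuf, SK_DDP_START and SK_EXTPG are hypothetical flags standing in for the marker (possibly M_PROTO1) and M_EXTPG, and sk_split() is not an existing krpc function. It handles one chunk per message and no driver calls at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for mbuf flags; these are not the kernel's values. */
#define SK_DDP_START 0x1 /* marker buffer: "start of ddp-read" */
#define SK_EXTPG     0x2 /* buffer backed by pages, in the spirit of M_EXTPG */

/* Simplified chain element, loosely modelled on struct mbuf. */
struct sk_buf {
	struct sk_buf *next;
	size_t len;
	int flags;
};

/* How one RPC message would be divided up. */
struct sk_plan {
	size_t head_len; /* bytes sent in the payload stream before the marker */
	size_t ddp_len;  /* bytes in page-backed buffers, offered as a DDP chunk */
	size_t tail_len; /* bytes sent in the payload stream after the chunk */
	int ddp_bufs;    /* number of page-backed buffers in the chunk */
};

/*
 * Walk the chain once and classify each buffer.
 * Returns 0 on success, or -1 if the chain does not follow the
 * "XDR, marker, pages, XDR" ordering from the mail.
 */
int
sk_split(const struct sk_buf *m, struct sk_plan *plan)
{
	enum { HEAD, DDP, TAIL } state = HEAD;

	assert(plan != NULL);
	*plan = (struct sk_plan){ 0 };

	for (; m != NULL; m = m->next) {
		if (m->flags & SK_DDP_START) {
			if (state != HEAD)
				return (-1); /* this sketch allows one marker only */
			state = DDP;
			continue; /* the marker carries no data here */
		}
		if (m->flags & SK_EXTPG) {
			if (state != DDP)
				return (-1); /* pages must follow the marker */
			plan->ddp_len += m->len;
			plan->ddp_bufs++;
			continue;
		}
		/* Ordinary buffer: first one after the pages ends the chunk. */
		if (state == DDP) {
			if (plan->ddp_bufs == 0)
				return (-1); /* marker with no pages behind it */
			state = TAIL;
		}
		if (state == HEAD)
			plan->head_len += m->len;
		else
			plan->tail_len += m->len;
	}
	if (state == DDP && plan->ddp_bufs == 0)
		return (-1);
	return (0);
}
```

A caller would build a chain such as "120 bytes of XDR, marker, two 4096 byte page buffers, 8 bytes of XDR" and get back head_len 120, ddp_len 8192, tail_len 8. The real code would then need whatever the drivers offer for the actual send and for advertising the chunk, which is exactly the part I am asking about.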