List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
From: Rick Macklem <rick.macklem@gmail.com>
Date: Mon, 3 Nov 2025 22:10:46 -0800
Subject: Re: RFC: NFS over RDMA
To: John Baldwin
Cc: Konstantin Belousov, FreeBSD CURRENT, Navdeep Parhar, erj@freebsd.org, aehrenberg@nvidia.com, slavash@nvidia.com, sreekanth.reddy@broadcom.com
On Mon, Nov 3, 2025 at 6:35 AM John Baldwin wrote:
>
> On 11/1/25 17:26, Rick Macklem wrote:
> > On Sat, Nov 1, 2025 at 2:10 PM Konstantin Belousov wrote:
> >>
> >> On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
> >>> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov wrote:
> >>>>
> >>>> Added Slava Schwartsman.
> >>>>
> >>>> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I've had NFS over RDMA on my todo list for a very loonnnggg
> >>>>> time. I've avoided it because I haven't had a way to test it,
> >>>>> but I'm now going to start working on it. (A bunch of this work
> >>>>> is already done for NFS-over-TLS, which added code for handling
> >>>>> M_EXTPG mbufs.)
> >>>>>
> >>>>> From RFC-8166, there appear to be 4 operations the krpc
> >>>>> needs to do:
> >>>>> send-rdma - Send on the payload stream (sending messages that
> >>>>>   are kept in order).
> >>>>> recv-rdma - Receive the above.
> >>>>> ddp-write - Do a write of DDP data.
> >>>>> ddp-read - Do a read of DDP data.
> >>>>>
> >>>>> So, here is how I see the krpc doing this.
> >>>>> An NFS Write RPC, for example:
> >>>>> - The NFS client code packages the Write RPC XDR as follows:
> >>>>>   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> >>>>>     that precede the write data.
> >>>>>   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> >>>>>   - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
> >>>>>     written.
> >>>>>   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> >>>>>
> >>>>> This would be passed to the krpc, which would:
> >>>>> - send the mbufs up to "start of ddp" in the payload stream.
> >>>>> - specify a ddp-read for the pages from the M_EXTPG mbufs
> >>>>>   and send that in the payload stream.
> >>>>> - send the remaining mbufs/mbuf_clusters in the payload stream.
> >>>>>
> >>>>> The NFS server end would process the received payload stream,
> >>>>> putting the non-ddp stuff in mbufs/mbuf_clusters.
> >>>>> It would do the ddp-read of the data into anonymous pages it
> >>>>> allocates and would associate these with M_EXTPG mbufs.
> >>>>> It would put any remaining payload stream stuff for the RPC
> >>>>> message in additional mbufs/mbuf_clusters.
> >>>>> --> Call the NFS server with the mbuf list for processing.
> >>>>>     - When the NFS server gets to the write data (in M_EXTPG mbufs),
> >>>>>       it would set up a uio/iovec for the pages and call VOP_WRITE().
> >>>>>
> >>>>> Now, the above is straightforward for me, since I know the NFS and
> >>>>> krpc code fairly well.
> >>>>> But that is where my expertise ends.
> >>>>>
> >>>>> So, what kind of calls do the drivers provide to send and receive
> >>>>> what RFC-8166 calls the payload stream?
> >>>>>
> >>>>> And what kind of calls do the drivers provide to write and read DDP
> >>>>> chunks?
> >>>>>
> >>>>> Also, if the above sounds way off the mark, please let me know.
> >>>>
> >>>> What you need is, most likely, the InfiniBand API or KPI to handle
> >>>> RDMA. It is driver-independent, just as NFS over IP uses the system
> >>>> IP stack rather than calling ethernet drivers directly. In fact, the
> >>>> transport used would most likely not be native IB, but IB over UDP
> >>>> (RoCE v2).
> >>>>
> >>>> IB verbs, which are the official interface for both kernel and user
> >>>> mode, are not well documented. An overview is provided by the document
> >>>> titled "RDMA Aware Networks Programming User Manual", which should
> >>>> be google-able. Otherwise, the InfiniBand specification is the reference.
> >>> Thanks. I'll look at that. (I notice that the Intel code references
> >>> something they call Linux-OpenIB.
> >>> Hopefully that looks about the same, and the
> >>> glue needed to support non-Mellanox drivers isn't too difficult?)
> >> OpenIB is perhaps a reference to the IB code in the Linux kernel proper
> >> plus the userspace libraries from rdma-core. This is what was forked/grown
> >> from OFED.
> >>
> >> Intel put its effort into iWARP, which is sort of an alternative to
> >> RoCEv2. It has RFCs and works over TCP AFAIR, which causes problems for it.
> > Heh, heh. I'm trying to avoid the iWARP vs RoCE wars. ;-)
> > (I did see a Mellanox white paper with graphs showing how RoCE outperforms
> > iWARP.)
> > Intel currently claims to support RoCE on its 810 and 820 NICs.
> > Broadcom also claims to support RoCE, but doesn't mention FreeBSD
> > drivers, and Chelsio does iWARP, afaik.
> >
> > For some reason, at the last NFSv4 Bakeathon, Chuck was testing with
> > iWARP and not RoCE? (I haven't asked Chuck why he chose that. It
> > might just be more convenient to set up the siw driver in Linux vs the
> > rxe one? He is the main author of RFC-8166, so he's the NFS-over-RDMA guy.)
> >
> > But it does look like a fun project for the next year. (I recall jhb@
> > mentioning that NFS-over-TLS wouldn't be easy, and it turned out to be
> > a fun little project.)
>
> Konstantin is right, though, that sys/ofed is Linux OpenIB and has an
> interface that should let you do RDMA (both RoCEv2 and iWARP). I'm hoping
> to use the APIs in sys/ofed to support NVMe over RDMA (both RoCEv2 and
> iWARP) at some point as well.
>
> > rick
> >
> >>
> >>>
> >>> Btw, if anyone is interested in taking a more active involvement in this,
> >>> they are more than welcome to do so. (I'm going to be starting where I
> >>> understand things in the krpc/nfs. I'm not looking forward to porting rxe,
> >>> but will probably end up there. I have already had one offer w.r.t. access
> >>> to a lab that includes Mellanox hardware, but I don't know if remote
> >>> debugging will be practical yet.)
> >>>
> >>> rick
> >>>
> >>>>
> >>>> The IB implementation for us is still called OFED for historical reasons,
> >>>> and it is located in sys/ofed.
> >>>>
> >>>>>
> >>>>> As for testing, I am planning on hacking away at one of the software
> >>>>> RDMA drivers in Linux to get it working well enough to use for
> >>>>> testing. Whatever seems to be easiest to get kinda working.
> >>>> Yes, the rxe driver is the sw RoCE v2 implementation. We looked at the
> >>>> amount of work to port it. Its size is ~12 kLoC, which is compatible
> >>>> with libibverbs (the userspace core infiniband interface).
>
> Interesting. I'm currently working on merging back several OFED commits from
> Linux to sys/ofed (currently I have about 30 commits merged, some older than
> Hans' last merge and some newer; some of the newer ones should permit removing
> compat stubs for some of the newer APIs that are duplicated in bnxt, irdma,
> and mlx*). When I get a bit further along, I'll post the branch I have for
> more testing (it is a bunch of individual cherry-picks rather than a giant
> merge).
>
> Porting over rxe could be useful for me as well for some work I am doing.

I have https://github.com/rmacklem/freebsd-rdma. For now, I'll only be doing
commits to it for the NFS and krpc files. It will be a while before anything
in it is useful for others.

I'll email when I get into the rxe port. (If you hurry, you can beat me to
it. ;-)

Others are welcome to push/pull on the above. (Email if you need permissions
changes. I know diddly about github.)

rick

> --
> John Baldwin
>