Nov 1, 2025 at 2:10 PM Konstantin Belousov wrote:
> >>
> >> On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
> >>> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov wrote:
> >>>>
> >>>> Added Slava Schwartsman.
> >>>>
> >>>> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I've had NFS over RDMA on my todo list for a very loonnnggg
> >>>>> time. I've avoided it because I haven't had a way to test it,
> >>>>> but I'm now going to start working on it. (A bunch of this work
> >>>>> is already done for NFS-over-TLS, which added code for handling
> >>>>> M_EXTPG mbufs.)
> >>>>>
> >>>>> From RFC-8166, there appear to be 4 operations the krpc
> >>>>> needs to do:
> >>>>> send-rdma - Send on the payload stream (sending messages that
> >>>>>   are kept in order).
> >>>>> recv-rdma - Receive the above.
> >>>>> ddp-write - Do a write of DDP data.
> >>>>> ddp-read - Do a read of DDP data.
> >>>>>
> >>>>> So, here is how I see the krpc doing this.
> >>>>> An NFS write RPC, for example:
> >>>>> - The NFS client code packages the Write RPC XDR as follows:
> >>>>>   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> >>>>>     that precede the write data.
> >>>>>   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> >>>>>   - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
> >>>>>     written.
> >>>>>   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> >>>>>
> >>>>> This would be passed to the krpc, which would:
> >>>>> - Send the mbufs up to "start of ddp" in the payload stream.
> >>>>> - Specify a ddp-read for the pages from the M_EXTPG mbufs
> >>>>>   and send that in the payload stream.
> >>>>> - Send the remaining mbufs/mbuf_clusters in the payload stream.
> >>>>>
> >>>>> The NFS server end would process the received payload stream,
> >>>>> putting the non-ddp stuff in mbufs/mbuf_clusters.
> >>>>> It would do the ddp-read of the data into anonymous pages it allocates
> >>>>> and would associate these with M_EXTPG mbufs.
> >>>>> It would put any remaining payload stream stuff for the RPC message in
> >>>>> additional mbufs/mbuf_clusters.
> >>>>> --> Call the NFS server with the mbuf list for processing.
> >>>>>     - When the NFS server gets to the write data (in M_EXTPG mbufs),
> >>>>>       it would set up a uio/iovec for the pages and call VOP_WRITE().
> >>>>>
> >>>>> Now, the above is straightforward for me, since I know the NFS and
> >>>>> krpc code fairly well.
> >>>>> But that is where my expertise ends.
> >>>>>
> >>>>> So, what kind of calls do the drivers provide to send and receive
> >>>>> what RFC-8166 calls the payload stream?
> >>>>>
> >>>>> And what kind of calls do the drivers provide to write and read DDP
> >>>>> chunks?
> >>>>>
> >>>>> Also, if the above sounds way off the mark, please let me know.
> >>>>
> >>>> What you need is, most likely, the InfiniBand API or KPI to handle
> >>>> RDMA. It is driver-independent: just as for NFS over IP you use the
> >>>> system IP stack and do not call into the ethernet drivers. In fact,
> >>>> the transport used would most likely not be native IB, but IB over
> >>>> UDP (RoCE v2).
> >>>>
> >>>> IB verbs, which are the official interface for both kernel and user
> >>>> mode, are not well documented. An overview is provided by the document
> >>>> titled "RDMA Aware Networks Programming User Manual", which should
> >>>> be google-able. Otherwise, the InfiniBand specification is the
> >>>> reference.
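To make that concrete: in the verbs model, a ddp-write presumably becomes
an RDMA WRITE work request posted to a queue pair. Below is a rough sketch
against the Linux-derived kernel verbs headers in sys/ofed; it has not been
compiled against the tree, and the QP, DMA address, and keys are
placeholders that would come from memory registration and connection setup,
both of which are omitted here.

/*
 * Rough sketch only: post an RDMA WRITE (the "ddp-write" above) via the
 * Linux-derived kernel verbs KPI in sys/ofed.  All arguments are
 * placeholders; MR registration and QP/CM setup are omitted.
 */
#include <rdma/ib_verbs.h>

static int
krpc_ddp_write(struct ib_qp *qp, u64 dma_addr, u32 len, u32 lkey,
    u64 remote_addr, u32 rkey)
{
	struct ib_rdma_wr wr;
	struct ib_sge sge;
	const struct ib_send_wr *bad_wr;	/* non-const in older verbs */

	sge.addr = dma_addr;	/* DMA-mapped local buffer */
	sge.length = len;
	sge.lkey = lkey;	/* from the registered MR */

	memset(&wr, 0, sizeof(wr));
	wr.wr.opcode = IB_WR_RDMA_WRITE;
	wr.wr.sg_list = &sge;
	wr.wr.num_sge = 1;
	wr.wr.send_flags = IB_SEND_SIGNALED;	/* ask for a CQ completion */
	wr.remote_addr = remote_addr;	/* from the peer's RDMA segment */
	wr.rkey = rkey;

	return (ib_post_send(qp, &wr.wr, &bad_wr));
}

A ddp-read would be the same posting with IB_WR_RDMA_READ, and the
completion for either shows up on the completion queue bound to the QP.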
> >>> Thanks. I'll look at that. (I notice that the Intel code references
> >>> something they call Linux-OpenIB. Hopefully that looks about the same
> >>> and the glue needed to support non-Mellanox drivers isn't too
> >>> difficult?)
> >> OpenIB is probably a reference to the IB code in the Linux kernel
> >> proper plus the userspace libraries from rdma-core. This is what was
> >> forked/grown from OFED.
> >>
> >> Intel put its efforts into iWARP, which is sort of an alternative to
> >> RoCEv2. It has RFCs and works over TCP AFAIR, which causes problems
> >> for it.
> > Heh, heh. I'm trying to avoid the iWARP vs RoCE wars.;-)
> > (I did see a Mellanox white paper with graphs showing how RoCE
> > outperforms iWARP.)
> > Intel currently claims to support RoCE on its 810 and 820 NICs.
> > Broadcom also claims to support RoCE, but doesn't mention FreeBSD
> > drivers, and Chelsio does iWARP, afaik.
> >
> > For some reason, at the last NFSv4 Bakeathon, Chuck was testing with
> > iWARP and not RoCE? (I haven't asked Chuck why he chose that. It
> > might just be more convenient to set up the siw driver in Linux vs the
> > rxe one? He is the main author of RFC-8166, so he's the NFS-over-RDMA
> > guy.)
> >
> > But it does look like a fun project for the next year. (I recall jhb@
> > mentioning that NFS-over-TLS wouldn't be easy and it turned out to be
> > a fun little project.)
>
> Konstantin is right though that sys/ofed is Linux OpenIB and has an
> interface that should let you do RDMA (both RoCEv2 and iWARP). I'm hoping
> to use the APIs in sys/ofed to support NVMe over RDMA (both RoCEv2 and
> iWARP) at some point as well.
> > rick
> >
> >>
> >>> Btw, if anyone is interested in taking a more active involvement in
> >>> this, they are more than welcome to do so. (I'm going to be starting
> >>> where I understand things in the krpc/nfs. I'm not looking forward to
> >>> porting rxe, but will probably end up there. I have already had one
> >>> offer w.r.t. access to a lab that includes Mellanox hardware, but I
> >>> don't know if remote debugging will be practical yet.)
> >>>
> >>> rick
> >>>
> >>>>
> >>>> The IB implementation for us is still called OFED for historical
> >>>> reasons, and it is located in sys/ofed.
> >>>>
> >>>>>
> >>>>> As for testing, I am planning on hacking away at one of the
> >>>>> RDMA-in-software drivers in Linux to get it working well enough to
> >>>>> use for testing. Whatever seems to be easiest to get kinda working.
> >>>> Yes, the rxe driver is the software RoCE v2 implementation. We looked
> >>>> at the amount of work to port it. Its size is ~12 kLoC, which is
> >>>> comparable to libibverbs (the userspace core InfiniBand interface).
>
> Interesting. I'm currently working on merging back several OFED commits
> from Linux to sys/ofed (currently I have about 30 commits merged, some
> older than Hans' last merge and some newer; some of the newer ones should
> permit removing compat stubs for some of the newer APIs that are
> duplicated in bnxt, irdma, and mlx*). When I get a bit further along I'll
> post the branch I have for more testing (it is a bunch of individual
> cherry-picks rather than a giant merge).
>
> Porting over rxe could be useful for me as well for some work I am doing.

I have https://github.com/rmacklem/freebsd-rdma. For now, I'll only be
doing commits to it for the NFS and krpc files. It will be a while before
anything in it is useful for others.
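As a concrete sketch of the Write RPC packaging proposed at the top of the
thread: splice a zero-length mbuf flagged M_PROTO1 between the leading XDR
mbufs and the M_EXTPG data chain. The helper name krpc_mark_ddp and the
choice of M_PROTO1 as the "start of ddp-read" marker are assumptions taken
from the proposal above, not code that exists anywhere yet.

/*
 * Minimal sketch (hypothetical): build the Write RPC request chain with a
 * zero-length M_PROTO1 mbuf marking where the DDP-eligible data starts.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

struct mbuf *
krpc_mark_ddp(struct mbuf *xdrhead, struct mbuf *extpg_data,
    struct mbuf *xdrtail)
{
	struct mbuf *delim;

	/* Zero-length delimiter the krpc can test for via M_PROTO1. */
	delim = m_get(M_WAITOK, MT_DATA);
	delim->m_len = 0;
	delim->m_flags |= M_PROTO1;

	/* Link the chains by hand so the delimiter survives intact. */
	m_last(xdrhead)->m_next = delim;
	delim->m_next = extpg_data;
	if (xdrtail != NULL)
		m_last(extpg_data)->m_next = xdrtail;
	return (xdrhead);
}

The chains are linked by hand rather than with m_cat(9), since m_cat() may
coalesce adjacent mbufs and would quietly absorb a zero-length delimiter,
losing its flag along with it.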
I'll email when I get into the rxe port. (If you hurry, you can beat me
to it;-) Others are welcome to push/pull on the above. (Email if you need
permissions changes. I know diddly about github.)

rick

>
> --
> John Baldwin
>
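Finally, a minimal sketch of the server-side step described at the top of
the thread: wrapping the pages of a received M_EXTPG chain in a uio/iovec
for VOP_WRITE(). The function name and the fixed iovec bound are
illustrative, and the PHYS_TO_DMAP shortcut assumes a direct-mapped
architecture such as amd64; the m_epg_* accessors are the ones sys/mbuf.h
provides for unmapped mbufs.

/*
 * Minimal sketch (hypothetical): turn an M_EXTPG mbuf chain into a
 * uio/iovec suitable for VOP_WRITE().  Assumes a direct map.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <sys/proc.h>
#include <sys/uio.h>
#include <vm/vm.h>
#include <vm/vm_param.h>

#define	NFSD_MAXIOV	64		/* illustrative bound */

static int
nfsd_extpg_to_uio(struct mbuf *m, struct iovec *iv, struct uio *uiop)
{
	int i, niov, off, total;

	niov = total = 0;
	for (; m != NULL; m = m->m_next) {
		KASSERT((m->m_flags & M_EXTPG) != 0,
		    ("%s: mapped mbuf in chain", __func__));
		off = m->m_epg_1st_off;
		for (i = 0; i < m->m_epg_npgs; i++) {
			if (niov == NFSD_MAXIOV)
				return (EFBIG);
			/* Direct-map the physical page, as on amd64. */
			iv[niov].iov_base =
			    (char *)PHYS_TO_DMAP(m->m_epg_pa[i]) + off;
			iv[niov].iov_len = m_epg_pagelen(m, i, off);
			total += iv[niov].iov_len;
			niov++;
			off = 0;	/* only the first page is offset */
		}
	}
	uiop->uio_iov = iv;
	uiop->uio_iovcnt = niov;
	uiop->uio_offset = 0;		/* caller sets the file offset */
	uiop->uio_resid = total;
	uiop->uio_segflg = UIO_SYSSPACE;
	uiop->uio_rw = UIO_WRITE;
	uiop->uio_td = curthread;
	return (0);
}

On architectures without a direct map, the pages would first have to be
mapped into kernel virtual address space; UIO_SYSSPACE is what tells
VOP_WRITE() that the iovec addresses are kernel pointers.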