Date: Sun, 22 May 2022 22:26:07 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Adam Stylinski <kungfujesus06@gmail.com>
Cc: John <jwd@freebsd.org>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zfs/nfsd performance limiter
Message-ID: <YQBPR0101MB9742056AFEF03C6CAF2B7F56DDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAJwHY9WHE4MFScuhry7v9MqRQBSTNY5XYCH5qfO4xEn6Swwtrw@mail.gmail.com>
References: <CAJwHY9WMOOLy=rb9FNjExQtYej21Zv=Po9Cbg=19gkw1SLFSww@mail.gmail.com>
 <YonqGfJST09cUV6W@FreeBSD.org>
 <CAJwHY9W-3eEXR+jTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com>
 <YQBPR0101MB97429323AD5F921BE76C613EDDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
 <CAJwHY9WHE4MFScuhry7v9MqRQBSTNY5XYCH5qfO4xEn6Swwtrw@mail.gmail.com>

Adam Stylinski <kungfujesus06@gmail.com> wrote:
[stuff snipped]
>
> However, in general, RPC RTT will define how well NFS performs and not
> the I/O rate for a bulk file read/write.
Let's take this RPC RTT thing a step further...
- If I got the math right, at 40Gbps, 1Mbyte takes about 200usec on the wire
  (1Mbyte is 8 x 10^6 bits; 8 x 10^6 / 40 x 10^9 = 200usec).
Without readahead, the protocol looks like this:
  Client                          Server    (time going down the screen)
  small Read request --->
                          <-- 1Mbyte reply
  small Read request --->
                          <-- 1Mbyte reply
The 1Mbyte replies take 200usec on the wire.

Then suppose your ping time is 400usec (I see about 350usec on my little LAN).
- The wire is only transferring data about half of the time, because getting
  the small request message to the server takes almost as long as the 1Mbyte
  reply takes on the wire.

As you can see, readahead (where multiple reads are done concurrently)
is critical for this case. I have no idea how Linux decides to do readahead.
(FreeBSD defaults to 1 readahead, with a mount option that can increase
that.)

Now, net interfaces normally do interrupt moderation. This is done to
avoid an interrupt storm during bulk data transfer. However, interrupt
moderation results in an interrupt delay for handling the small Read request
message.
--> Interrupt moderation can increase RPC RTT. Turning it off, if possible,
might help (see the ethtool example below for the Linux side).

So, ping the server from the client to see roughly what your RTT is.
Also, you could look at some traffic in wireshark, to see what readahead
is happening and what the RPC RTT is.
(You can capture with "tcpdump", but wireshark knows how to decode
NFS properly.)
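For example, something like this, run on the Linux client, should be enough
for wireshark to chew on. (The interface name is a placeholder, NFS is
assumed to be on the standard port 2049, and 10.5.5.1 is the server if I am
reading your iperf output correctly.)

    # rough RTT estimate from the client to the server
    ping -c 10 10.5.5.1

    # capture a short burst of NFS traffic during a read test,
    # then open the file in wireshark
    tcpdump -i <ifname> -s 0 -c 200000 -w /tmp/nfs.pcap host 10.5.5.1 and port 2049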
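And if you want to try turning interrupt moderation off, as mentioned above,
ethtool can usually do it on the Linux side. (The interface name is again a
placeholder, whether the mlx4 driver honours all of these settings is
something you would have to check, and the FreeBSD end has its own
driver-specific sysctls that I won't guess at.)

    # show the current coalescing settings
    ethtool -c <ifname>

    # disable adaptive moderation and take an interrupt per packet
    ethtool -C <ifname> adaptive-rx off adaptive-tx off rx-usecs 0 rx-frames 1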
As you can see, RPC traffic is very different from bulk data transfer.

rick

> Btw, writing is a very different story than reading, largely due to the need
> to commit data/metadata to stable storage while writing.
>
> I can't help w.r.t. ZFS nor high performance nets (my fastest is 1Gbps), rick
>
> > You mention iperf. Please post the options you used when
> > invoking iperf and its output.
>
> Setting up the NFS client as a "server", since it seems that the
> terminology is a little bit flipped with iperf, here's the output:
>
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 10.5.5.1, port 11534
> [  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
> [  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
> [  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
> [  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
> [  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
> [  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
>
> On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
> >
> > ----- Adam Stylinski's Original Message -----
> > > Hello,
> > >
> > > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > > mode.  They have their MTUs maxed at 9000, their ring buffers maxed
> > > at 8192, and I can hit around 36 gbps with iperf.
> > >
> > > When using an NFS client (client = linux, server = freebsd), I see a
> > > maximum rate of around 20gbps.  The test file is fully in ARC.  The
> > > test is performed with an NFS mount with nconnect=4 and an rsize/wsize
> > > of 1MB.
> > >
> > > Here's the flame graph of the kernel of the system in question, with
> > > idle stacks removed:
> > >
> > > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> > >
> > > The longest function seems like maybe it's the ERMS-aware memcpy
> > > happening from the ARC?  Is there maybe a missing fast path that could
> > > take fewer copies into the socket buffer?
> >
> > Hi Adam -
> >
> > Some items to look at and possibly include for more responses....
> >
> > - What is your server system? Make/model/ram/etc. What is your
> >   overall 'top' cpu utilization 'top -aH' ...
> >
> > - It looks like you're using a 40gb/s card. Posting the output of
> >   'ifconfig -vm' would provide additional information.
> >
> > - Are the interfaces running cleanly? 'netstat -i' is helpful.
> >
> > - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
> >
> > - Inspect 'netstat -m'. Denied? Delayed?
> >
> > - You mention iperf. Please post the options you used when
> >   invoking iperf and its output.
> >
> > - You appear to be looking for throughput vs low latency. Have
> >   you looked at window size vs the amount of memory allocated to the
> >   streams?  These values vary based on the bit rate of the connection.
> >   TCP connections require outstanding un-ack'd data to be held, which
> >   affects the values below.
> >
> > - What are your values for (see the sysctl one-liner after this list):
> >
> >   -- kern.ipc.maxsockbuf
> >   -- net.inet.tcp.sendbuf_max
> >   -- net.inet.tcp.recvbuf_max
> >
> >   -- net.inet.tcp.sendspace
> >   -- net.inet.tcp.recvspace
> >
> >   -- net.inet.tcp.delayed_ack
> >
> > - What threads/irq are allocated to your NIC? 'vmstat -i'
> >
> > - Are the above threads floating or mapped? 'cpuset -g ...'
> >
> > - Determine best settings for LRO/TSO for your card.
> >
> > - Disable nfs tcp drc
> >
> > - What is your atime setting?
> >
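> >   For example, something like this on the FreeBSD server dumps the
> >   sysctl values asked about above in one go (and, if I have the knob
> >   right, the last line shows whether the NFS TCP duplicate request
> >   cache is in use):
> >
> >     sysctl kern.ipc.maxsockbuf net.inet.tcp.sendbuf_max \
> >         net.inet.tcp.recvbuf_max net.inet.tcp.sendspace \
> >         net.inet.tcp.recvspace net.inet.tcp.delayed_ack
> >     sysctl vfs.nfsd.cachetcp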
> >
> >   If you really think you have a ZFS/Kernel issue, and your
> > data fits in cache, dump ZFS, create a memory backed file system
> > and repeat your tests.  This will purge a large portion of your
> > graph.  LRO/TSO changes may do so also.
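> >   A minimal sketch of that test (the size, paths and file names below
> >   are placeholders; the export would need to be adjusted to match):
> >
> >     # memory backed file system big enough for the test file
> >     mdmfs -s 32g md /mnt/ramtest
> >     cp /pool/testfile /mnt/ramtest/
> >     # add /mnt/ramtest to /etc/exports, reload mountd, then re-run
> >     # the read test from the Linux client against the new export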
> >
> >   You also state you are using a Linux client. Are you using
> > the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> > Have you swapped the Linux system for a fbsd system?
> >
> >   And as a final note, I regularly use Chelsio T62100 cards
> > in dual home and/or LACP environments in Supermicro boxes with 100's
> > of nfs boot (Bhyve, QEMU, and physical system) clients per server
> > with no network starvation or cpu bottlenecks.  Clients boot, perform
> > their work, and then remotely request image rollback.
> >
> >
> >   Hopefully the above will help and provide pointers.
> >
> > Cheers
> >
>