Date: Wed, 25 May 2022 10:04:28 -0400
From: Adam Stylinski <kungfujesus06@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs/nfsd performance limiter
Message-ID: <CAJwHY9Vz4kQ=sTG5-KGYsAD3SFjuJdy4ihCK%2BcxyM1cdiQAU_g@mail.gmail.com>
In-Reply-To: <YQBPR0101MB97420C3CFE6F84D44B9A020FDDD49@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
References: <CAJwHY9WMOOLy=rb9FNjExQtYej21Zv=Po9Cbg=19gkw1SLFSww@mail.gmail.com>
  <YonqGfJST09cUV6W@FreeBSD.org>
  <CAJwHY9W-3eEXR%2BjTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com>
  <YQBPR0101MB97429323AD5F921BE76C613EDDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
  <CAJwHY9WHE4MFScuhry7v9MqRQBSTNY5XYCH5qfO4xEn6Swwtrw@mail.gmail.com>
  <YQBPR0101MB9742056AFEF03C6CAF2B7F56DDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
  <CAJwHY9VOkOAv5ewRTmiyNudMDn3%2Bju15O-NGQWeZumt%2Bg2J6=g@mail.gmail.com>
  <CAJwHY9VLLP76dATL3kdHA4DZ3y4%2B_dAKH3Y3fSKx9DwsOQmw7A@mail.gmail.com>
  <YQBPR0101MB97420C3CFE6F84D44B9A020FDDD49@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>

Obviously I don't expect you to know the answer for why Linux is hiding
this option in their sysfs tree, but it wasn't entirely rhetorical. If
you search for ways to optimize NFS throughput, the readahead option
usually doesn't show up unless you search "NFS readahead". I figured for
the multiplexed connection configuration to actually work it had to be
doing parallel reads in some fashion but it wasn't obvious that there
was another option client side that had to take place for it to work
(all of the little blurbs about the new nconnect feature seem to just
imply it works magically out of the box, even with 100gbps links). The
closest thing I could find is somebody complaining on bugzillas that at
one point it was too high and caused some thrashing for them so they
capped the auto calculated readahead to be 128KB max.

It's also a bit odd that Linux exposes this tuneable as a size rather
than a number of parallel reads of the multiple of rsize/wsize like
FreeBSD does.

Anyway, thanks for the help, CC'ing the list for posterity and to help
future people on 40gbps links that seem to hit that 20gbps wall.

On Sun, May 22, 2022 at 10:08 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> Adam Stylinski <kungfujesus06@gmail.com> wrote:
> > Good call on the readahead option. Evidently Linux by default sets
> > their NFS client to only 128kb for their readahead and the setting is
> > not a mount option but something buried in sysfs. Setting that to
> > 1024kb, I was able to get ~30ish gbps. Is there a reason this feature
> > is so scantily documented and well hidden?
> I'm not sure if this was meant to be a rhetorical question, but for FreeBSD
> it is described in "man mount_nfs" along with the rest of the mount
> options. Also, for FreeBSD, the default of 1 block of readahead is
> normally adequate for LAN network connections.
>
> I'll leave the answer w.r.t. Linux for others to ponder, rick
>
> > https://docs.microsoft.com/en-us/azure/azure-netapp-files/performance-linux-nfs-read-ahead
> >
> > On Sun, May 22, 2022 at 8:29 PM Adam Stylinski <kungfujesus06@gmail.com> wrote:
> >
> > I've actually seen this server drop interrupts on the floor when too
> > many things are hitting the NVMe interface and the mellanox NIC at the
> > same time, so I wouldn't be _too_ shocked if the ability to service
> > interrupts was a limiting factor in all of this.
> >
> > My ping is roughly 310-480 us:
> > 64 bytes from 10.5.5.1: icmp_seq=1 ttl=64 time=0.039 ms
> > 64 bytes from 10.5.5.1: icmp_seq=2 ttl=64 time=0.035 ms
> > 64 bytes from 10.5.5.1: icmp_seq=3 ttl=64 time=0.048 ms
> > 64 bytes from 10.5.5.1: icmp_seq=4 ttl=64 time=0.031 ms
> >
> > I'll have to look at the nfsmount man pages for Linux to see if I can
> > find a readahead parameter.
> >
> > On Sun, May 22, 2022 at 6:26 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
> > >
> > > Adam Stylinski <kungfujesus06@gmail.com> wrote:
> > > [stuff snipped]
> > > >
> > > > However, in general, RPC RTT will define how well NFS performs and not
> > > > the I/O rate for a bulk file read/write.
> > > Let's take this RPC RTT thing a step further...
> > > - If I got the math right, at 40Gbps, 1Mbyte takes about 200usec on the wire.
> > > Without readahead, the protocol looks like this:
> > > Client                          Server (time going down the screen)
> > >     small Read request --->
> > >                          <-- 1Mbyte reply
> > >     small Read request -->
> > >                          <-- 1Mbyte reply
> > > The 1Mbyte replies take 200usec on the wire.
> > >
> > > Then suppose your ping time is 400usec (I see about 350usec on my little lan).
> > > - The wire is only transferring data about half of the time, because the small
> > >   request message takes almost as long as the 1Mbyte reply.
> > >
> > > As you can see, readahead (where multiple reads are done concurrently)
> > > is critical for this case. I have no idea how Linux decides to do readahead.
> > > (FreeBSD defaults to 1 readahead, with a mount option that can increase
> > > that.)
> > >
> > > Now, net interfaces normally do interrupt moderation. This is done to
> > > avoid an interrupt storm during bulk data transfer. However, interrupt
> > > moderation results in interrupt delay for handling the small Read request
> > > message.
> > > --> Interrupt moderation can increase RPC RTT. Turning it off, if possible,
> > >     might help.
> > >
> > > So, ping the server from the client to see what your RTT roughly is.
> > > Also, you could look at some traffic in wireshark, to see what readahead
> > > is happening and what the RPC RTT is.
> > > (You can capture with "tcpdump", but wireshark knows how to decode
> > > NFS properly.)
> > >
> > > As you can see, RPC traffic is very different from bulk data transfer.
> > >
> > > rick
> > >
> > > > Btw, writing is a very different story than reading, largely due to the need
> > > > to commit data/metadata to stable storage while writing.
> > > >
> > > > I can't help w.r.t. ZFS nor high performance nets (my fastest is 1Gbps), rick
> > > >
> > > > > You mention iperf. Please post the options you used when invoking iperf and its output.
> > > >
> > > > Setting up the NFS client as a "server", since it seems that the
> > > > terminology is a little bit flipped with iperf, here's the output:
> > > >
> > > > -----------------------------------------------------------
> > > > Server listening on 5201 (test #1)
> > > > -----------------------------------------------------------
> > > > Accepted connection from 10.5.5.1, port 11534
> > > > [  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
> > > > [ ID] Interval           Transfer     Bitrate
> > > > [  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
> > > > [  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
> > > > [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
> > > > [  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
> > > > [  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
> > > > [  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
> > > > [  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
> > > > [  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
> > > > [  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
> > > > [  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
> > > > [  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate
> > > > [  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
> > > > -----------------------------------------------------------
> > > > Server listening on 5201 (test #2)
> > > > -----------------------------------------------------------
> > > >
> > > > On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
> > > > >
> > > > > ----- Adam Stylinski's Original Message -----
> > > > > > Hello,
> > > > > >
> > > > > > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > > > > > mode. They have their MTUs maxed at 9000, their ring buffers maxed
> > > > > > at 8192, and I can hit around 36 gbps with iperf.
> > > > > >
> > > > > > When using an NFS client (client = linux, server = freebsd), I see a
> > > > > > maximum rate of around 20gbps. The test file is fully in ARC. The
> > > > > > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > > > > > 1MB.
> > > > > >
> > > > > > Here's the flame graph of the kernel of the system in question, with
> > > > > > idle stacks removed:
> > > > > >
> > > > > > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> > > > > >
> > > > > > The longest function seems like maybe it's the ERMS aware memcpy
> > > > > > happening from the ARC? Is there maybe a missing fast path that could
> > > > > > take fewer copies into the socket buffer?
> > > > >
> > > > > Hi Adam -
> > > > >
> > > > > Some items to look at and possibly include for more responses....
> > > > >
> > > > > - What is your server system? Make/model/ram/etc. What is your
> > > > >   overall 'top' cpu utilization 'top -aH' ...
> > > > >
> > > > > - It looks like you're using a 40gb/s card. Posting the output of
> > > > >   'ifconfig -vm' would provide additional information.
> > > > >
> > > > > - Are the interfaces running cleanly? 'netstat -i' is helpful.
> > > > >
> > > > > - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
> > > > >
> > > > > - Inspect 'netstat -m'. Denied? Delayed?
> > > > >
> > > > >
> > > > > - You mention iperf. Please post the options you used when
> > > > >   invoking iperf and its output.
> > > > >
> > > > > - You appear to be looking for through-put vs low-latency. Have
> > > > >   you looked at window-size vs the amount of memory allocated to the
> > > > >   streams. These values vary based on the bit-rate of the connection.
> > > > >   Tcp connections require outstanding un-ack'd data to be held.
> > > > >   Affects values below.
> > > > >
> > > > >
> > > > > - What are your values for:
> > > > >
> > > > >   -- kern.ipc.maxsockbuf
> > > > >   -- net.inet.tcp.sendbuf_max
> > > > >   -- net.inet.tcp.recvbuf_max
> > > > >
> > > > >   -- net.inet.tcp.sendspace
> > > > >   -- net.inet.tcp.recvspace
> > > > >
> > > > >   -- net.inet.tcp.delayed_ack
> > > > >
> > > > > - What threads/irq are allocated to your NIC? 'vmstat -i'
> > > > >
> > > > > - Are the above threads floating or mapped? 'cpuset -g ...'
> > > > >
> > > > > - Determine best settings for LRO/TSO for your card.
> > > > >
> > > > > - Disable nfs tcp drc
> > > > >
> > > > > - What is your atime setting?
> > > > >
> > > > >
> > > > > If you really think you have a ZFS/Kernel issue, and your
> > > > > data fits in cache, dump ZFS, create a memory backed file system
> > > > > and repeat your tests. This will purge a large portion of your
> > > > > graph. LRO/TSO changes may do so also.
> > > > >
> > > > > You also state you are using a Linux client. Are you using
> > > > > the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> > > > > Have you swapped the Linux system for a fbsd system?
> > > > >
> > > > > And as a final note, I regularly use Chelsio T62100 cards
> > > > > in dual home and/or LACP environments in Supermicro boxes with 100's
> > > > > of nfs boot (Bhyve, QEMU, and physical system) clients per server
> > > > > with no network starvation or cpu bottlenecks. Clients boot, perform
> > > > > their work, and then remotely request image rollback.
> > > > >
> > > > >
> > > > > Hopefully the above will help and provide pointers.
> > > > >
> > > > > Cheers
> > > >

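For anyone landing here later, the back-of-envelope arithmetic quoted
above (1Mbyte taking about 200usec on a 40Gbps wire, with a 400usec
ping used as the example RTT) can be turned into a rough estimate of how
many reads need to be in flight. This is only a sketch under assumptions
that are not from the thread: each read is taken to cost one full RTT
plus the wire time of its reply, and concurrent reads are taken to
overlap perfectly until the link saturates. It ignores nconnect,
server-side latency and interrupt moderation, and the function names
are made up for illustration.

```python
def wire_time_s(read_bytes: int, link_bps: float) -> float:
    """Seconds the link spends carrying one read reply of read_bytes."""
    return read_bytes * 8 / link_bps


def nfs_read_gbps(link_bps: float, rtt_s: float, read_bytes: int,
                  reads_in_flight: int) -> float:
    """Estimated read throughput in Gbit/s.

    Assumed model (a simplification, not how any NFS client is
    implemented): one read costs rtt_s plus its wire time, and
    reads_in_flight such reads overlap perfectly, capped at link speed.
    """
    per_read_s = rtt_s + wire_time_s(read_bytes, link_bps)
    offered_bps = reads_in_flight * read_bytes * 8 / per_read_s
    return min(link_bps, offered_bps) / 1e9


if __name__ == "__main__":
    LINK_BPS = 40e9    # 40Gbps link, as in the thread
    RTT_S = 400e-6     # the hypothetical 400usec ping from the quoted math
    RSIZE = 1 << 20    # 1MB rsize

    print(f"wire time per read: {wire_time_s(RSIZE, LINK_BPS) * 1e6:.0f} usec")
    for n in (1, 2, 4, 8):
        gbps = nfs_read_gbps(LINK_BPS, RTT_S, RSIZE, n)
        print(f"{n} read(s) in flight: ~{gbps:.1f} Gbit/s")
```

Plug in your own measured RTT and rsize rather than the example values.
On Linux the readahead knob is a size, so the number of reads in flight
would presumably be roughly the readahead size divided by rsize, but
that mapping is a guess and worth confirming in wireshark.
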
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJwHY9Vz4kQ=sTG5-KGYsAD3SFjuJdy4ihCK%2BcxyM1cdiQAU_g>