Date: Sun, 22 May 2022 10:20:37 -0400
From: Adam Stylinski <kungfujesus06@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: John <jwd@freebsd.org>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zfs/nfsd performance limiter
Message-ID: <CAJwHY9WHE4MFScuhry7v9MqRQBSTNY5XYCH5qfO4xEn6Swwtrw@mail.gmail.com>
In-Reply-To: <YQBPR0101MB97429323AD5F921BE76C613EDDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
References: <CAJwHY9WMOOLy=rb9FNjExQtYej21Zv=Po9Cbg=19gkw1SLFSww@mail.gmail.com> <YonqGfJST09cUV6W@FreeBSD.org> <CAJwHY9W-3eEXR%2BjTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com> <YQBPR0101MB97429323AD5F921BE76C613EDDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
hw.mlxen0.conf.tx_size: 8192
hw.mlxen0.conf.rx_size: 8192
hw.mlxen0.conf.tx_rings: 6
hw.mlxen0.conf.rx_rings: 4

(So, should I use 6 connections?)

I tried to eliminate ZFS from the equation by exporting a tmpfs backed
file system but I got IO errors when I tried to stat the mount point
from the Linux client (perhaps a bug with tmpfs and mountd?).

> If you have not already done so, do a "nfsstat -m" on the client to find
> out what options it is actually using (works on both Linux and FreeBSD).

/mnt/nasshare from 10.5.5.1:/mnt/share
 Flags: rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=120,acregmax=120,acdirmin=120,acdirmax=120,hard,proto=tcp,nconnect=4,timeo=600,retrans=2,sec=sys,clientaddr=10.5.5.4,fsc,local_lock=none,addr=10.5.5.1

I had tried prior to this the NFSv3 server but found that it actually
performed worse than the v4 one.

On Sun, May 22, 2022 at 10:12 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> Adam Stylinski <kungfujesus06@gmail.com> wrote:
> > jwd wrote:
> > > What is your server system? Make/model/ram/etc.
> > Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (6 cores, a little starved
> > on the clock but the load at least is basically zero during this test)
> > 128GB of memory
> >
> > > top -aH
> > During the copy load (for brevity, only did the real top contenders
> > for CPU here):
> >
> > last pid: 15560;  load averages: 0.25, 0.39, 0.27    up 4+15:48:54  09:17:38
> > 98 threads: 2 running, 96 sleeping
> > CPU: 0.0% user, 0.0% nice, 19.1% system, 5.6% interrupt, 75.3% idle
> > Mem: 12M Active, 4405M Inact, 8284K Laundry, 115G Wired, 1148M Buf, 4819M Free
> > ARC: 98G Total, 80G MFU, 15G MRU, 772K Anon, 1235M Header, 1042M Other
> >      91G Compressed, 189G Uncompressed, 2.09:1 Ratio
> > Swap: 5120M Total, 5120M Free
> >
> >   PID USERNAME  PRI NICE  SIZE    RES STATE   C   TIME    WCPU COMMAND
> >  3830 root       20    0   12M  2700K rpcsvc  2   1:16  53.26% nfsd: server (nfsd){nfsd: service}
> >  3830 root       20    0   12M  2700K CPU5    5   5:42  52.96% nfsd: server (nfsd){nfsd: master}
> > 15560 adam       20    0   17M  5176K CPU2    2   0:00   0.12% top -aH
> >  1493 root       20    0   13M  2260K select  3   0:36   0.01% /usr/sbin/powerd
> >  1444 root       20    0   75M  2964K select  5   0:19   0.01% /usr/sbin/mountd -r /etc/exports /etc/zfs/exports
> >  1215 uucp       20    0   13M  2820K select  5   0:27   0.01% /usr/local/libexec/nut/usbhid-ups -a cyberpower
> > 93424 adam       20    0   21M  9900K select  0   0:00   0.01% sshd: adam@pts/0 (sshd)
> >
> > > ifconfig -vm
> > mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
> >         options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
> >         capabilities=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
> >         ether 00:02:c9:35:df:20
> >         inet 10.5.5.1 netmask 0xffffff00 broadcast 10.5.5.255
> >         media: Ethernet autoselect (40Gbase-CR4 <full-duplex,rxpause,txpause>)
> >         status: active
> >         supported media:
> >                 media autoselect
> >                 media 40Gbase-CR4 mediaopt full-duplex
> >                 media 10Gbase-CX4 mediaopt full-duplex
> >                 media 10Gbase-SR mediaopt full-duplex
> >                 media 1000baseT mediaopt full-duplex
> >         nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> >         plugged: QSFP+ 40GBASE-CR4 (No separable connector)
> >         vendor: Mellanox PN: MC2207130-002 SN: MT1419VS07971 DATE: 2014-06-06
> >         module temperature: 0.00 C voltage: 0.00 Volts
> >         lane 1: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> >         lane 2: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> >         lane 3: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> >         lane 4: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> >
> > > - What are your values for:
> > >
> > > -- kern.ipc.maxsockbuf
> > > -- net.inet.tcp.sendbuf_max
> > > -- net.inet.tcp.recvbuf_max
> > >
> > > -- net.inet.tcp.sendspace
> > > -- net.inet.tcp.recvspace
> > >
> > > -- net.inet.tcp.delayed_ack
> > kern.ipc.maxsockbuf: 16777216
> > net.inet.tcp.sendbuf_max: 16777216
> > net.inet.tcp.recvbuf_max: 16777216
> > net.inet.tcp.sendspace: 32768 # This is interesting? I'm not sure why
> > the discrepancy here
> > net.inet.tcp.recvspace: 4194304
> > net.inet.tcp.delayed_ack: 0
> >
> > > netstat -i
> > Name    Mtu Network        Address              Ipkts Ierrs Idrop     Opkts Oerrs  Coll
> > igb0   9000 <Link#1>       ac:1f:6b:b0:60:bc 18230625     0     0  24178283     0     0
> > igb1   9000 <Link#2>       ac:1f:6b:b0:60:bc 14341213     0     0   8447249     0     0
> > lo0   16384 <Link#3>       lo0                 367691     0     0    367691     0     0
> > lo0       - localhost      localhost               68     -     -        68     -     -
> > lo0       - fe80::%lo0/64  fe80::1%lo0              0     -     -         0     -     -
> > lo0       - your-net       localhost           348944     -     -    348944     -     -
> > mlxen  9000 <Link#4>       00:02:c9:35:df:20 13138046     0    12  26308206     0     0
> > mlxen     - 10.5.5.0/24    10.5.5.1          11592389     -     -  24345184     -     -
> > vm-pu  9000 <Link#6>       56:3e:55:8a:2a:f8     7270     0     0    962249   102     0
> > lagg0  9000 <Link#5>       ac:1f:6b:b0:60:bc 31543941     0     0  31623674     0     0
> > lagg0     - 192.168.0.0/2  nasbox            27967582     -     -  41779731     -     -
> >
> > > What threads/irq are allocated to your NIC?
> > > 'vmstat -i'
> >
> > Doesn't seem perfectly balanced but not terribly imbalanced, either:
> >
> > interrupt                  total       rate
> > irq9: acpi0                    3          0
> > irq18: ehci0 ehci1+       803162          2
> > cpu0:timer              67465114        167
> > cpu1:timer              65068819        161
> > cpu2:timer              65535300        163
> > cpu3:timer              63408731        157
> > cpu4:timer              63026304        156
> > cpu5:timer              63431412        157
> > irq56: nvme0:admin            18          0
> > irq57: nvme0:io0          544999          1
> > irq58: nvme0:io1          465816          1
> > irq59: nvme0:io2          487486          1
> > irq60: nvme0:io3          474616          1
> > irq61: nvme0:io4          452527          1
> > irq62: nvme0:io5          467807          1
> > irq63: mps0             36110415         90
> > irq64: mps1            112328723        279
> > irq65: mps2             54845974        136
> > irq66: mps3             50770215        126
> > irq68: xhci0             3122136          8
> > irq70: igb0:rxq0         1974562          5
> > irq71: igb0:rxq1         3034190          8
> > irq72: igb0:rxq2        28703842         71
> > irq73: igb0:rxq3         1126533          3
> > irq74: igb0:aq                 7          0
> > irq75: igb1:rxq0         1852321          5
> > irq76: igb1:rxq1         2946722          7
> > irq77: igb1:rxq2         9602613         24
> > irq78: igb1:rxq3         4101258         10
> > irq79: igb1:aq                 8          0
> > irq80: ahci1            37386191         93
> > irq81: mlx4_core0        4748775         12
> > irq82: mlx4_core0       13754442         34
> > irq83: mlx4_core0        3551629          9
> > irq84: mlx4_core0        2595850          6
> > irq85: mlx4_core0        4947424         12
> > Total                  769135944       1908
> >
> > > Are the above threads floating or mapped? 'cpuset -g ...'
> >
> > I suspect I was supposed to run this against the argument of a pid,
> > maybe nfsd? Here's the output without an argument
> >
> > pid -1 mask: 0, 1, 2, 3, 4, 5
> > pid -1 domain policy: first-touch mask: 0
> >
> > > Disable nfs tcp drc
> >
> > This is the first I've even seen a duplicate request cache mentioned.
> > It seems counter-intuitive for why that'd help but maybe I'll try
> > doing that. What exactly is the benefit?
> The DRC improves correctness for NFSv3 and NFSv4.0 mounts. It is a
> performance hit. However, for a read mostly load it won't add too
> much overhead.
> Turning it off increases the likelihood of data corruption
> due to retried non-idempotent RPCs, but the failure will be rare over TCP.
>
> If your mount is NFSv4.1 or 4.2, the DRC is not used, so don't worry about it.
>
> > > What is your atime setting?
> >
> > Disabled at both the file system and the client mounts.
> >
> > > You also state you are using a Linux client. Are you using the MLX affinity
> > > scripts, buffer sizing suggestions, etc, etc. Have you swapped the Linux system for a fbsd system?
> > I've not, though I do vaguely recall mellanox supplying some scripts
> > in their documentation that fixed interrupt handling on specific cores
> > at one point. Is this what you're referring to? I could give that a
> > try. I don't at present have any FreeBSD client systems with enough
> > PCI express bandwidth to swap things out for a Linux vs FreeBSD test.
> If you have not already done so, do a "nfsstat -m" on the client to find
> out what options it is actually using (works on both Linux and FreeBSD).
>
> If the Linux client has a way of manually adjusting readahead, then try
> increasing it. (FreeBSD has a "readahead" mount option, but I can't recall
> if Linux has one?)
>
> You can try mounting the server on the server, but that will use lo0 and not
> the mellanox, so it might be irrelevant.
>
> Also, I don't know how many queues the mellanox driver used. You'd want
> an "nconnect" at least as high as the number of queues, since each TCP
> connection will be serviced by one queue and that limits its bandwidth.
>
> However, in general, RPC RTT will define how well NFS performs and not
> the I/O rate for a bulk file read/write.
> Btw, writing is a very different story than reading, largely due to the need
> to commit data/metadata to stable storage while writing.
>
> I can't help w.r.t. ZFS nor high performance nets (my fastest is 1Gbps), rick
>
> > You mention iperf. Please post the options you used when invoking iperf and its output.
>
> Setting up the NFS client as a "server", since it seems that the
> terminology is a little bit flipped with iperf, here's the output:
>
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 10.5.5.1, port 11534
> [  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
> [  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
> [  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
> [  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
> [  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
> [  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec    receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
>
> On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
> >
> > ----- Adam Stylinski's Original Message -----
> > > Hello,
> > >
> > > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > > mode. They have their MTU's maxed at 9000, their ring buffers maxed
> > > at 8192, and I can hit around 36 gbps with iperf.
> > >
> > > When using an NFS client (client = linux, server = freebsd), I see a
> > > maximum rate of around 20gbps. The test file is fully in ARC. The
> > > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > > 1MB.
> > >
> > > Here's the flame graph of the kernel of the system in question, with
> > > idle stacks removed:
> > >
> > > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> > >
> > > The longest function seems like maybe it's the ERMS aware memcpy
> > > happening from the ARC? Is there maybe a missing fast path that could
> > > take fewer copies into the socket buffer?
> >
> > Hi Adam -
> >
> > Some items to look at and possibly include for more responses....
> >
> > - What is your server system? Make/model/ram/etc. What is your
> > overall 'top' cpu utilization 'top -aH' ...
> >
> > - It looks like you're using a 40gb/s card. Posting the output of
> > 'ifconfig -vm' would provide additional information.
> >
> > - Are the interfaces running cleanly? 'netstat -i' is helpful.
> >
> > - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
> >
> > - Inspect 'netstat -m'. Denied? Delayed?
> >
> >
> > - You mention iperf. Please post the options you used when
> > invoking iperf and its output.
> >
> > - You appear to be looking for through-put vs low-latency. Have
> > you looked at window-size vs the amount of memory allocated to the
> > streams. These values vary based on the bit-rate of the connection.
> > Tcp connections require outstanding un-ack'd data to be held.
> > Affects values below.
> >
> >
> > - What are your values for:
> >
> > -- kern.ipc.maxsockbuf
> > -- net.inet.tcp.sendbuf_max
> > -- net.inet.tcp.recvbuf_max
> >
> > -- net.inet.tcp.sendspace
> > -- net.inet.tcp.recvspace
> >
> > -- net.inet.tcp.delayed_ack
> >
> > - What threads/irq are allocated to your NIC? 'vmstat -i'
> >
> > - Are the above threads floating or mapped? 'cpuset -g ...'
> >
> > - Determine best settings for LRO/TSO for your card.
> >
> > - Disable nfs tcp drc
> >
> > - What is your atime setting?
> >
> >
> > If you really think you have a ZFS/Kernel issue, and your
> > data fits in cache, dump ZFS, create a memory backed file system
> > and repeat your tests. This will purge a large portion of your
> > graph. LRO/TSO changes may do so also.
> >
> > You also state you are using a Linux client. Are you using
> > the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> > Have you swapped the Linux system for a fbsd system?
> >
> > And as a final note, I regularly use Chelsio T62100 cards
> > in dual home and/or LACP environments in Supermicro boxes with 100's
> > of nfs boot (Bhyve, QEMU, and physical system) clients per server
> > with no network starvation or cpu bottlenecks. Clients boot, perform
> > their work, and then remotely request image rollback.
> >
> >
> > Hopefully the above will help and provide pointers.
> >
> > Cheers
> >
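
A rough sketch of the nconnect check suggested in the quoted reply
("an nconnect at least as high as the number of queues"), using only the
values posted in this message. The helper names parse_mount_flags and
parse_sysctl are hypothetical, and taking the larger of tx_rings and
rx_rings as "the number of queues" is an assumption, not something the
thread settles.

```python
# Sketch only: compare the client's nconnect with the server NIC ring counts.
# The input strings are copied from this thread; helper names are hypothetical.

def parse_mount_flags(flags: str) -> dict:
    """Split an 'nfsstat -m' Flags string into {option: value or True}."""
    opts = {}
    for item in flags.split(","):
        key, sep, value = item.partition("=")
        opts[key] = value if sep else True  # bare options such as 'hard'
    return opts


def parse_sysctl(text: str) -> dict:
    """Parse 'name: value' lines as printed by sysctl into {name: int}."""
    out = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition(":")
        out[name.strip()] = int(value)
    return out


FLAGS = (
    "rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,"
    "acregmin=120,acregmax=120,acdirmin=120,acdirmax=120,hard,proto=tcp,"
    "nconnect=4,timeo=600,retrans=2,sec=sys,clientaddr=10.5.5.4,fsc,"
    "local_lock=none,addr=10.5.5.1"
)

SYSCTL = """
hw.mlxen0.conf.tx_size: 8192
hw.mlxen0.conf.rx_size: 8192
hw.mlxen0.conf.tx_rings: 6
hw.mlxen0.conf.rx_rings: 4
"""

opts = parse_mount_flags(FLAGS)
rings = parse_sysctl(SYSCTL)

# A mount without an explicit nconnect uses a single TCP connection.
nconnect = int(opts.get("nconnect", 1))

# Assumption: size nconnect against the larger of the two ring counts.
wanted = max(rings["hw.mlxen0.conf.tx_rings"],
             rings["hw.mlxen0.conf.rx_rings"])

if nconnect < wanted:
    print(f"nconnect={nconnect} is below {wanted} rings; "
          f"try remounting with nconnect={wanted}")
else:
    print(f"nconnect={nconnect} already covers {wanted} rings")
```

With the values above this reports that nconnect=4 is below the 6 transmit
rings, which matches the "should I use 6 connections?" question; whether
that closes the gap to the iperf number is something only a retest can show.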
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJwHY9WHE4MFScuhry7v9MqRQBSTNY5XYCH5qfO4xEn6Swwtrw>