Date: Wed, 25 May 2022 15:41:48 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Adam Stylinski <kungfujesus06@gmail.com>, John <jwd@freebsd.org>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zfs/nfsd performance limiter
Message-ID: <YQBPR0101MB9742A3D546254D116DAA416CDDD69@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAJwHY9W-3eEXR+jTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com>
References: <CAJwHY9WMOOLy=rb9FNjExQtYej21Zv=Po9Cbg=19gkw1SLFSww@mail.gmail.com>
 <YonqGfJST09cUV6W@FreeBSD.org>
 <CAJwHY9W-3eEXR+jTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com>
Adam Stylinski <kungfujesus06@gmail.com> wrote:
[stuff snipped]

> > ifconfig -vm
> mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
Just in case you (or someone else reading this) are not aware of it,
use of 9K jumbo clusters causes fragmentation of the memory pool that
the clusters are allocated from and, therefore, their use is not
recommended.

Now, it may be that the Mellanox driver doesn't use 9K clusters (it
could put the received frame in multiple smaller clusters), but if it
does, you should consider reducing the MTU.
If you run:
# vmstat -z | fgrep mbuf_jumbo_9k
it will show you whether they are being used.

rick


> netstat -i
Name    Mtu   Network        Address               Ipkts  Ierrs  Idrop     Opkts  Oerrs  Coll
igb0    9000  <Link#1>       ac:1f:6b:b0:60:bc  18230625      0      0  24178283      0     0
igb1    9000  <Link#2>       ac:1f:6b:b0:60:bc  14341213      0      0   8447249      0     0
lo0    16384  <Link#3>       lo0                  367691      0      0    367691      0     0
lo0        -  localhost      localhost                68      -      -        68      -     -
lo0        -  fe80::%lo0/64  fe80::1%lo0               0      -      -         0      -     -
lo0        -  your-net       localhost            348944      -      -    348944      -     -
mlxen   9000  <Link#4>       00:02:c9:35:df:20  13138046      0     12  26308206      0     0
mlxen      -  10.5.5.0/24    10.5.5.1           11592389      -      -  24345184      -     -
vm-pu   9000  <Link#6>       56:3e:55:8a:2a:f8      7270      0      0    962249    102     0
lagg0   9000  <Link#5>       ac:1f:6b:b0:60:bc  31543941      0      0  31623674      0     0
lagg0      -  192.168.0.0/2  nasbox             27967582      -      -  41779731      -     -

> What threads/irq are allocated to your NIC? 'vmstat -i'

Doesn't seem perfectly balanced, but not terribly imbalanced, either:

interrupt                          total       rate
irq9: acpi0                            3          0
irq18: ehci0 ehci1+               803162          2
cpu0:timer                      67465114        167
cpu1:timer                      65068819        161
cpu2:timer                      65535300        163
cpu3:timer                      63408731        157
cpu4:timer                      63026304        156
cpu5:timer                      63431412        157
irq56: nvme0:admin                    18          0
irq57: nvme0:io0                  544999          1
irq58: nvme0:io1                  465816          1
irq59: nvme0:io2                  487486          1
irq60: nvme0:io3                  474616          1
irq61: nvme0:io4                  452527          1
irq62: nvme0:io5                  467807          1
irq63: mps0                     36110415         90
irq64: mps1                    112328723        279
irq65: mps2                     54845974        136
irq66: mps3                     50770215        126
irq68: xhci0                     3122136          8
irq70: igb0:rxq0                 1974562          5
irq71: igb0:rxq1                 3034190          8
irq72: igb0:rxq2                28703842         71
irq73: igb0:rxq3                 1126533          3
irq74: igb0:aq                         7          0
irq75: igb1:rxq0                 1852321          5
irq76: igb1:rxq1                 2946722          7
irq77: igb1:rxq2                 9602613         24
irq78: igb1:rxq3                 4101258         10
irq79: igb1:aq                         8          0
irq80: ahci1                    37386191         93
irq81: mlx4_core0                4748775         12
irq82: mlx4_core0               13754442         34
irq83: mlx4_core0                3551629          9
irq84: mlx4_core0                2595850          6
irq85: mlx4_core0                4947424         12
Total                          769135944       1908

> Are the above threads floating or mapped? 'cpuset -g ...'

I suspect I was supposed to run this against the argument of a pid,
maybe nfsd? Here's the output without an argument:

pid -1 mask: 0, 1, 2, 3, 4, 5
pid -1 domain policy: first-touch mask: 0

> Disable nfs tcp drc

This is the first I've ever seen a duplicate request cache mentioned.
It seems counter-intuitive that it would help, but maybe I'll try
doing that. What exactly is the benefit?
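
For what it's worth, the server-side knob for trying that suggestion
is, as far as I know, a sysctl; the names below are from memory rather
than from this thread, so verify them with 'sysctl -d' before relying
on them:

# show the DRC-related knobs and their descriptions first
sysctl -d vfs.nfsd.cachetcp vfs.nfsd.tcphighwater
# 0 = do not add requests arriving over TCP to the duplicate request cache
sysctl vfs.nfsd.cachetcp=0

The benefit usually cited is that TCP itself already suppresses most
retransmitted requests, so skipping the cache avoids per-RPC
bookkeeping and lock contention on a busy server.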

> What is your atime setting?

Disabled at both the file system and the client mounts.

> You also state you are using a Linux client. Are you using the MLX
> affinity scripts, buffer sizing suggestions, etc., etc. Have you
> swapped the Linux system for a FreeBSD system?

I've not, though I do vaguely recall Mellanox supplying some scripts
in their documentation that fixed interrupt handling on specific cores
at one point. Is this what you're referring to? I could give that a
try. I don't at present have any FreeBSD client systems with enough
PCI Express bandwidth to swap things out for a Linux vs FreeBSD test.

> You mention iperf. Please post the options you used when invoking
> iperf and its output.

Setting up the NFS client as a "server", since it seems that the
terminology is a little bit flipped with iperf, here's the output:

-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 10.5.5.1, port 11534
[ 5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
[ ID] Interval           Transfer     Bitrate
[ 5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
[ 5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
[ 5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
[ 5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
[ 5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
[ 5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
[ 5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
[ 5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
[ 5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
[ 5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
[ 5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[ 5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec    receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
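
The iperf options themselves weren't included above. Purely as an
illustration (assuming iperf3, which the output format suggests, with
10.5.5.4 being the Linux client and 10.5.5.1 the FreeBSD server), an
equivalent pair of invocations would look roughly like:

# on the Linux client (10.5.5.4), listen as the iperf3 "server"
iperf3 -s -p 5201
# on the FreeBSD box (10.5.5.1), push a single TCP stream for 10 seconds
iperf3 -c 10.5.5.4 -p 5201 -t 10
# or four parallel streams, to roughly mirror the nconnect=4 NFS mount
iperf3 -c 10.5.5.4 -p 5201 -t 10 -P 4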

On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
>
> ----- Adam Stylinski's Original Message -----
> > Hello,
> >
> > I have two systems connected via ConnectX-3 Mellanox cards in
> > Ethernet mode. They have their MTUs maxed at 9000, their ring
> > buffers maxed at 8192, and I can hit around 36 Gbps with iperf.
> >
> > When using an NFS client (client = Linux, server = FreeBSD), I see a
> > maximum rate of around 20 Gbps. The test file is fully in ARC. The
> > test is performed with an NFS mount with nconnect=4 and an
> > rsize/wsize of 1MB.
> >
> > Here's the flame graph of the kernel of the system in question, with
> > idle stacks removed:
> >
> > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> >
> > The longest function seems like maybe it's the ERMS-aware memcpy
> > happening from the ARC? Is there maybe a missing fast path that
> > could take fewer copies into the socket buffer?
>
> Hi Adam -
>
> Some items to look at and possibly include for more responses....
>
> - What is your server system? Make/model/RAM/etc. What is your
>   overall CPU utilization ('top -aH') ...
>
> - It looks like you're using a 40 Gb/s card. Posting the output of
>   'ifconfig -vm' would provide additional information.
>
> - Are the interfaces running cleanly? 'netstat -i' is helpful.
>
> - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
>
> - Inspect 'netstat -m'. Denied? Delayed?
>
> - You mention iperf. Please post the options you used when
>   invoking iperf and its output.
>
> - You appear to be looking for throughput rather than low latency.
>   Have you looked at window size vs the amount of memory allocated to
>   the streams? These values vary based on the bit-rate of the
>   connection. TCP connections require outstanding un-ack'd data to be
>   held. This affects the values below.
>
> - What are your values for:
>
>   -- kern.ipc.maxsockbuf
>   -- net.inet.tcp.sendbuf_max
>   -- net.inet.tcp.recvbuf_max
>
>   -- net.inet.tcp.sendspace
>   -- net.inet.tcp.recvspace
>
>   -- net.inet.tcp.delayed_ack
>
> - What threads/irq are allocated to your NIC? 'vmstat -i'
>
> - Are the above threads floating or mapped? 'cpuset -g ...'
>
> - Determine the best settings for LRO/TSO for your card.
>
> - Disable the NFS TCP DRC.
>
> - What is your atime setting?
>
>
> If you really think you have a ZFS/kernel issue, and your
> data fits in cache, dump ZFS, create a memory-backed file system
> and repeat your tests. This will purge a large portion of your
> graph. LRO/TSO changes may do so also.
>
> You also state you are using a Linux client. Are you using
> the MLX affinity scripts, buffer sizing suggestions, etc., etc.
> Have you swapped the Linux system for a FreeBSD system?
>
> And as a final note, I regularly use Chelsio T62100 cards
> in dual-homed and/or LACP environments in Supermicro boxes with
> hundreds of NFS-boot (bhyve, QEMU, and physical system) clients per
> server with no network starvation or CPU bottlenecks. Clients boot,
> perform their work, and then remotely request image rollback.
>
>
> Hopefully the above will help and provide pointers.
>
> Cheers
>
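
As a footnote to John's checklist, the buffer-related values he asks
about can be read in one shot on the FreeBSD side:

# dump the buffer and delayed-ack knobs from the checklist above
sysctl kern.ipc.maxsockbuf \
       net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max \
       net.inet.tcp.sendspace net.inet.tcp.recvspace \
       net.inet.tcp.delayed_ack

As a rough sanity check on the sizing: at 40 Gbit/s with an assumed
0.1 ms LAN round-trip time, the bandwidth-delay product is about
(40e9 / 8) * 0.0001, i.e. ~500 kB per connection, or around 2 MB of
un-ack'd data in flight across an nconnect=4 mount, so
kern.ipc.maxsockbuf and the *buf_max values need to sit comfortably
above that.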