Date:      Wed, 25 May 2022 15:41:48 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Adam Stylinski <kungfujesus06@gmail.com>, John <jwd@freebsd.org>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: zfs/nfsd performance limiter
Message-ID:  <YQBPR0101MB9742A3D546254D116DAA416CDDD69@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAJwHY9W-3eEXR+jTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com>
References:  <CAJwHY9WMOOLy=rb9FNjExQtYej21Zv=Po9Cbg=19gkw1SLFSww@mail.gmail.com> <YonqGfJST09cUV6W@FreeBSD.org> <CAJwHY9W-3eEXR+jTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com>

Adam Stylinski <kungfujesus06@gmail.com> wrote:
[stuff snipped]

> > ifconfig -vm
> mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
Just in case you (or someone else reading this) aren't aware of it,
use of 9K jumbo clusters causes fragmentation of the memory pool the
clusters are allocated from and, therefore, their use is not
recommended.

Now, it may be that the Mellanox driver doesn't use 9K clusters (it
could put the received frame in multiple smaller clusters), but if it
does, you should consider reducing the mtu.
If you run:
# vmstat -z | fgrep mbuf_jumbo_9k
it will show you whether they are being used.
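
If they are in use, the workaround is to drop the mtu below the point
where the driver needs 9K clusters.  A sketch of what that looks like
(assuming mlxen0 is the interface; I don't know the exact mtu at which
the mlx4 driver switches to page-size clusters, so the 4000 here is
just a guess on my part):

# ifconfig mlxen0 mtu 4000

The driver should tear down and re-create its receive rings with the
smaller clusters when the mtu changes.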

rick

> netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
igb0   9000 <Link#1>      ac:1f:6b:b0:60:bc 18230625     0     0 24178283     0     0
igb1   9000 <Link#2>      ac:1f:6b:b0:60:bc 14341213     0     0  8447249     0     0
lo0   16384 <Link#3>      lo0                 367691     0     0   367691     0     0
lo0       - localhost     localhost               68     -     -       68     -     -
lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
lo0       - your-net      localhost           348944     -     -   348944     -     -
mlxen  9000 <Link#4>      00:02:c9:35:df:20 13138046     0    12 26308206     0     0
mlxen     - 10.5.5.0/24   10.5.5.1          11592389     -     - 24345184     -     -
vm-pu  9000 <Link#6>      56:3e:55:8a:2a:f8     7270     0     0   962249   102     0
lagg0  9000 <Link#5>      ac:1f:6b:b0:60:bc 31543941     0     0 31623674     0     0
lagg0     - 192.168.0.0/2 nasbox            27967582     -     - 41779731     -     -

> What threads/irq are allocated to your NIC? 'vmstat -i'

Doesn't seem perfectly balanced, but not terribly imbalanced either:

interrupt                          total       rate
irq9: acpi0                            3          0
irq18: ehci0 ehci1+               803162          2
cpu0:timer                      67465114        167
cpu1:timer                      65068819        161
cpu2:timer                      65535300        163
cpu3:timer                      63408731        157
cpu4:timer                      63026304        156
cpu5:timer                      63431412        157
irq56: nvme0:admin                    18          0
irq57: nvme0:io0                  544999          1
irq58: nvme0:io1                  465816          1
irq59: nvme0:io2                  487486          1
irq60: nvme0:io3                  474616          1
irq61: nvme0:io4                  452527          1
irq62: nvme0:io5                  467807          1
irq63: mps0                     36110415         90
irq64: mps1                    112328723        279
irq65: mps2                     54845974        136
irq66: mps3                     50770215        126
irq68: xhci0                     3122136          8
irq70: igb0:rxq0                 1974562          5
irq71: igb0:rxq1                 3034190          8
irq72: igb0:rxq2                28703842         71
irq73: igb0:rxq3                 1126533          3
irq74: igb0:aq                         7          0
irq75: igb1:rxq0                 1852321          5
irq76: igb1:rxq1                 2946722          7
irq77: igb1:rxq2                 9602613         24
irq78: igb1:rxq3                 4101258         10
irq79: igb1:aq                         8          0
irq80: ahci1                    37386191         93
irq81: mlx4_core0                4748775         12
irq82: mlx4_core0               13754442         34
irq83: mlx4_core0                3551629          9
irq84: mlx4_core0                2595850          6
irq85: mlx4_core0                4947424         12
Total                          769135944       1908

> Are the above threads floating or mapped? 'cpuset -g ...'

I suspect I was supposed to run this with a pid as the argument,
maybe nfsd's?  Here's the output without an argument:

pid -1 mask: 0, 1, 2, 3, 4, 5
pid -1 domain policy: first-touch mask: 0
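
(For what it's worth, I think the way to check a specific process is
something like:

# cpuset -g -p `pgrep -o nfsd`

i.e. -g to report the set and -p to name the pid, with pgrep -o
picking the oldest matching nfsd process.  Treat the exact incantation
as a guess on my part.)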

> Disable nfs tcp drc

This is the first I've ever seen a duplicate request cache mentioned.
It seems counter-intuitive that it would help, but maybe I'll try
disabling it.  What exactly is the benefit?
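
(From a bit of searching, it looks like the server-side knob may be
the vfs.nfsd.cachetcp sysctl, i.e. something like:

# sysctl vfs.nfsd.cachetcp=0

on the FreeBSD server to stop caching replies for TCP mounts.  Please
correct me if that's the wrong knob.)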

> What is your atime setting?

Disabled at both the file system and the client mounts.
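(Specifically, "zfs set atime=off <pool>" on the server side and the
noatime option in the client's mount flags, with <pool> standing in
for the actual dataset name.)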

> You also state you are using a Linux client. Are you using the MLX
> affinity scripts, buffer sizing suggestions, etc, etc. Have you
> swapped the Linux system for a fbsd system?
I've not, though I do vaguely recall Mellanox supplying some scripts
in their documentation that pinned interrupt handling to specific
cores at one point.  Is this what you're referring to?  I could give
that a try.  I don't at present have any FreeBSD client systems with
enough PCI Express bandwidth to swap things out for a Linux vs
FreeBSD test.
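
(If it helps anyone else reading: I believe the scripts in question
ship with Mellanox's OFED/mlnx-tools bundle on Linux, e.g. something
along the lines of:

# set_irq_affinity.sh <interface>

though I'm going from memory on the exact script name and arguments.)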

> You mention iperf. Please post the options you used when invoking
> iperf and its output.

Setting up the NFS client as the iperf "server" (the terminology is a
bit flipped with iperf), here's the output:

-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 10.5.5.1, port 11534
[  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
[  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
[  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
[  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
[  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
[  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
[  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
[  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
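
(The invocation itself was nothing special; judging by the defaults
visible above (port 5201, single stream, 10-second run) it amounts to:

# iperf3 -s              (on 10.5.5.4, the NFS client)
# iperf3 -c 10.5.5.4     (on 10.5.5.1, the NFS server)

iperf3 runs one TCP stream by default; -P 4 would be closer to the
nconnect=4 NFS setup.)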
On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
>
> ----- Adam Stylinski's Original Message -----
> > Hello,
> >
> > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > mode.  They have their MTU's maxed at 9000, their ring buffers maxed
> > at 8192, and I can hit around 36 gbps with iperf.
> > When using an NFS client (client = linux, server = freebsd), I see a
> > maximum rate of around 20gbps.  The test file is fully in ARC.  The
> > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > 1MB.
> >
> > Here's the flame graph of the kernel of the system in question, with
> > idle stacks removed:
> >
> > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> >
> > The longest function seems like maybe it's the ERMS-aware memcpy
> > happening from the ARC?  Is there maybe a missing fast path that could
> > take fewer copies into the socket buffer?
>
> Hi Adam -
>
>    Some items to look at and possibly include for more responses....
>
> - What is your server system? Make/model/ram/etc. What is your
>   overall 'top' cpu utilization 'top -aH' ...
>
> - It looks like you're using a 40gb/s card. Posting the output of
>   'ifconfig -vm' would provide additional information.
>
> - Are the interfaces running cleanly? 'netstat -i' is helpful.
>
> - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
>
> - Inspect 'netstat -m'. Denied? Delayed?
>
>
> - You mention iperf. Please post the options you used when
>   invoking iperf and its output.
>
> - You appear to be looking for throughput vs low latency. Have
>   you looked at window size vs the amount of memory allocated to the
>   streams? These values vary based on the bit-rate of the connection.
>   TCP connections require outstanding un-ack'd data to be held.
>   This affects the values below.
>
>
> - What are your values for:
>
> -- kern.ipc.maxsockbuf
> -- net.inet.tcp.sendbuf_max
> -- net.inet.tcp.recvbuf_max
>
> -- net.inet.tcp.sendspace
> -- net.inet.tcp.recvspace
>
> -- net.inet.tcp.delayed_ack
>
> - What threads/irq are allocated to your NIC? 'vmstat -i'
>
> - Are the above threads floating or mapped? 'cpuset -g ...'
>
> - Determine best settings for LRO/TSO for your card.
>
> - Disable nfs tcp drc
>
> - What is your atime setting?
>
>
>    If you really think you have a ZFS/kernel issue, and your
> data fits in cache, dump ZFS, create a memory-backed file system
> and repeat your tests. This will purge a large portion of your
> graph.  LRO/TSO changes may do so also.
>
>    You also state you are using a Linux client. Are you using
> the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> Have you swapped the Linux system for a fbsd system?
>
>    And as a final note, I regularly use Chelsio T62100 cards
> in dual-homed and/or LACP environments in Supermicro boxes with 100s
> of nfs boot (Bhyve, QEMU, and physical system) clients per server
> with no network starvation or cpu bottlenecks.  Clients boot, perform
> their work, and then remotely request image rollback.
>
>
>    Hopefully the above will help and provide pointers.
>
> Cheers
>


