Date:      Wed, 25 May 2022 13:24:01 -0400
From:      Adam Stylinski <kungfujesus06@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        John <jwd@freebsd.org>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: zfs/nfsd performance limiter
Message-ID:  <CAJwHY9X=GmdLQ1wMrVSs4NcPQrfk6+z=e4rHSO2zmC5G=AxvCQ@mail.gmail.com>
In-Reply-To: <YQBPR0101MB9742A3D546254D116DAA416CDDD69@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
References:  <CAJwHY9WMOOLy=rb9FNjExQtYej21Zv=Po9Cbg=19gkw1SLFSww@mail.gmail.com> <YonqGfJST09cUV6W@FreeBSD.org> <CAJwHY9W-3eEXR+jTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com> <YQBPR0101MB9742A3D546254D116DAA416CDDD69@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>

Hmm, I don't know that the presence of jumbo 9k mbufs indicates
whether the mellanox driver is using them, given that I have a link
aggregation on a different (1gbps) NIC that could also be the source
of them:

mbuf:                   256, 52231134,   49500,   25931, 1956138424,   0,   0,   0
mbuf_cluster:          2048,  8161114,    2794,    4352,  700435355,   0,   0,   0
mbuf_jumbo_page:       4096,  4080557,   12288,    3977,  155289291,   0,   0,   0
mbuf_jumbo_9k:         9216,  1609044,   32772,    4174,   35785053,   0,   0,   0
mbuf_jumbo_16k:       16384,   680092,       0,       0,          0,   0,   0,   0

Early on, 9k MTUs did show significant advantages for throughput,
from what I remember.  But of course, that was before trying any of
the aforementioned changes for multiplexing the connection.
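
(One way I might be able to disambiguate, assuming the zone counters
are global: snapshot them, push traffic over only the mlx link, and
diff.  Something like this, with iperf3 standing in for whatever
traffic generator is handy:

vmstat -z | fgrep mbuf_jumbo_9k > /tmp/before
iperf3 -c 10.5.5.4 -t 10 > /dev/null
vmstat -z | fgrep mbuf_jumbo_9k > /tmp/after
diff /tmp/before /tmp/after

If the request count jumps by roughly the packet count, it's the mlx4
driver allocating them.)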

On Wed, May 25, 2022 at 11:41 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> Adam Stylinski <kungfujesus06@gmail.com> wrote:
> [stuff snipped]
>
> > > ifconfig -vm
> > mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
> Just in case you (or someone else reading this) is not aware of it,
> use of 9K jumbo clusters causes fragmentation of the memory pool
> clusters are allocated from and, therefore, their use is not recommended.
>
> Now, it may be that the mellanox driver doesn't use 9K clusters (it could
> put the received frame in multiple smaller clusters), but if it does, you
> should consider reducing the mtu.
> If you:
> # vmstat -z | fgrep mbuf_jumbo_9k
> it will show you if they are being used.
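> For example, a sketch (the exact MTU cutoff at which a driver switches
> to 9K clusters varies by driver, and the Linux-side interface name
> below is hypothetical):
>
> # on the FreeBSD server: drop to an MTU that fits in 4K page clusters
> ifconfig mlxen0 mtu 4000
> # and match it on the Linux client
> ip link set dev eth0 mtu 4000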
>
> rick
>
>
> > netstat -i
> Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
> igb0   9000 <Link#1>      ac:1f:6b:b0:60:bc 18230625     0     0 24178283     0     0
> igb1   9000 <Link#2>      ac:1f:6b:b0:60:bc 14341213     0     0  8447249     0     0
> lo0   16384 <Link#3>      lo0                 367691     0     0   367691     0     0
> lo0       - localhost     localhost               68     -     -       68     -     -
> lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
> lo0       - your-net      localhost           348944     -     -   348944     -     -
> mlxen  9000 <Link#4>      00:02:c9:35:df:20 13138046     0    12 26308206     0     0
> mlxen     - 10.5.5.0/24   10.5.5.1          11592389     -     - 24345184     -     -
> vm-pu  9000 <Link#6>      56:3e:55:8a:2a:f8     7270     0     0   962249   102     0
> lagg0  9000 <Link#5>      ac:1f:6b:b0:60:bc 31543941     0     0 31623674     0     0
> lagg0     - 192.168.0.0/2 nasbox            27967582     -     - 41779731     -     -
>
> > What threads/irq are allocated to your NIC? 'vmstat -i'
>
> Doesn't seem perfectly balanced but not terribly imbalanced, either:
>
> interrupt                          total       rate
> irq9: acpi0                            3          0
> irq18: ehci0 ehci1+               803162          2
> cpu0:timer                      67465114        167
> cpu1:timer                      65068819        161
> cpu2:timer                      65535300        163
> cpu3:timer                      63408731        157
> cpu4:timer                      63026304        156
> cpu5:timer                      63431412        157
> irq56: nvme0:admin                    18          0
> irq57: nvme0:io0                  544999          1
> irq58: nvme0:io1                  465816          1
> irq59: nvme0:io2                  487486          1
> irq60: nvme0:io3                  474616          1
> irq61: nvme0:io4                  452527          1
> irq62: nvme0:io5                  467807          1
> irq63: mps0                     36110415         90
> irq64: mps1                    112328723        279
> irq65: mps2                     54845974        136
> irq66: mps3                     50770215        126
> irq68: xhci0                     3122136          8
> irq70: igb0:rxq0                 1974562          5
> irq71: igb0:rxq1                 3034190          8
> irq72: igb0:rxq2                28703842         71
> irq73: igb0:rxq3                 1126533          3
> irq74: igb0:aq                         7          0
> irq75: igb1:rxq0                 1852321          5
> irq76: igb1:rxq1                 2946722          7
> irq77: igb1:rxq2                 9602613         24
> irq78: igb1:rxq3                 4101258         10
> irq79: igb1:aq                         8          0
> irq80: ahci1                    37386191         93
> irq81: mlx4_core0                4748775         12
> irq82: mlx4_core0               13754442         34
> irq83: mlx4_core0                3551629          9
> irq84: mlx4_core0                2595850          6
> irq85: mlx4_core0                4947424         12
> Total                          769135944       1908
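>
> (If the mlx4_core interrupts turn out to be the bottleneck, I gather
> they can be pinned per-irq with cpuset, e.g.
>
> cpuset -l 0 -x 81
>
> to bind irq81 to cpu0, and similarly spreading irq82-85 across the
> other cores.  Just a sketch; I haven't tried it here.)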
>
> > Are the above threads floating or mapped? 'cpuset -g ...'
>
> I suspect I was supposed to run this with a pid as the argument,
> maybe nfsd's?  Here's the output without an argument:
>
> pid -1 mask: 0, 1, 2, 3, 4, 5
> pid -1 domain policy: first-touch mask: 0
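>
> (If it wants a pid, I assume something like
>
> cpuset -g -p $(pgrep -o nfsd)
>
> is what's intended, with pgrep -o picking the oldest nfsd process,
> which I'm assuming is the master.)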
>
> > Disable nfs tcp drc
>
> This is the first I've ever seen a duplicate request cache mentioned.
> It seems counter-intuitive that it would help, but maybe I'll try
> disabling it.  What exactly is the benefit?
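>
> (For reference, I believe the knob for that is the vfs.nfsd.cachetcp
> sysctl, i.e.
>
> sysctl vfs.nfsd.cachetcp=0
>
> though correct me if that's the wrong one.)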
>
> > What is your atime setting?
>
> Disabled at both the file system and the client mounts.
>
> > You also state you are using a Linux client. Are you using the MLX affinity scripts, buffer sizing suggestions, etc, etc. Have you swapped the Linux system for a fbsd system?
> I've not, though I do vaguely recall mellanox supplying some scripts
> in their documentation that fixed interrupt handling on specific cores
> at one point.  Is this what you're referring to?  I could give that a
> try.  I don't at present have any FreeBSD client systems with enough
> PCI express bandwidth to swap things out for a Linux vs FreeBSD test.
>
> >  You mention iperf. Please post the options you used when invoking iperf and its output.
>
> Setting up the NFS client as the "server" (the terminology is a
> little bit flipped with iperf), here's the output:
>
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 10.5.5.1, port 11534
> [  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
> [  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
> [  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
> [  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
> [  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
> [  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
>
> On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
> >
> > ----- Adam Stylinski's Original Message -----
> > > Hello,
> > >
> > > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > > mode.  They have their MTUs maxed at 9000, their ring buffers maxed
> > > at 8192, and I can hit around 36 gbps with iperf.
> > >
> > > When using an NFS client (client = linux, server = freebsd), I see a
> > > maximum rate of around 20gbps.  The test file is fully in ARC.  The
> > > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > > 1MB.
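> > >
> > > (I.e., a mount invocation along the lines of the following, with
> > > the export path being hypothetical:
> > >
> > > mount -t nfs -o nconnect=4,rsize=1048576,wsize=1048576 \
> > >     10.5.5.1:/pool/share /mnt/share )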
> > >
> > > Here's the flame graph of the kernel of the system in question, with
> > > idle stacks removed:
> > >
> > > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> > >
> > > The longest function seems like maybe it's the ERMS-aware memcpy
> > > happening from the ARC?  Is there maybe a missing fast path that could
> > > take fewer copies into the socket buffer?
> >
> > Hi Adam -
> >
> >    Some items to look at and possibly include for more responses....
> >
> > - What is your server system? Make/model/ram/etc. What is your
> >   overall 'top' cpu utilization 'top -aH' ...
> >
> > - It looks like you're using a 40gb/s card. Posting the output of
> >   'ifconfig -vm' would provide additional information.
> >
> > - Are the interfaces running cleanly? 'netstat -i' is helpful.
> >
> > - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
> >
> > - Inspect 'netstat -m'. Denied? Delayed?
> >
> >
> > - You mention iperf. Please post the options you used when
> >   invoking iperf and its output.
> >
> > - You appear to be looking for throughput vs low latency. Have
> >   you looked at window size vs the amount of memory allocated to the
> >   streams? These values vary based on the bit-rate of the connection.
> >   TCP connections require outstanding un-ack'd data to be held,
> >   which affects the values below.
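> >
> >   For example (the RTT here is purely illustrative): at 40 Gbit/s
> >   with a 0.5 ms RTT, the bandwidth-delay product is
> >   (40e9 / 8) bytes/s * 0.0005 s = 2.5 MB of un-ack'd data per
> >   stream, so the buffer maxima below must be at least that large
> >   to keep the pipe full.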
> >
> >
> > - What are your values for:
> >
> > -- kern.ipc.maxsockbuf
> > -- net.inet.tcp.sendbuf_max
> > -- net.inet.tcp.recvbuf_max
> >
> > -- net.inet.tcp.sendspace
> > -- net.inet.tcp.recvspace
> >
> > -- net.inet.tcp.delayed_ack
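> >
> >   All six can be read in one shot, e.g.:
> >
> >   sysctl kern.ipc.maxsockbuf net.inet.tcp.sendbuf_max \
> >       net.inet.tcp.recvbuf_max net.inet.tcp.sendspace \
> >       net.inet.tcp.recvspace net.inet.tcp.delayed_ack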
> >
> > - What threads/irq are allocated to your NIC? 'vmstat -i'
> >
> > - Are the above threads floating or mapped? 'cpuset -g ...'
> >
> > - Determine best settings for LRO/TSO for your card.
> >
> > - Disable nfs tcp drc
> >
> > - What is your atime setting?
> >
> >
> >    If you really think you have a ZFS/kernel issue, and your
> > data fits in cache, dump ZFS, create a memory-backed file system
> > and repeat your tests. This will purge a large portion of your
> > graph.  LRO/TSO changes may do so also.
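> >
> >    A minimal sketch of that (size and paths are illustrative):
> >
> >    mdmfs -s 32g md /mnt/ramtest
> >    cp /path/to/testfile /mnt/ramtest/
> >
> > Then re-export /mnt/ramtest over NFS and rerun the read test; that
> > takes ZFS and the ARC copy path out of the picture entirely.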
> >
> >    You also state you are using a Linux client. Are you using
> > the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> > Have you swapped the Linux system for a fbsd system?
> >
> >    And as a final note, I regularly use Chelsio T62100 cards
> > in dual home and/or LACP environments in Supermicro boxes with 100's
> > of nfs boot (Bhyve, QEMU, and physical system) clients per server
> > with no network starvation or cpu bottlenecks.  Clients boot, perform
> > their work, and then remotely request image rollback.
> >
> >
> >    Hopefully the above will help and provide pointers.
> >
> > Cheers
> >
>


