FreeBSD Mail Archives

Date:      Sun, 22 May 2022 14:12:46 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Adam Stylinski <kungfujesus06@gmail.com>, John <jwd@freebsd.org>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: zfs/nfsd performance limiter
Message-ID:  <YQBPR0101MB97429323AD5F921BE76C613EDDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAJwHY9W-3eEXR%2BjTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com>
References:  <CAJwHY9WMOOLy=rb9FNjExQtYej21Zv=Po9Cbg=19gkw1SLFSww@mail.gmail.com> <YonqGfJST09cUV6W@FreeBSD.org> <CAJwHY9W-3eEXR%2BjTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com>

Adam Stylinski <kungfujesus06@gmail.com> wrote:
> jwd wrote:
> > What is your server system? Make/model/ram/etc.
> Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (6 cores, a little starved
> on the clock but the load at least is basically zero during this test)
> 128GB of memory
>
> > top -aH
> During the copy load (for brevity, only did the real top contenders
> for CPU here):
>
> last pid: 15560;  load averages:  0.25,  0.39,  0.27    up 4+15:48:54  09:17:38
> 98 threads:    2 running, 96 sleeping
> CPU:  0.0% user,  0.0% nice, 19.1% system,  5.6% interrupt, 75.3% idle
> Mem: 12M Active, 4405M Inact, 8284K Laundry, 115G Wired, 1148M Buf, 4819M Free
> ARC: 98G Total, 80G MFU, 15G MRU, 772K Anon, 1235M Header, 1042M Other
>     91G Compressed, 189G Uncompressed, 2.09:1 Ratio
> Swap: 5120M Total, 5120M Free
>
>   PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
>  3830 root         20    0    12M  2700K rpcsvc   2   1:16  53.26% nfsd: server (nfsd){nfsd: service}
>  3830 root         20    0    12M  2700K CPU5     5   5:42  52.96% nfsd: server (nfsd){nfsd: master}
> 15560 adam         20    0    17M  5176K CPU2     2   0:00   0.12% top -aH
>  1493 root         20    0    13M  2260K select   3   0:36   0.01% /usr/sbin/powerd
>  1444 root         20    0    75M  2964K select   5   0:19   0.01% /usr/sbin/mountd -r /etc/exports /etc/zfs/exports
>  1215 uucp         20    0    13M  2820K select   5   0:27   0.01% /usr/local/libexec/nut/usbhid-ups -a cyberpower
> 93424 adam         20    0    21M  9900K select   0   0:00   0.01% sshd: adam@pts/0 (sshd)
>
> > ifconfig -vm
> mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
> options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
> capabilities=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
> ether 00:02:c9:35:df:20
> inet 10.5.5.1 netmask 0xffffff00 broadcast 10.5.5.255
> media: Ethernet autoselect (40Gbase-CR4 <full-duplex,rxpause,txpause>)
> status: active
> supported media:
> media autoselect
> media 40Gbase-CR4 mediaopt full-duplex
> media 10Gbase-CX4 mediaopt full-duplex
> media 10Gbase-SR mediaopt full-duplex
> media 1000baseT mediaopt full-duplex
> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> plugged: QSFP+ 40GBASE-CR4 (No separable connector)
> vendor: Mellanox PN: MC2207130-002 SN: MT1419VS07971 DATE: 2014-06-06
> module temperature: 0.00 C voltage: 0.00 Volts
> lane 1: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> lane 2: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> lane 3: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> lane 4: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
>
> > - What are your values for:
> >
> > -- kern.ipc.maxsockbuf
> > -- net.inet.tcp.sendbuf_max
> > -- net.inet.tcp.recvbuf_max
> >
> > -- net.inet.tcp.sendspace
> > -- net.inet.tcp.recvspace
> >
> > -- net.inet.tcp.delayed_ack
> kern.ipc.maxsockbuf: 16777216
> net.inet.tcp.sendbuf_max: 16777216
> net.inet.tcp.recvbuf_max: 16777216
> net.inet.tcp.sendspace: 32768 # This is interesting?  I'm not sure why the discrepancy here
> net.inet.tcp.recvspace: 4194304
> net.inet.tcp.delayed_ack: 0
>
> > netstat -i
> Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
> igb0   9000 <Link#1>      ac:1f:6b:b0:60:bc 18230625     0     0 24178283     0     0
> igb1   9000 <Link#2>      ac:1f:6b:b0:60:bc 14341213     0     0  8447249     0     0
> lo0   16384 <Link#3>      lo0                 367691     0     0   367691     0     0
> lo0       - localhost     localhost               68     -     -       68     -     -
> lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
> lo0       - your-net      localhost           348944     -     -   348944     -     -
> mlxen  9000 <Link#4>      00:02:c9:35:df:20 13138046     0    12 26308206     0     0
> mlxen     - 10.5.5.0/24   10.5.5.1          11592389     -     - 24345184     -     -
> vm-pu  9000 <Link#6>      56:3e:55:8a:2a:f8     7270     0     0   962249   102     0
> lagg0  9000 <Link#5>      ac:1f:6b:b0:60:bc 31543941     0     0 31623674     0     0
> lagg0     - 192.168.0.0/2 nasbox            27967582     -     - 41779731     -     -
>
> > What threads/irq are allocated to your NIC? 'vmstat -i'
>
> Doesn't seem perfectly balanced but not terribly imbalanced, either:
>
> interrupt                          total       rate
> irq9: acpi0                            3          0
> irq18: ehci0 ehci1+               803162          2
> cpu0:timer                      67465114        167
> cpu1:timer                      65068819        161
> cpu2:timer                      65535300        163
> cpu3:timer                      63408731        157
> cpu4:timer                      63026304        156
> cpu5:timer                      63431412        157
> irq56: nvme0:admin                    18          0
> irq57: nvme0:io0                  544999          1
> irq58: nvme0:io1                  465816          1
> irq59: nvme0:io2                  487486          1
> irq60: nvme0:io3                  474616          1
> irq61: nvme0:io4                  452527          1
> irq62: nvme0:io5                  467807          1
> irq63: mps0                     36110415         90
> irq64: mps1                    112328723        279
> irq65: mps2                     54845974        136
> irq66: mps3                     50770215        126
> irq68: xhci0                     3122136          8
> irq70: igb0:rxq0                 1974562          5
> irq71: igb0:rxq1                 3034190          8
> irq72: igb0:rxq2                28703842         71
> irq73: igb0:rxq3                 1126533          3
> irq74: igb0:aq                         7          0
> irq75: igb1:rxq0                 1852321          5
> irq76: igb1:rxq1                 2946722          7
> irq77: igb1:rxq2                 9602613         24
> irq78: igb1:rxq3                 4101258         10
> irq79: igb1:aq                         8          0
> irq80: ahci1                    37386191         93
> irq81: mlx4_core0                4748775         12
> irq82: mlx4_core0               13754442         34
> irq83: mlx4_core0                3551629          9
> irq84: mlx4_core0                2595850          6
> irq85: mlx4_core0                4947424         12
> Total                          769135944       1908
>
> > Are the above threads floating or mapped? 'cpuset -g ...'
>
> I suspect I was supposed to run this against the argument of a pid,
> maybe nfsd?  Here's the output without an argument
>
> pid -1 mask: 0, 1, 2, 3, 4, 5
> pid -1 domain policy: first-touch mask: 0
>
> > Disable nfs tcp drc
>
> This is the first I've even seen a duplicate request cache mentioned.
> It seems counter-intuitive for why that'd help but maybe I'll try
> doing that.  What exactly is the benefit?
The DRC improves correctness for NFSv3 and NFSv4.0 mounts. It is a
performance hit. However, for a read mostly load it won't add too
much overhead. Turning it off increases the likelihood of data corruption
due to retried non-idempotent RPCs, but the failure will be rare over TCP.

If your mount is NFSv4.1 or 4.2, the DRC is not used, so don't worry about it.

> > What is your atime setting?
>
> Disabled at both the file system and the client mounts.
>
> > You also state you are using a Linux client. Are you using the MLX
> > affinity scripts, buffer sizing suggestions, etc, etc. Have you swapped
> > the Linux system for a fbsd system?
> I've not, though I do vaguely recall mellanox supplying some scripts
> in their documentation that fixed interrupt handling on specific cores
> at one point.  Is this what you're referring to?  I could give that a
> try.  I don't at present have any FreeBSD client systems with enough
> PCI express bandwidth to swap things out for a Linux vs FreeBSD test.
If you have not already done so, do a "nfsstat -m" on the client to find
out what options it is actually using (works on both Linux and FreeBSD).

If the Linux client has a way of manually adjusting readahead, then try
increasing it. (FreeBSD has a "readahead" mount option, but I can't recall
if Linux has one?)

You can try mounting the server on the server, but that will use lo0 and not
the mellanox, so it might be irrelevant.

Also, I don't know how many queues the mellanox driver used. You'd want
an "nconnect" at least as high as the number of queues, since each TCP
connection will be serviced by one queue and that limits its bandwidth.
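
(As a rough illustration of the queue argument above, here is a minimal
Python sketch. It assumes each TCP connection is serviced by exactly one NIC
queue and that connections spread evenly over the queues. The function name,
the queue count of 8 and the 6 Gbit/s per-queue ceiling are hypothetical
values chosen for illustration, not measurements from this thread.)

```python
def aggregate_gbps(nconnect, queues, per_queue_gbps, link_gbps):
    """Upper bound on throughput when each TCP connection maps to one NIC queue."""
    # In this simplified model, connections beyond the number of queues
    # only share queues that are already busy, so they add no bandwidth.
    busy_queues = min(nconnect, queues)
    # The link itself is the final ceiling.
    return min(busy_queues * per_queue_gbps, link_gbps)


# Hypothetical numbers: 8 queues, 6 Gbit/s per queue, 40 Gbit/s link.
for n in (1, 4, 8):
    print(n, aggregate_gbps(n, queues=8, per_queue_gbps=6.0, link_gbps=40.0))
```

With these made-up numbers, nconnect=4 cannot exceed 24 Gbit/s however fast
the link is, which is why the driver's real queue count is worth checking.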

However, in general, RPC RTT will define how well NFS performs and not
the I/O rate for a bulk file read/write.
Btw, writing is a very different story than reading, largely due to the need
to commit data/metadata to stable storage while writing.
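
(To put numbers on the RTT point, a minimal sketch under stated assumptions:
if a connection keeps a fixed number of read RPCs in flight, it cannot move
more than that many rsize-sized replies per round trip. The 1MB rsize comes
from the original post. The 1 ms RTT and the in-flight counts are
hypothetical, and the function name is made up for this example.)

```python
def read_throughput_gbps(rsize_bytes, rpcs_in_flight, rtt_seconds):
    """Ceiling on read throughput for one connection, in Gbit/s."""
    # At most rpcs_in_flight replies of rsize_bytes complete per round trip.
    bits_per_round_trip = rsize_bytes * rpcs_in_flight * 8
    return bits_per_round_trip / rtt_seconds / 1e9


RSIZE = 1024 * 1024   # 1MB rsize, as in the original post
RTT = 0.001           # hypothetical 1 ms RPC round-trip time

# More readahead means more RPCs in flight, which raises the ceiling.
for in_flight in (1, 2, 4):
    print(in_flight, round(read_throughput_gbps(RSIZE, in_flight, RTT), 2))
```

Under these assumptions a single outstanding 1MB read tops out near
8.4 Gbit/s, so raising readahead or lowering the RTT moves the ceiling while
the link speed does not.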

I can't help w.r.t. ZFS nor high performance nets (my fastest is 1Gbps), rick

>  You mention iperf. Please post the options you used when invoking iperf and its output.

Setting up the NFS client as a "server", since it seems that the
terminology is a little bit flipped with iperf, here's the output:

-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 10.5.5.1, port 11534
[  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
[  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
[  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
[  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
[  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
[  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
[  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
[  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------

On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
>
> ----- Adam Stylinski's Original Message -----
> > Hello,
> >
> > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > mode.  They have their MTUs maxed at 9000, their ring buffers maxed
> > at 8192, and I can hit around 36 gbps with iperf.
> >
> > When using an NFS client (client = linux, server = freebsd), I see a
> > maximum rate of around 20gbps.  The test file is fully in ARC.  The
> > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > 1MB.
> >
> > Here's the flame graph of the kernel of the system in question, with
> > idle stacks removed:
> >
> > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> >
> > The longest function seems like maybe it's the ERMS aware memcpy
> > happening from the ARC?  Is there maybe a missing fast path that could
> > take fewer copies into the socket buffer?
>
> Hi Adam -
>
>    Some items to look at and possibly include for more responses....
>
> - What is your server system? Make/model/ram/etc. What is your
>   overall 'top' cpu utilization 'top -aH' ...
>
> - It looks like you're using a 40gb/s card. Posting the output of
>   'ifconfig -vm' would provide additional information.
>
> - Are the interfaces running cleanly? 'netstat -i' is helpful.
>
> - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
>
> - Inspect 'netstat -m'. Denied? Delayed?
>
>
> - You mention iperf. Please post the options you used when
>   invoking iperf and its output.
>
> - You appear to be looking for throughput vs low-latency. Have
>   you looked at window-size vs the amount of memory allocated to the
>   streams? These values vary based on the bit-rate of the connection.
>   Tcp connections require outstanding un-ack'd data to be held.
>   This affects the values below.
>
>
> - What are your values for:
>
> -- kern.ipc.maxsockbuf
> -- net.inet.tcp.sendbuf_max
> -- net.inet.tcp.recvbuf_max
>
> -- net.inet.tcp.sendspace
> -- net.inet.tcp.recvspace
>
> -- net.inet.tcp.delayed_ack
>
> - What threads/irq are allocated to your NIC? 'vmstat -i'
>
> - Are the above threads floating or mapped? 'cpuset -g ...'
>
> - Determine best settings for LRO/TSO for your card.
>
> - Disable nfs tcp drc
>
> - What is your atime setting?
>
>
>    If you really think you have a ZFS/Kernel issue, and your
> data fits in cache, dump ZFS, create a memory backed file system
> and repeat your tests. This will purge a large portion of your
> graph.  LRO/TSO changes may do so also.
>
>    You also state you are using a Linux client. Are you using
> the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> Have you swapped the Linux system for a fbsd system?
>
>    And as a final note, I regularly use Chelsio T62100 cards
> in dual home and/or LACP environments in Supermicro boxes with 100's
> of nfs boot (Bhyve, QEMU, and physical system) clients per server
> with no network starvation or cpu bottlenecks.  Clients boot, perform
> their work, and then remotely request image rollback.
>
>
>    Hopefully the above will help and provide pointers.
>
> Cheers
>


Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YQBPR0101MB97429323AD5F921BE76C613EDDD59>
