From: Adam Stylinski <kungfujesus06@gmail.com>
Date: Sun, 22 May 2022 09:35:52 -0400
Subject: Re: zfs/nfsd performance limiter
To: John
Cc: freebsd-fs@freebsd.org
List-Id: Filesystems
List-Archive: https://lists.freebsd.org/archives/freebsd-fs
> What is your server system? Make/model/ram/etc.

Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (6 cores, a little starved on the
clock, but the load is basically zero during this test), with 128GB of memory.

> top -aH

During the copy load (for brevity, only the real top contenders for CPU):

last pid: 15560;  load averages: 0.25, 0.39, 0.27   up 4+15:48:54  09:17:38
98 threads:   2 running, 96 sleeping
CPU:  0.0% user,  0.0% nice, 19.1% system,  5.6% interrupt, 75.3% idle
Mem: 12M Active, 4405M Inact, 8284K Laundry, 115G Wired, 1148M Buf, 4819M Free
ARC: 98G Total, 80G MFU, 15G MRU, 772K Anon, 1235M Header, 1042M Other
     91G Compressed, 189G Uncompressed, 2.09:1 Ratio
Swap: 5120M Total, 5120M Free

  PID USERNAME  PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 3830 root       20    0    12M  2700K rpcsvc   2   1:16  53.26% nfsd: server (nfsd){nfsd: service}
 3830 root       20    0    12M  2700K CPU5     5   5:42  52.96% nfsd: server (nfsd){nfsd: master}
15560 adam       20    0    17M  5176K CPU2     2   0:00   0.12% top -aH
 1493 root       20    0    13M  2260K select   3   0:36   0.01% /usr/sbin/powerd
 1444 root       20    0    75M  2964K select   5   0:19   0.01% /usr/sbin/mountd -r /etc/exports /etc/zfs/exports
 1215 uucp       20    0    13M  2820K select   5   0:27   0.01% /usr/local/libexec/nut/usbhid-ups -a cyberpower
93424 adam       20    0    21M  9900K select   0   0:00   0.01% sshd: adam@pts/0 (sshd)

> ifconfig -vm

mlxen0: flags=8843 metric 0 mtu 9000
        options=ed07bb
        capabilities=ed07bb
        ether 00:02:c9:35:df:20
        inet 10.5.5.1 netmask 0xffffff00 broadcast 10.5.5.255
        media: Ethernet autoselect (40Gbase-CR4)
        status: active
        supported media:
                media autoselect
                media 40Gbase-CR4 mediaopt full-duplex
                media 10Gbase-CX4 mediaopt full-duplex
                media 10Gbase-SR mediaopt full-duplex
                media 1000baseT mediaopt full-duplex
        nd6 options=29
        plugged: QSFP+ 40GBASE-CR4 (No separable connector)
        vendor: Mellanox PN: MC2207130-002 SN: MT1419VS07971 DATE: 2014-06-06
        module temperature: 0.00 C voltage: 0.00 Volts
        lane 1: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
        lane 2: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
        lane 3: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
        lane 4: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA

> - What are your values for:
>
> -- kern.ipc.maxsockbuf
> -- net.inet.tcp.sendbuf_max
> -- net.inet.tcp.recvbuf_max
>
> -- net.inet.tcp.sendspace
> -- net.inet.tcp.recvspace
>
> -- net.inet.tcp.delayed_ack

kern.ipc.maxsockbuf: 16777216
net.inet.tcp.sendbuf_max: 16777216
net.inet.tcp.recvbuf_max: 16777216
net.inet.tcp.sendspace: 32768   # This is interesting? I'm not sure why the discrepancy here
net.inet.tcp.recvspace: 4194304
net.inet.tcp.delayed_ack: 0

> netstat -i

Name    Mtu Network        Address               Ipkts Ierrs Idrop     Opkts Oerrs  Coll
igb0   9000                ac:1f:6b:b0:60:bc  18230625     0     0  24178283     0     0
igb1   9000                ac:1f:6b:b0:60:bc  14341213     0     0   8447249     0     0
lo0   16384 lo0                                  367691     0     0    367691     0     0
lo0       - localhost      localhost                 68     -     -        68     -     -
lo0       - fe80::%lo0/64  fe80::1%lo0                0     -     -         0     -     -
lo0       - your-net       localhost             348944     -     -    348944     -     -
mlxen  9000                00:02:c9:35:df:20  13138046     0    12  26308206     0     0
mlxen     - 10.5.5.0/24    10.5.5.1           11592389     -     -  24345184     -     -
vm-pu  9000                56:3e:55:8a:2a:f8      7270     0     0    962249   102     0
lagg0  9000                ac:1f:6b:b0:60:bc  31543941     0     0  31623674     0     0
lagg0     - 192.168.0.0/2  nasbox             27967582     -     -  41779731     -     -

> What threads/irq are allocated to your NIC?
> 'vmstat -i'

Doesn't seem perfectly balanced, but not terribly imbalanced either:

interrupt                          total       rate
irq9: acpi0                            3          0
irq18: ehci0 ehci1+               803162          2
cpu0:timer                      67465114        167
cpu1:timer                      65068819        161
cpu2:timer                      65535300        163
cpu3:timer                      63408731        157
cpu4:timer                      63026304        156
cpu5:timer                      63431412        157
irq56: nvme0:admin                    18          0
irq57: nvme0:io0                  544999          1
irq58: nvme0:io1                  465816          1
irq59: nvme0:io2                  487486          1
irq60: nvme0:io3                  474616          1
irq61: nvme0:io4                  452527          1
irq62: nvme0:io5                  467807          1
irq63: mps0                     36110415         90
irq64: mps1                    112328723        279
irq65: mps2                     54845974        136
irq66: mps3                     50770215        126
irq68: xhci0                     3122136          8
irq70: igb0:rxq0                 1974562          5
irq71: igb0:rxq1                 3034190          8
irq72: igb0:rxq2                28703842         71
irq73: igb0:rxq3                 1126533          3
irq74: igb0:aq                         7          0
irq75: igb1:rxq0                 1852321          5
irq76: igb1:rxq1                 2946722          7
irq77: igb1:rxq2                 9602613         24
irq78: igb1:rxq3                 4101258         10
irq79: igb1:aq                         8          0
irq80: ahci1                    37386191         93
irq81: mlx4_core0                4748775         12
irq82: mlx4_core0               13754442         34
irq83: mlx4_core0                3551629          9
irq84: mlx4_core0                2595850          6
irq85: mlx4_core0                4947424         12
Total                          769135944       1908

> Are the above threads floating or mapped? 'cpuset -g ...'

I suspect I was supposed to run this against a pid, maybe nfsd's? Here's the
output without an argument:

pid -1 mask: 0, 1, 2, 3, 4, 5
pid -1 domain policy: first-touch mask: 0

> Disable nfs tcp drc

This is the first time I've even seen a duplicate request cache mentioned. It
seems counter-intuitive that disabling it would help, but maybe I'll try it.
What exactly is the benefit?

> What is your atime setting?

Disabled at both the file system and the client mounts.

> You also state you are using a Linux client. Are you using the MLX affinity
> scripts, buffer sizing suggestions, etc, etc. Have you swapped the Linux
> system for a fbsd system?

I've not, though I do vaguely recall Mellanox supplying some scripts in their
documentation that pinned interrupt handling to specific cores at one point.
Is this what you're referring to? I could give that a try.
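For the record, I'm guessing the intended cpuset invocation was something
like the below (pid 3830 taken from the top output above; I haven't run this
against nfsd yet, so treat it as a sketch):

```shell
# Show the CPU affinity mask of the nfsd process (pid from top above).
cpuset -g -p 3830

# Or look the pid up by name instead of hard-coding it
# (pgrep -o picks the oldest matching process, i.e. the master).
cpuset -g -p "$(pgrep -o nfsd)"
```

If the mask comes back as all CPUs (0-5), the nfsd threads are floating
rather than pinned, which I assume is what John was asking about.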
I don't at present have any FreeBSD client systems with enough PCI Express
bandwidth to swap things out for a Linux vs. FreeBSD test.

> You mention iperf. Please post the options you used when invoking iperf and
> its output.

Setting up the NFS client as the "server" (the terminology is a little
flipped with iperf), here's the output:

-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 10.5.5.1, port 11534
[  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
[  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
[  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
[  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
[  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
[  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
[  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
[  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
[  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------

On Sun, May 22, 2022 at 3:45 AM John wrote:
>
> ----- Adam Stylinski's Original Message -----
> > Hello,
> >
> > I have two systems connected via ConnectX-3 Mellanox cards in ethernet
> > mode. They have their MTUs maxed at 9000, their ring buffers maxed
> > at 8192, and I can hit around 36 gbps with iperf.
> >
> > When using an NFS client (client = Linux, server = FreeBSD), I see a
> > maximum rate of around 20 gbps. The test file is fully in ARC. The
> > test is performed with an NFS mount with nconnect=4 and an rsize/wsize
> > of 1MB.
> >
> > Here's the flame graph of the kernel of the system in question, with
> > idle stacks removed:
> >
> > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> >
> > The longest function seems like maybe it's the ERMS-aware memcpy
> > happening from the ARC? Is there maybe a missing fast path that could
> > take fewer copies into the socket buffer?
>
> Hi Adam -
>
> Some items to look at and possibly include for more responses....
>
> - What is your server system? Make/model/ram/etc. What is your
>   overall 'top' cpu utilization? 'top -aH' ...
>
> - It looks like you're using a 40gb/s card. Posting the output of
>   'ifconfig -vm' would provide additional information.
>
> - Are the interfaces running cleanly? 'netstat -i' is helpful.
>
> - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
>
> - Inspect 'netstat -m'. Denied? Delayed?
>
> - You mention iperf. Please post the options you used when
>   invoking iperf and its output.
>
> - You appear to be looking for throughput vs. low latency. Have
>   you looked at window size vs. the amount of memory allocated to the
>   streams? These values vary based on the bit rate of the connection.
>   TCP connections require outstanding un-ack'd data to be held.
>   This affects the values below.
>
> - What are your values for:
>
>   -- kern.ipc.maxsockbuf
>   -- net.inet.tcp.sendbuf_max
>   -- net.inet.tcp.recvbuf_max
>
>   -- net.inet.tcp.sendspace
>   -- net.inet.tcp.recvspace
>
>   -- net.inet.tcp.delayed_ack
>
> - What threads/irq are allocated to your NIC? 'vmstat -i'
>
> - Are the above threads floating or mapped? 'cpuset -g ...'
>
> - Determine the best settings for LRO/TSO for your card.
>
> - Disable nfs tcp drc.
>
> - What is your atime setting?
>
> If you really think you have a ZFS/kernel issue, and your
> data fits in cache, dump ZFS, create a memory-backed file system,
> and repeat your tests. This will purge a large portion of your
> graph. LRO/TSO changes may do so also.
>
> You also state you are using a Linux client. Are you using
> the MLX affinity scripts, buffer sizing suggestions, etc, etc.?
> Have you swapped the Linux system for a fbsd system?
>
> And as a final note, I regularly use Chelsio T62100 cards
> in dual-homed and/or LACP environments in Supermicro boxes with 100's
> of NFS-boot (bhyve, QEMU, and physical system) clients per server
> with no network starvation or cpu bottlenecks. Clients boot, perform
> their work, and then remotely request image rollback.
>
> Hopefully the above will help and provide pointers.
>
> Cheers
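P.S. For the memory-backed filesystem test suggested above, I assume
something like this is what's meant (mount point, size, test file path, and
export network are placeholders for my setup, untested so far):

```shell
# Stage the test file on a RAM-backed tmpfs so the copy path bypasses
# ZFS/ARC entirely (FreeBSD tmpfs; /mnt/ramtest and 8g are placeholders).
mkdir -p /mnt/ramtest
mount -t tmpfs -o size=8g tmpfs /mnt/ramtest
cp /pool/testfile /mnt/ramtest/

# Export it and re-run the NFS copy against this path instead.
echo '/mnt/ramtest -maproot=root -network 10.5.5.0/24' >> /etc/exports
service mountd reload
```

If the flame graph's memcpy time disappears with this in place, that would
point at the ARC copy path rather than the network stack.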