Date:      Sun, 22 May 2022 22:26:07 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Adam Stylinski <kungfujesus06@gmail.com>
Cc:        John <jwd@freebsd.org>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: zfs/nfsd performance limiter
Message-ID:  <YQBPR0101MB9742056AFEF03C6CAF2B7F56DDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAJwHY9WHE4MFScuhry7v9MqRQBSTNY5XYCH5qfO4xEn6Swwtrw@mail.gmail.com>
References:  <CAJwHY9WMOOLy=rb9FNjExQtYej21Zv=Po9Cbg=19gkw1SLFSww@mail.gmail.com> <YonqGfJST09cUV6W@FreeBSD.org> <CAJwHY9W-3eEXR+jTw40thcio65Ukjw8qgnp-qPiS3bdeZS0kLw@mail.gmail.com> <YQBPR0101MB97429323AD5F921BE76C613EDDD59@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM> <CAJwHY9WHE4MFScuhry7v9MqRQBSTNY5XYCH5qfO4xEn6Swwtrw@mail.gmail.com>

Adam Stylinski <kungfujesus06@gmail.com> wrote:
[stuff snipped]
>
> However, in general, RPC RTT will define how well NFS performs and not
> the I/O rate for a bulk file read/write.
Let's take this RPC RTT thing a step further...
- If I got the math right, at 40Gbps, 1Mbyte takes about 200usec on the wire.
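(Spelling the arithmetic out: 1Mbyte = 8,388,608 bits, and
 8,388,608 bits / 40,000,000,000 bits/sec =~ 210usec, so call it 200usec.)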
Without readahead, the protocol looks like this:
Client                                     Server (time going down the screen)
        small Read request --->
        <-- 1Mbyte reply
        small Read request -->
        <-- 1Mbyte reply
The 1Mbyte replies take 200usec on the wire.

Then suppose your ping time is 400usec (I see about 350usec on my little lan).
- The wire is only transferring data about half of the time, because the small
  request message takes almost as long as the 1Mbyte reply.

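(As a rough illustration, using the example numbers above: each 1Mbyte read
 costs about one RTT plus the wire time, roughly 400 + 200 = 600usec, so a
 single outstanding Read tops out near 1Mbyte/600usec =~ 14Gbits/sec no
 matter how fast the link is.)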
As you can see, readahead (where multiple reads are done concurrently)
is critical for this case. I have no idea how Linux decides to do readahead.
(FreeBSD defaults to 1 readahead, with a mount option that can increase
 that.)

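For example (option name from memory, so check mount_nfs(8); the server name
and export path are just placeholders):
  # FreeBSD client: allow up to 8 concurrent read-aheads
  mount -t nfs -o readahead=8 server:/export /mnt
On the Linux client, as far as I know, readahead is a per-mount tunable
rather than a mount option, so check how your distro/kernel sets it.
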
Now, net interfaces normally do interrupt moderation. This is done to
avoid an interrupt storm during bulk data transfer. However, interrupt
moderation results in interrupt delay for handling the small Read request
message.
--> Interrupt moderation can increase RPC RTT. Turning it off, if possible,
      might help.

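For example, on the Linux client the coalescing settings can usually be seen
and reduced with ethtool (the interface name is a placeholder, and not every
driver supports every parameter):
  ethtool -c eth0                              # show current interrupt coalescing
  ethtool -C eth0 adaptive-rx off rx-usecs 0   # favour latency over CPU savings
On the FreeBSD side the equivalent knobs are driver-specific sysctls/tunables.
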
So, ping the server from the client to see what your RTT roughly is.
Also, you could look at some traffic in wireshark, to see what readahead
is happening and what the RPC RTT is.
(You can capture with "tcpdump", but wireshark knows how to decode
 NFS properly.)

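Something like this would do (names are placeholders; port 2049 is NFS):
  ping -c 20 server
  tcpdump -i eth0 -s 256 -w /tmp/nfs.pcap host server and port 2049
A 256 byte snaplen keeps the capture small at 40Gbps while still grabbing
the RPC/NFS headers that wireshark needs to decode.
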
As you can see, RPC traffic is very different from bulk data transfer.

rick

> Btw, writing is a very different story than reading, largely due to the need
> to commit data/metadata to stable storage while writing.
>
> I can't help w.r.t. ZFS nor high performance nets (my fastest is 1Gbps), rick
>
> >  You mention iperf. Please post the options you used when invoking iperf and its output.
>
> Setting up the NFS client as a "server", since it seems that the
> terminology is a little bit flipped with iperf, here's the output:
>
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 10.5.5.1, port 11534
> [  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
> [  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
> [  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
> [  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
> [  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
> [  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
>
> On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
> >
> > ----- Adam Stylinski's Original Message -----
> > > Hello,
> > >
> > > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > > mode.  They have their MTU's maxed at 9000, their ring buffers maxed
> > > at 8192, and I can hit around 36 gbps with iperf.
> > >
> > > When using an NFS client (client = linux, server = freebsd), I see a
> > > maximum rate of around 20gbps.  The test file is fully in ARC.  The
> > > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > > 1MB.
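> > > The mount is roughly along these lines, with the server name and export
> > > path as placeholders:
> > >   mount -t nfs -o nconnect=4,rsize=1048576,wsize=1048576 server:/export /mnt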
> > >
> > > Here's the flame graph of the kernel of the system in question, with
> > > idle stacks removed:
> > >
> > > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> > >
> > > The longest function seems like maybe it's the ERMS aware memcpy
> > > happening from the ARC?  Is there maybe a missing fast path that could
> > > take fewer copies into the socket buffer?
> >
> > Hi Adam -
> >
> >    Some items to look at and possibly include for more responses....
> >
> > - What is your server system? Make/model/ram/etc. What is your
> >   overall 'top' cpu utilization 'top -aH' ...
> >
> > - It looks like you're using a 40gb/s card. Posting the output of
> >   'ifconfig -vm' would provide additional information.
> >
> > - Are the interfaces running cleanly? 'netstat -i' is helpful.
> >
> > - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
> >
> > - Inspect 'netstat -m'. Denied? Delayed?
> >
> >
> > - You mention iperf. Please post the options you used when
> >   invoking iperf and its output.
> >
> > - You appear to be looking for through-put vs low-latency. Have
> >   you looked at window-size vs the amount of memory allocated to the
> >   streams? These values vary based on the bit-rate of the connection.
> >   TCP connections require outstanding un-ack'd data to be held.
> >   This affects the values below.
> >
> >
> > - What are your values for the following? (An example of how to
> >   query them is shown after this list of items.)
> >
> > -- kern.ipc.maxsockbuf
> > -- net.inet.tcp.sendbuf_max
> > -- net.inet.tcp.recvbuf_max
> >
> > -- net.inet.tcp.sendspace
> > -- net.inet.tcp.recvspace
> >
> > -- net.inet.tcp.delayed_ack
> >
> > - What threads/irq are allocated to your NIC? 'vmstat -i'
> >
> > - Are the above threads floating or mapped? 'cpuset -g ...'
> >
> > - Determine best settings for LRO/TSO for your card.
> >
> > - Disable nfs tcp drc
> >
> > - What is your atime setting?
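> >
> >   For example, the sysctl values above and the atime setting can be read
> >   on the FreeBSD server with something like (the dataset name is a
> >   placeholder):
> >
> >   sysctl kern.ipc.maxsockbuf net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max
> >   sysctl net.inet.tcp.sendspace net.inet.tcp.recvspace net.inet.tcp.delayed_ack
> >   zfs get atime pool/dataset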
> >
> >
> >    If you really think you have a ZFS/Kernel issue, and your
> > data fits in cache, dump ZFS, create a memory backed file system
> > and repeat your tests. This will purge a large portion of your
> > graph.  LRO/TSO changes may do so also.
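> >
> >    (For instance, a memory backed scratch file system on FreeBSD can be
> > set up roughly like this, with the size and mount point as examples:
> >      mkdir -p /mnt/ramtest
> >      mdmfs -s 32g md /mnt/ramtest
> > or, equivalently, mount -t tmpfs -o size=32g tmpfs /mnt/ramtest; then
> > export it over NFS and repeat the read test.)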
> >
> >    You also state you are using a Linux client. Are you using
> > the MLX affinity scripts, buffer sizing suggestions, etc, etc.?
> > Have you swapped the Linux system for a fbsd system?
> >
> >    And as a final note, I regularly use Chelsio T62100 cards
> > in dual home and/or LACP environments in Supermicro boxes with 100's
> > of nfs boot (Bhyve, QEMU, and physical system) clients per server
> > with no network starvation or cpu bottlenecks.  Clients boot, perform
> > their work, and then remotely request image rollback.
> >
> >
> >    Hopefully the above will help and provide pointers.
> >
> > Cheers
> >
>


