Subject: Re: zfs/nfsd performance limiter
From: Jan Bramkamp <crest@rlwinm.de>
To: freebsd-fs@freebsd.org
Date: Fri, 27 May 2022 13:30:15 +0200
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

On 23.05.22 00:26, Rick Macklem wrote:
> Adam Stylinski wrote:
> [stuff snipped]
>> However, in general, RPC RTT will define how well NFS performs and not
>> the I/O rate for a bulk file read/write.
> Let's take this RPC RTT thing a step further...
> - If I got the math right, at 40Gbps, 1Mbyte takes about 200usec on the wire.
> Without readahead, the protocol looks like this:
>
> Client                              Server (time going down the screen)
> small Read request --->
>                                 <-- 1Mbyte reply
> small Read request -->
>                                 <-- 1Mbyte reply
>
> The 1Mbyte replies take 200usec on the wire.
>
> Then suppose your ping time is 400usec (I see about 350usec on my little lan).
> - The wire is only transferring data about half of the time, because the small
> request message takes almost as long as the 1Mbyte reply.
>
> As you can see, readahead (where multiple reads are done concurrently)
> is critical for this case. I have no idea how Linux decides to do readahead.
> (FreeBSD defaults to 1 readahead, with a mount option that can increase
> that.)
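Rick's numbers check out on the back of an envelope, and they also show how hard that ceiling is (using the 400usec RTT from his example):

    1 Mbyte reply:   8 * 10^6 bit / 40 * 10^9 bit/s        = 200 usec on the wire
    one read cycle:  ~400 usec RTT + 200 usec serialization ~= 600 usec
    resulting rate:  1 Mbyte / 600 usec ~= 1.7 Gbyte/s ~= 13 Gbit/s

So a client with only one read outstanding tops out at roughly a third of a 40Gbps link, no matter how fast the pool behind the server is.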
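For reference, on a FreeBSD client the knob Rick mentions is the readahead mount option. Server name, export path and the count of 8 below are just examples, not a recommendation:

    mount -t nfs -o nfsv4,readahead=8 server:/export /mnt

If I remember correctly, a Linux client derives its NFS readahead from the per-mount backing device info, tunable through the read_ahead_kb knob under /sys/class/bdi/.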
> Now, net interfaces normally do interrupt moderation. This is done to
> avoid an interrupt storm during bulk data transfer. However, interrupt
> moderation results in interrupt delay for handling the small Read request
> message.
> --> Interrupt moderation can increase RPC RTT. Turning it off, if possible,
> might help.
>
> So, ping the server from the client to see what your RTT roughly is.
> Also, you could look at some traffic in wireshark, to see what readahead
> is happening and what the RPC RTT is.
> (You can capture with "tcpdump", but wireshark knows how to decode
> NFS properly.)
>
> As you can see, RPC traffic is very different from bulk data transfer.

Would it make sense to extend nconnect to apply different QoS markings to the control connection and the bulk connections, so that small(ish) RPC calls are prioritized over the bulk transfer RPCs? Failing that, is it possible to connect to the NFS server through different addresses for small and large RPCs, so that they use different NICs and switch ports? Roughly like the sketch below.
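Purely hypothetical, with nfs-bulk and nfs-meta being two addresses of the same server on different NICs, and option spellings as on a FreeBSD 13 client:

    # large read/write RPCs spread over several connections on the fast path
    mount -t nfs -o nfsv4,minorversion=1,nconnect=8 nfs-bulk:/export /mnt/bulk
    # metadata-heavy work over a separate NIC and switch port
    mount -t nfs -o nfsv4,minorversion=1 nfs-meta:/export /mnt/meta

That only gives two independent mounts, though; it doesn't separate the small and large RPCs of a single mount, which is why marking the connections differently looks like the more natural extension to me.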
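On Rick's interrupt moderation point: a Linux client can inspect and, driver permitting, disable coalescing with ethtool. The interface name is just an example:

    ethtool -c eth0                              # show the current coalescing settings
    ethtool -C eth0 adaptive-rx off rx-usecs 0   # trade interrupt load for latency

On the FreeBSD side the equivalent knobs are driver-specific sysctls and loader tunables.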
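And the measurement he suggests would look roughly like this; interface and server name are placeholders:

    ping -c 20 nfs-server                  # rough floor for the RPC RTT
    tcpdump -i eth0 -s 0 -w nfs.pcap host nfs-server and port 2049

Opening nfs.pcap in wireshark then shows the individual NFS RPCs, how many reads are actually in flight, and the per-RPC response times.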