From: Peter Eriksson <pen@lysator.liu.se>
Subject: Re: how to fix an interesting issue with mountd?
Date: Tue, 2 Jun 2020 15:12:40 +0200
To: freebsd-net@freebsd.org
Cc: "Rodney W. Grimes", Mark Johnston, "patrykkotlowski@gmail.com", Rick Macklem
Message-Id: <7E0A7D8E-72E6-4D32-B2A7-C4CE4127DDEF@lysator.liu.se>

I once reported that we had a server with many thousands (typically 23000 or so per server) of ZFS filesystems (and 300+ snapshots per filesystem) where mountd was 100% busy reading and updating the kernel (and, while doing that, holding the NFS lock for a very long time) every hour (when we took snapshots of all
the filesystems - the code in the zfs commands sends a lot of SIGHUPs to mountd, it seems)… (Causing NFS users to complain quite a bit.)

I have also seen the effect that when there are a lot of updates to filesystems, some exports can get “missed” if mountd is bombarded with multiple SIGHUPs - but with the new incremental update code in mountd this window (for SIGHUPs to get lost) is much smaller (and I now also have a Nagios check that verifies that all exports in /etc/zfs/exports are also visible in the kernel).

But while we had this problem I also investigated moving to a DB-based exports “file” in order to make the code in the “zfs” commands that reads and updates /etc/zfs/exports a lot faster too. As Rick says, there is room for _huge_ improvements there.

For every change of “sharenfs” per filesystem it would open, read and parse, line by line, /etc/zfs/exports *two* times and then rewrite the whole file. Now imagine doing that recursively for 23000 filesystems… My change to the zfs code simply opened a DB file and did a “put” of a record for the filesystem (and then sent mountd a SIGHUP).

(And even worse - when doing the boot-time “zfs share -a”, for each filesystem it would open /etc/zfs/exports, read it line by line to check that the filesystem isn’t already in the file, then open a temp file, write out all the old filesystems plus the new one, rename it to /etc/zfs/exports, and send a SIGHUP) and then go on to the next one. Repeat. Pretty fast for 1-10 filesystems, not so fast for 20000+ ones… And it tests the boot disk I/O a bit :-)

I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit in this regard, and I think for Linux they are going the route of updating the kernel directly instead of going via some external updater (like mountd).
Updating the kernel directly would probably be an even better way (for ZFS), but a DB-based exports file might be useful anyway. It’s a very simple change (especially in mountd - it just opens the DB file and reads the records sequentially instead of the text file).

- Peter

> On 2 Jun 2020, at 06:30, Rick Macklem wrote:
>
> Rodney Grimes wrote:
>>> Hi,
>>>
>>> I'm posting this one to freebsd-net@ since it seems vaguely similar
>>> to a network congestion problem and thought that network types
>>> might have some ideas w.r.t. fixing it?
>>>
>>> PR#246597 - Reports a problem (which, if I understand it, is) where a sighup
>>> is posted to mountd and then another sighup is posted to mountd while
>>> it is reloading exports, and the exports are not reloaded again.
>>> --> The simple patch in the PR fixes the above problem, but I think it will
>>> aggravate another one.
>>> For some NFS servers, it can take minutes to reload the exports file(s).
>>> (I believe Peter Eriksson has a server with 80000+ file systems exported.)
>>> r348590 reduced the time taken, but it is still minutes, if I recall correctly.
> Actually, my recollection w.r.t. the times was way off.
> I just looked at the old PR#237860 and, without r348590, it was 16 seconds
> (aka seconds, not minutes) and with r348590 that went down to a fraction
> of a second (there was no exact number in the PR, but I noted milliseconds in
> the commit log entry).
>
> I still think there is a risk in doing the reloads repeatedly.
>
>>> --> If you apply the patch in the PR and sighups are posted to mountd as
>>> often as it takes to reload the exports file(s), it will simply reload the
>>> exports file(s) over and over and over again, instead of processing
>>> Mount RPC requests.
>>>
>>> So, finally to the interesting part...
>>> - It seems that the code needs to be changed so that it won't "forget"
>>> sighup(s) posted to it, but it should not reload the exports file(s) too frequently.
>>> --> My thoughts are something like:
>>> - Note that sighup(s) were posted while reloading the exports file(s) and
>>> do the reload again, after some minimum delay.
>>> --> The minimum delay might only need to be 1 second, to allow some
>>> RPCs to be processed before the reload happens again.
>>> Or
>>> --> The minimum delay could be some fraction of how long a reload takes.
>>> (The code could time the reload and use that to calculate how long to
>>> delay before doing the reload again.)
>>>
>>> Any ideas or suggestions? rick
>>> ps: I've actually known about this for some time, but since I didn't have a good
>>> solution...
>>
>> Build a system that allows adding and removing entries from the
>> in-mountd exports data so that you do not have to do a full
>> reload every time one is added or removed?
>>
>> Build a system that uses 2 exports tables, the active one and the
>> one that is being loaded, so that you can process RPCs and reloads
>> at the same time.
> Well, r348590 modified mountd so that it built a new set of linked-list
> structures from the modified exports file(s) and then compared them with
> the old ones, only doing updates to the kernel exports for changes.
>
> It still processes the entire exports file each time, to produce the
> in-mountd-memory linked lists (using hash tables and a binary tree).
>
> Peter did send me a patch to use a db frontend, but he felt the only
> performance improvements would be related to ZFS.
> Since ZFS is something I avoid like the plague, I never pursued it.
> (If anyone willing to do ZFS stuff wants to pursue this,
> just email me and I can send you the patch.)
> Here's a snippet of what he said about it:
>> It looks like a very simple patch to create, and even though it wouldn’t really
>> improve the speed for the work that mountd does, it would make possible really
>> drastic speed improvements in the zfs commands.
>> They (the zfs commands) currently read through the text-based exports file
>> multiple times when you do work with zfs filesystems (mounting/sharing/changing
>> share options etc). Using a db-based exports file for the zfs exports
>> (b-tree based, probably) would allow the zfs code to be much faster.
>
> At this point, I am just interested in fixing the problem in the PR, rick
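For what it's worth, the "don't forget sighups, but rate-limit the reloads" idea Rick describes above could be modeled like this. mountd itself is C; this is just an illustrative Python sketch with made-up names: the signal handler only records that a reload was requested, and the RPC service loop performs at most one reload per min_delay, so a burst of SIGHUPs collapses into one pending request that is never lost.

```python
# Illustrative model (not mountd's actual code) of coalescing SIGHUPs:
# the handler only sets a "pending" flag; the service loop reloads at
# most once per min_delay seconds, so requests are never forgotten but
# reloads cannot starve Mount RPC processing.
import time

class ReloadGate:
    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay   # e.g. 1 second, or a fraction of
        self.clock = clock           # the measured reload time
        self.pending = False         # set by the SIGHUP handler
        self.last = float("-inf")    # time of the last reload

    def request(self):
        # Called from the SIGHUP handler: just remember the request.
        self.pending = True

    def should_reload(self):
        # Polled from the RPC service loop between requests.
        if self.pending and self.clock() - self.last >= self.min_delay:
            self.pending = False
            self.last = self.clock()
            return True
        return False
```

The variant where the delay is a fraction of the reload time just means timing each reload and storing, say, half of that into min_delay afterwards; either way, RPCs keep being served between reloads.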