Date:      Tue, 2 Jun 2020 15:12:40 +0200
From:      Peter Eriksson <pen@lysator.liu.se>
To:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Cc:        "Rodney W. Grimes" <freebsd-rwg@gndrsh.dnsmgr.net>, Mark Johnston <markj@FreeBSD.org>, "patrykkotlowski@gmail.com" <patrykkotlowski@gmail.com>, Rick Macklem <rmacklem@uoguelph.ca>
Subject:   Re: how to fix an interesting issue with mountd?
Message-ID:  <7E0A7D8E-72E6-4D32-B2A7-C4CE4127DDEF@lysator.liu.se>
In-Reply-To: <YTBPR01MB36647DA1465C7D35CDCE9681DD8B0@YTBPR01MB3664.CANPRD01.PROD.OUTLOOK.COM>
References:  <YTBPR01MB36647DA1465C7D35CDCE9681DD8B0@YTBPR01MB3664.CANPRD01.PROD.OUTLOOK.COM>

I once reported that we had servers with many thousands of ZFS filesystems
(typically 23000 or so per server, with 300+ snapshots per filesystem) where
mountd was 100% busy reading and updating the kernel (and, while doing that,
holding the NFS lock for a very long time) every hour, when we took snapshots
of all the filesystems - the code in the zfs commands sends a lot of SIGHUPs
to mountd, it seems...

(Causing NFS users to complain quite a bit)

I have also seen the effect that, when there are a lot of updates to
filesystems, some exports can get "missed" if mountd is bombarded with
multiple SIGHUPs - but with the new incremental-update code in mountd this
window (for SIGHUPs to get lost) is much smaller (and I now also have a
Nagios check that verifies that all exports in /etc/zfs/exports are also
visible in the kernel).


But while we had this problem I also investigated moving to a DB-based
exports "file", in order to make the code in the "zfs" commands that reads
and updates /etc/zfs/exports a lot faster too. As Rick says, there is room
for _huge_ improvements there.

For every change of "sharenfs" on a filesystem it would open, read and
parse /etc/zfs/exports, line by line, *two* times, and then rewrite the
whole file. Now imagine doing that recursively for 23000 filesystems... My
change to the zfs code simply opened a DB file and did a "put" of a record
for the filesystem (and then sent mountd a SIGHUP).
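
Roughly, the put side looked like this (a minimal sketch against FreeBSD's
db(3) btree interface; the file name, option handling and error handling
here are illustrative, not the actual patch):

    #include <sys/types.h>
    #include <db.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    /* Store (or replace) one export record keyed by mountpoint and then
     * poke mountd, instead of rewriting all of /etc/zfs/exports. */
    static int
    export_put(char *mntpoint, char *opts)
    {
        DB *db;
        DBT key, data;
        FILE *fp;
        int pid;

        /* "/etc/zfs/exports.db" is a hypothetical path. */
        db = dbopen("/etc/zfs/exports.db", O_CREAT | O_RDWR, 0644,
            DB_BTREE, NULL);
        if (db == NULL)
            return (-1);
        key.data = mntpoint;
        key.size = strlen(mntpoint) + 1;
        data.data = opts;
        data.size = strlen(opts) + 1;
        if (db->put(db, &key, &data, 0) == -1) {
            db->close(db);
            return (-1);
        }
        db->close(db);

        /* One SIGHUP tells mountd to pick up the change (mountd(8)
         * writes its pid to /var/run/mountd.pid). */
        if ((fp = fopen("/var/run/mountd.pid", "r")) != NULL) {
            if (fscanf(fp, "%d", &pid) == 1 && pid > 0)
                kill((pid_t)pid, SIGHUP);
            fclose(fp);
        }
        return (0);
    }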

(And even worse - when doing the boot-time "zfs share -a", for each
filesystem it would open /etc/zfs/exports, read it line by line to make
sure the filesystem isn't already in the file, then open a tmp file, write
out all the old filesystems plus the new one, rename it to
/etc/zfs/exports, send a SIGHUP, and then go on to the next one. Repeat.
Pretty fast for 1-10 filesystems, not so fast for 20000+ ones... And it
exercises the boot disk I/O a bit :-)
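
Back-of-the-envelope, assuming each pass reads all existing lines and then
writes them all back plus one: the total line I/O for n filesystems is
roughly sum(k=1..n) 2k ~ n^2, so n = 23000 means on the order of 5 x 10^8
line reads/writes (plus 23000 temp-file creates, renames and SIGHUPs),
versus about n single-record puts with the DB approach.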

I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit in this
regard, and I think for Linux they are going the route of updating the
kernel directly instead of going via an external updater (like mountd).
That would probably be an even better way (for ZFS), but a DB-based exports
file might be useful anyway. It's a very simple change (especially in
mountd - it just opens the DB file and reads the records sequentially
instead of the text file).
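
The read side in mountd is about as small (again a sketch assuming a db(3)
btree file; the real patch may differ in details):

    #include <sys/types.h>
    #include <db.h>
    #include <fcntl.h>
    #include <stdio.h>

    /* Walk all export records in key order instead of parsing a text
     * file line by line; each record would feed the same export-
     * processing code the text-file path uses today. */
    static void
    load_exports_db(const char *path)
    {
        DB *db;
        DBT key, data;
        int rv;

        db = dbopen(path, O_RDONLY, 0, DB_BTREE, NULL);
        if (db == NULL)
            return;
        for (rv = db->seq(db, &key, &data, R_FIRST); rv == 0;
            rv = db->seq(db, &key, &data, R_NEXT))
            printf("%s %s\n", (char *)key.data, (char *)data.data);
        db->close(db);
    }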

- Peter

> On 2 Jun 2020, at 06:30, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> 
> Rodney Grimes wrote:
>>> Hi,
>>> 
>>> I'm posting this one to freebsd-net@ since it seems vaguely similar
>>> to a network congestion problem and thought that network types
>>> might have some ideas w.r.t. fixing it?
>>> 
>>> PR#246597 - Reports a problem (which, if I understand it, is) where a sighup
>>>   is posted to mountd and then another sighup is posted to mountd while
>>>   it is reloading exports and the exports are not reloaded again.
>>>   --> The simple patch in the PR fixes the above problem, but I think it will
>>>          aggravate another one.
>>> For some NFS servers, it can take minutes to reload the exports file(s).
>>> (I believe Peter Eriksson has a server with 80000+ file systems exported.)
>>> r348590 reduced the time taken, but it is still minutes, if I recall correctly.
> Actually, my recollection w.r.t. the times was way off.
> I just looked at the old PR#237860 and, without r348590, it was 16 seconds
> (aka seconds, not minutes) and with r348590 that went down to a fraction
> of a second (there was no exact number in the PR, but I noted milliseconds
> in the commit log entry).
> 
> I still think there is a risk of doing the reloads repeatedly.
> 
>>> --> If you apply the patch in the PR and sighups are posted to mountd as
>>>       often as it takes to reload the exports file(s), it will simply reload the
>>>       exports file(s) over and over and over again, instead of processing
>>>       Mount RPC requests.
>>> 
>>> So, finally to the interesting part...
>>> - It seems that the code needs to be changed so that it won't "forget"
>>>   sighup(s) posted to it, but it should not reload the exports file(s) too
>>>   frequently.
>>> --> My thoughts are something like:
>>>   - Note that sighup(s) were posted while reloading the exports file(s) and
>>>     do the reload again, after some minimum delay.
>>>     --> The minimum delay might only need to be 1 second to allow some
>>>            RPCs to be processed before reload happens again.
>>>     Or
>>>     --> The minimum delay could be some fraction of how long a reload takes.
>>>           (The code could time the reload and use that to calculate how long
>>>            to delay before doing the reload again.)
>>> 
>>> Any ideas or suggestions? rick
>>> ps: I've actually known about this for some time, but since I didn't have
>>>     a good solution...
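
For what it's worth, a minimal sketch of the debounce idea described above
(function names and the 1-second floor are illustrative, not the actual
mountd code):

    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sighup;

    static void
    huphandler(int sig)
    {
        (void)sig;
        got_sighup = 1;    /* only note it; never reload from the handler */
    }

    /* Stand-ins for the real work (hypothetical). */
    static void reload_exports(void) { }
    static void service_rpcs_briefly(void) { sleep(1); }

    /* Coalesce any number of SIGHUPs into one reload, and wait at least
     * min_delay seconds after a reload before the next one, so Mount
     * RPCs get serviced in between. */
    static void
    serve(void)
    {
        const time_t min_delay = 1;
        time_t last_reload = 0;

        signal(SIGHUP, huphandler);
        for (;;) {
            if (got_sighup && time(NULL) - last_reload >= min_delay) {
                /* Clear the flag first: a SIGHUP arriving mid-reload
                 * re-sets it, so it cannot be forgotten. */
                got_sighup = 0;
                reload_exports();
                last_reload = time(NULL);
            }
            service_rpcs_briefly();
        }
    }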
>> 
>> Build a system that allows adding and removing entries from the
>> in-mountd exports data so that you do not have to do a full
>> reload every time one is added or removed?
>> 
>> Build a system that used 2 exports tables, the active one, and the
>> one that was being loaded, so that you can process RPC's and reloads
>> at the same time.
> Well, r348590 modified mountd so that it built a new set of linked-list
> structures from the modified exports file(s) and then compared them with
> the old ones, only doing updates to the kernel exports for changes.
> 
> It still processes the entire exports file each time, to produce the
> in-memory linked lists in mountd (using hash tables and a binary tree).
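
Schematically, that delta pass looks something like this (my simplification;
the real code uses hash tables and a tree rather than sorted arrays, and
kernel_export()/kernel_unexport() are hypothetical stand-ins for the
kernel-update calls):

    #include <stddef.h>
    #include <string.h>

    struct expent {
        const char *path;    /* mountpoint */
        const char *opts;    /* export options */
    };

    /* Hypothetical stand-ins for the kernel export updates. */
    static void kernel_export(const char *p, const char *o) { (void)p; (void)o; }
    static void kernel_unexport(const char *p) { (void)p; }

    /* Walk two export lists sorted by mountpoint and touch the kernel
     * only for entries that were added, removed or changed. */
    static void
    apply_delta(const struct expent *oldv, size_t no,
        const struct expent *newv, size_t nn)
    {
        size_t i = 0, j = 0;

        while (i < no || j < nn) {
            int c = (i == no) ? 1 : (j == nn) ? -1 :
                strcmp(oldv[i].path, newv[j].path);
            if (c < 0) {                  /* gone from the new file */
                kernel_unexport(oldv[i].path);
                i++;
            } else if (c > 0) {           /* newly added */
                kernel_export(newv[j].path, newv[j].opts);
                j++;
            } else {                      /* in both: update if changed */
                if (strcmp(oldv[i].opts, newv[j].opts) != 0)
                    kernel_export(newv[j].path, newv[j].opts);
                i++;
                j++;
            }
        }
    }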
> 
> Peter did send me a patch to use a db frontend, but he felt the only
> performance improvements would be related to ZFS.
> Since ZFS is something I avoid like the plague I never pursued it.
> (If anyone willing to do ZFS stuff wants to pursue this,
> just email and I can send you the patch.)
> Here's a snippet of what he said about it.
>> It looks like a very simple patch to create and even though it wouldn't
>> really improve the speed for the work that mountd does it would make
>> possible really drastic speed improvements in the zfs commands. They (zfs
>> commands) currently read the text-based exports file multiple times when
>> you do work with zfs filesystems (mounting/sharing/changing share options
>> etc). Using a db-based exports file for the zfs exports (b-tree based
>> probably) would allow the zfs code to be much faster.
> 
> At this point, I am just interested in fixing the problem in the PR, rick
> 
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"



