Date:      Wed, 3 Jun 2020 00:50:09 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Peter Eriksson <pen@lysator.liu.se>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Cc:        "Rodney W. Grimes" <freebsd-rwg@gndrsh.dnsmgr.net>, Mark Johnston <markj@FreeBSD.org>, "patrykkotlowski@gmail.com" <patrykkotlowski@gmail.com>
Subject:   Re: how to fix an interesting issue with mountd?
Message-ID:  <QB1PR01MB3649FD4CDDCF26148322BF01DD880@QB1PR01MB3649.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <7E0A7D8E-72E6-4D32-B2A7-C4CE4127DDEF@lysator.liu.se>
References:  <YTBPR01MB36647DA1465C7D35CDCE9681DD8B0@YTBPR01MB3664.CANPRD01.PROD.OUTLOOK.COM>, <7E0A7D8E-72E6-4D32-B2A7-C4CE4127DDEF@lysator.liu.se>

Peter Eriksson wrote:
> I once reported that we had a server with many thousands (typically 23000
> or so per server) of ZFS filesystems (and 300+ snapshots per filesystem)
> where mountd was 100% busy reading and updating the kernel (and, while
> doing that, holding the NFS lock for a very long time) every hour, when we
> took snapshots of all the filesystems - the code in the zfs commands sends
> a lot of SIGHUPs to mountd, it seems...
>
> (Causing NFS users to complain quite a bit.)
>
> I have also seen the effect that, when there are a lot of updates to
> filesystems, some exports can get "missed" if mountd is bombarded with
> multiple SIGHUPs - but with the new incremental update code in mountd this
> window (for SIGHUPs to get lost) is much smaller (and I now also have a
> Nagios check that verifies that all exports in /etc/zfs/exports are also
> visible in the kernel).
I just put a patch up in PR#246597, which you might want to try.

> But while we had this problem I also investigated going to a DB based
> exports "file" in order to make the code in the "zfs" commands that reads
> and updates /etc/zfs/exports a lot faster too. As Rick says, there is room
> for _huge_ improvements there.
>
> For every change of "sharenfs" per filesystem it would open, read, and
> parse, line-by-line, /etc/zfs/exports *two* times and then rewrite the
> whole file. Now imagine doing that recursively for 23000 filesystems... My
> change to the zfs code simply opened a DB file and just did a "put" of a
> record for the filesystem (and then sent mountd a SIGHUP).
Just to clarify, if someone else can put Peter's patch in ZFS, I am willing
to put the required changes in mountd.

>
> (And even worse - when doing the boot-time "zfs share -a", for each
> filesystem it would open /etc/zfs/exports, read it line by line and check
> to make sure the filesystem isn't already in the file, then open a tmp
> file, write out all the old filesystems plus the new one, rename it to
> /etc/zfs/exports, send a SIGHUP, and then go on to the next one... Repeat.
> Pretty fast for 1-10 filesystems, not so fast for 20000+ ones... And it
> tests the boot disk I/O a bit :-)
>
>
> I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit
> regarding this, and I think for Linux they are going the route of directly
> updating the kernel instead of going via some external updater (like
> mountd).
The problem here is NFSv3, where something (currently mountd) needs to know
about this stuff, so it can do the Mount protocol (used for NFSv3 mounting
and done with Mount RPCs, not NFS ones).

> That probably would be an even better way (for ZFS), but a DB file might
> be useful anyway. It's a very simple change (especially in mountd - it
> just opens the DB file and reads the records sequentially instead of the
> text file).
I think what you have, which puts the info in a db file and then SIGHUPs
mountd, is a good start.
Again, if someone else can get this into ZFS, I can put the bits in mountd.

Thanks for posting this, rick
ps: Do you happen to know how long a reload of exports in mountd is
    currently taking, with the patches done to it last year?

- Peter

On 2 Jun 2020, at 06:30, Rick Macklem <rmacklem@uoguelph.ca> wrote:

Rodney Grimes wrote:
Hi,

I'm posting this one to freebsd-net@ since it seems vaguely similar
to a network congestion problem, and I thought that network types
might have some ideas w.r.t. fixing it.

PR#246597 reports a problem (which, if I understand it, is) where a SIGHUP
  is posted to mountd and then another SIGHUP is posted to mountd while
  it is reloading exports, and the exports are not reloaded again.
  --> The simple patch in the PR fixes the above problem, but I think it
      will aggravate another one.
For some NFS servers, it can take minutes to reload the exports file(s).
(I believe Peter Eriksson has a server with 80000+ file systems exported.)
r348590 reduced the time taken, but it is still minutes, if I recall
correctly.
Actually, my recollection w.r.t. the times was way off.
I just looked at the old PR#237860 and, without r348590, it was 16 seconds
(seconds, not minutes) and with r348590 that went down to a fraction
of a second (there was no exact number in the PR, but I noted milliseconds
in the commit log entry).


I still think there is a risk of doing the reloads repeatedly.

--> If you apply the patch in the PR and SIGHUPs are posted to mountd as
      often as it takes to reload the exports file(s), it will simply
      reload the exports file(s) over and over and over again, instead of
      processing Mount RPC requests.

So, finally to the interesting part...
- It seems that the code needs to be changed so that it won't "forget"
  SIGHUP(s) posted to it, but it should not reload the exports file(s) too
  frequently.
--> My thoughts are something like:
  - Note that SIGHUP(s) were posted while reloading the exports file(s) and
    do the reload again, after some minimum delay.
    --> The minimum delay might only need to be 1 second, to allow some
        RPCs to be processed before the reload happens again.
    Or
    --> The minimum delay could be some fraction of how long a reload
        takes. (The code could time the reload and use that to calculate
        how long to delay before doing the reload again.)

Any ideas or suggestions? rick
ps: I've actually known about this for some time, but since I didn't have
    a good solution...

Build a system that allows adding and removing entries from the
in-mountd exports data so that you do not have to do a full
reload every time one is added or removed?

Build a system that uses 2 exports tables, the active one and the
one that is being loaded, so that you can process RPCs and reloads
at the same time.
Well, r348590 modified mountd so that it built a new set of linked-list
structures from the modified exports file(s) and then compared them with
the old ones, only doing updates to the kernel exports for changes.

It still processes the entire exports file each time, to produce the
in-memory linked lists in mountd (using hash tables and a binary tree).

Peter did send me a patch to use a db frontend, but he felt the only
performance improvements would be related to ZFS.
Since ZFS is something I avoid like the plague, I never pursued it.
(If anyone willing to do ZFS stuff wants to pursue this,
just email me and I can send you the patch.)
Here's a snippet of what he said about it:
> It looks like a very simple patch to create, and even though it wouldn't
> really improve the speed for the work that mountd does, it would make
> possible really drastic speed improvements in the zfs commands. They (zfs
> commands) currently read through the text-based exports file multiple
> times when you do work with zfs filesystems (mounting/sharing/changing
> share options etc). Using a db based exports file for the zfs exports
> (b-tree based probably) would allow the zfs code to be much faster.

At this point, I am just interested in fixing the problem in the PR, rick

_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"


