Date: Wed, 3 Jun 2020 00:50:09 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Peter Eriksson <pen@lysator.liu.se>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Cc: "Rodney W. Grimes" <freebsd-rwg@gndrsh.dnsmgr.net>, Mark Johnston <markj@FreeBSD.org>, "patrykkotlowski@gmail.com" <patrykkotlowski@gmail.com>
Subject: Re: how to fix an interesting issue with mountd?
Message-ID: <QB1PR01MB3649FD4CDDCF26148322BF01DD880@QB1PR01MB3649.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <7E0A7D8E-72E6-4D32-B2A7-C4CE4127DDEF@lysator.liu.se>
References: <YTBPR01MB36647DA1465C7D35CDCE9681DD8B0@YTBPR01MB3664.CANPRD01.PROD.OUTLOOK.COM>, <7E0A7D8E-72E6-4D32-B2A7-C4CE4127DDEF@lysator.liu.se>
Peter Eriksson wrote:
> I once reported that we had a server with many thousands (typically 23000 or so per server) of ZFS filesystems (and 300+ snapshots per filesystem) where mountd was 100% busy reading and updating the kernel (and, while doing that, holding the NFS lock for a very long time) every hour, when we took snapshots of all the filesystems - the code in the zfs commands seems to send a lot of SIGHUPs to mountd...
>
> (Causing NFS users to complain quite a bit.)
>
> I have also seen the effect that, when there are a lot of updates to filesystems, some exports can get "missed" if mountd is bombarded with multiple SIGHUPs - but with the new incremental update code in mountd this window (for SIGHUPs to get lost) is much smaller (and I now also have a Nagios check that verifies that all exports in /etc/zfs/exports are also visible in the kernel).

I just put a patch up in PR#246597, which you might want to try.

> But while we had this problem I also investigated going to a DB based exports "file", in order to make the code in the "zfs" commands that reads and updates /etc/zfs/exports a lot faster too. As Rick says, there is room for _huge_ improvements there.
>
> For every change of "sharenfs" per filesystem it would open, read and parse /etc/zfs/exports line by line *two* times, and then rewrite the whole file.
> Now imagine doing that recursively for 23000 filesystems... My change to the zfs code simply opened a DB file and did a "put" of a record for the filesystem (and then sent mountd a SIGHUP).

Just to clarify: if someone else can put Peter's patch in ZFS, I am willing to put the required changes in mountd.

> (And even worse - when doing the boot-time "zfs share -a", for each filesystem it would open /etc/zfs/exports, read it line by line to check that the filesystem isn't already in the file, then open a tmp file, write out all the old filesystems plus the new one, rename it to /etc/zfs/exports, and send a SIGHUP - and then go on to the next one. Repeat. Pretty fast for 1-10 filesystems, not so fast for 20000+ ones... And it tests the boot disk I/O a bit :-)
>
> I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit regarding this, and I think for Linux they are going the route of directly updating the kernel instead of going via some external updater (like mountd).

The problem here is NFSv3, where something (currently mountd) needs to know about this stuff so that it can serve the Mount protocol (used for NFSv3 mounting and done with Mount RPCs, not NFS ones).

> That probably would be an even better way (for ZFS), but a DB based exports file might be useful anyway.
> It's a very simple change (especially in mountd - it just opens the DB file and reads the records sequentially instead of the text file).

I think what you have, which puts the info in a db file and then SIGHUPs mountd, is a good start.
Again, if someone else can get this into ZFS, I can put the bits in mountd.

Thanks for posting this, rick
ps: Do you happen to know how long a reload of exports in mountd is currently taking, with the patches done to it last year?

- Peter

On 2 Jun 2020, at 06:30, Rick Macklem <rmacklem@uoguelph.ca> wrote:

Rodney Grimes wrote:
Hi,

I'm posting this one to freebsd-net@ since it seems vaguely similar to a network congestion problem, and thought that network types might have some ideas w.r.t. fixing it?

PR#246597 reports a problem where (if I understand it correctly) a SIGHUP is posted to mountd, then another SIGHUP is posted while it is reloading exports, and the exports are not reloaded again.
--> The simple patch in the PR fixes the above problem, but I think it will aggravate another one.
For some NFS servers, it can take minutes to reload the exports file(s).
(I believe Peter Eriksson has a server with 80000+ file systems exported.)
r348590 reduced the time taken, but it is still minutes, if I recall correctly.
Actually, my recollection w.r.t.
the times was way off.
I just looked at the old PR#237860 and, without r348590, it was 16 seconds (i.e. seconds, not minutes), and with r348590 that went down to a fraction of a second (there was no exact number in the PR, but I noted milliseconds in the commit log entry).

I still think there is a risk of doing the reloads repeatedly.

--> If you apply the patch in the PR and SIGHUPs are posted to mountd as often as it takes to reload the exports file(s), it will simply reload the exports file(s) over and over again, instead of processing Mount RPC requests.

So, finally, to the interesting part...
- It seems that the code needs to be changed so that it won't "forget" SIGHUP(s) posted to it, but it should not reload the exports file(s) too frequently.
--> My thoughts are something like:
- Note that SIGHUP(s) were posted while reloading the exports file(s), and do the reload again after some minimum delay.
--> The minimum delay might only need to be 1 second, to allow some RPCs to be processed before the reload happens again.
Or
--> The minimum delay could be some fraction of how long a reload takes.
(The code could time the reload and use that to calculate how long to delay before doing the reload again.)

Any ideas or suggestions?
rick
ps: I've actually known about this for some time, but since I didn't have a good solution...

Build a system that allows adding and removing entries from the in-mountd exports data, so that you do not have to do a full reload every time one is added or removed?

Build a system that uses 2 exports tables, the active one and the one being loaded, so that you can process RPCs and reloads at the same time.

Well, r348590 modified mountd so that it built a new set of linked list structures from the modified exports file(s) and then compared them with the old ones, only doing updates to the kernel exports for changes.

It still processes the entire exports file each time, to produce the in-mountd memory linked lists (using hash tables and a binary tree).

Peter did send me a patch to use a db frontend, but he felt the only performance improvements would be related to ZFS.
Since ZFS is something I avoid like the plague, I never pursued it.
(If anyone willing to do ZFS stuff wants to pursue this, just email me and I can send you the patch.)
Here's a snippet of what he said about it:

> It looks like a very simple patch to create, and even though it wouldn't really improve the speed for the work that mountd does, it would make possible really drastic speed improvements in the zfs commands. They (the zfs commands) currently read through the text-based exports file multiple times when you do work with zfs filesystems (mounting/sharing/changing share options etc).
> Using a db based exports file for the zfs exports (probably b-tree based) would allow the zfs code to be much faster.

At this point, I am just interested in fixing the problem in the PR, rick

_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"