Date: Thu, 5 May 2022 00:56:05 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Alan Somers <asomers@freebsd.org>
Cc: FreeBSD Stable ML <stable@freebsd.org>
Subject: Re: nfs client's OpenOwner count increases without bounds
Message-ID: <YT2PR01MB9730E95FC8997CC2A3FE5AEBDDC29@YT2PR01MB9730.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAOtMX2hNp3%2B0Zs1jvpVAW07KLxStX0z-khZ4Y_-GaPnO%2BYkM5g@mail.gmail.com>
References: <CAOtMX2jX8gC8xEr%2BfsQjZz8YmWX6haQxRe_-Jr5RSTdw14jkFQ@mail.gmail.com> <YT3PR01MB97376472A2BAF2FA0643F4F2DDC39@YT3PR01MB9737.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2hNp3%2B0Zs1jvpVAW07KLxStX0z-khZ4Y_-GaPnO%2BYkM5g@mail.gmail.com>

Alan Somers <asomers@freebsd.org> wrote:
> On Wed, May 4, 2022 at 5:23 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >
> > Alan Somers <asomers@freebsd.org> wrote:
> > > I have a FreeBSD 13 (tested on both 13.0-RELEASE and 13.1-RC5) desktop
> > > mounting /usr/home over NFS 4.2 from an 13.0-RELEASE server. It
> > > worked fine until a few weeks ago. Now, the desktop's performance
> > > slowly degrades. It becomes less and less responsive until I restart
> > > X after 2-3 days. /var/log/Xorg.0.log shows plenty of entries like
> > > "AT keyboard: client bug: event processing lagging behind by 112ms,
> > > your system is too slow". "top -S" shows that the busiest process is
> > > nfscl. A dtrace profile shows that nfscl is spending most of its time
> > > in nfscl_cleanup_common, in the loop over all nfsclowner objects.
> > > Running "nfsdumpstate" on the server shows thousands of OpenOwners for
> > > that client, and < 10 for any other NFS client. The OpenOwners
> > > increases by about 3000 per day. And yet, "fstat" shows only a couple
> > > hundred open files on the NFS file system. Why are OpenOwners so
> > > high? Killing most of my desktop processes doesn't seem to make a
> > > difference. Restarting X does improve the perceived responsiveness,
> > > though it does not change the number of OpenOwners.
> > >
> > > How can I figure out which process(es) are responsible for the
> > > excessive OpenOwners?
> > An OpenOwner represents a process on the client. The OpenOwner
> > name is an encoding of pid + process startup time.
> > However, I can't think of an easy way to get at the OpenOwner name.
> >
> > Now, why aren't they going away, hmm..
> >
> > I'm assuming the # of Opens is not large?
> > (Openowners cannot go away until all associated opens
> > are closed.)
>
> Oh, I didn't mention that yes the number of Opens is large. Right
> now, for example, I have 7950 OpenOwner and 8277 Open.
Well, the openowners cannot go away until the opens go away,
so the problem is that the opens are not getting closed.

Close happens when the v_usecount on the vnode goes to zero.
Something is retaining the v_usecount. One possibility is that most
of the opens are for the same file, but with different openowners.
If that is the case, the "oneopenown" mount option will deal with it.

Another possibility is that something is retaining a v_usecount
reference on a lot of the vnodes. (This used to happen when a nullfs
mount with caching enabled was on top of the nfs mount.)
I don't know what other things might do that?

> >
> > Commit 1cedb4ea1a79 in main changed the semantics of this
> > a little, to avoid a use-after-free bug. However, it is dated
> > Feb. 25, 2022 and is not in 13.0, so I don't think it could
> > be the culprit.
> >
> > Essentially, the function called nfscl_cleanupkext() should call
> > nfscl_procdoesntexist(), which returns true after the process has
> > exited and when that is the case, calls nfscl_cleanup_common().
> > --> nfscl_cleanup_common() will either get rid of the openowner or,
> > if there are still children with open file descriptors, mark it "defunct"
> > so it can be free'd once the children close the file.
> >
> > It could be that X is now somehow creating a long chain of processes
> > where the children inherit a file descriptor and that delays the cleanup
> > indefinitely?
> > Even then, everything should get cleaned up once you kill off X?
> > (It might take a couple of seconds after killing all the processes off.)
> >
> > Another possibility is that the "nfscl" thread is wedged somehow.
> > It is the one that will call nfscl_cleanupkext() once/sec. If it never
> > gets called, the openowners will never go away.
> >
> > Being old fashioned, I'd probably try to figure this out by adding
> > some printf()s to nfscl_cleanupkext() and nfscl_cleanup_common().
>
> dtrace shows that nfscl_cleanupkext() is getting called at about 0.6 hz.
That sounds ok. Since there are a lot of opens/openowners, it probably
is getting behind.

> >
> > To avoid the problem, you can probably just use the "oneopenown"
> > mount option. With that option, only one openowner is used for
> > all opens. (Having separate openowners for each process was needed
> > for NFSv4.0, but not NFSv4.1/4.2.)
> >
> > > Or is it just a red herring and I shouldn't
> > > worry?
> > Well, you can probably avoid the problem by using the "oneopenown"
> > mount option.
>
> Ok, I'm trying that now. After unmounting and remounting NFS,
> "nfsstat -cE" reports 1 OpenOwner and 11 Opens". But on the server,
> "nfsdumpstate" still reports thousands. Will those go away
> eventually?
If the opens are gone then, yes, they will go away. They are retained for
a little while so that another Open against the openowner does not need
to recreate the openowner (which also implied an extra RPC to confirm
the openowner in NFSv4.0).

I think they go away after a few minutes, if I recall correctly.
If the server thinks there are still Opens, then they will not go away.

rick

>
> Thanks for reporting this, rick
> ps: And, yes, large numbers of openowners will slow things down,
> since the code ends up doing linear scans of them all in a linked
> list in various places.
>
> -Alan
>
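
The cleanup rule described in the message (once a second, check whether
each openowner's process still exists; if it has exited, either free the
openowner or, when opens are still outstanding, mark it defunct until
they close) can be pictured with a small toy model. This is only a
sketch, not the kernel code: the OpenOwner class, its fields and
cleanup_pass() are hypothetical names made up for illustration, and the
model ignores the short period during which idle openowners are retained.

```python
from dataclasses import dataclass


@dataclass
class OpenOwner:
    """Hypothetical stand-in for one client-side openowner."""
    pid: int               # process the openowner represents
    opens: int             # Opens still outstanding under this openowner
    defunct: bool = False  # set once the process is gone but opens remain


def cleanup_pass(owners, live_pids):
    """Model one periodic cleanup pass; return the openowners that survive."""
    remaining = []
    for oo in owners:
        if oo.pid in live_pids:
            # Process still exists: nothing to clean up.
            remaining.append(oo)
        elif oo.opens == 0:
            # Process exited and every open is closed: the openowner goes away.
            continue
        else:
            # Process exited but opens remain: keep it, marked defunct,
            # so a later pass can free it once the opens are closed.
            oo.defunct = True
            remaining.append(oo)
    return remaining


owners = [OpenOwner(100, 2), OpenOwner(200, 0), OpenOwner(300, 1)]
owners = cleanup_pass(owners, live_pids={100})
# pid 100 is alive and stays; pid 200 is freed; pid 300 stays, marked defunct.
owners[1].opens = 0  # the last open under pid 300's openowner is closed
owners = cleanup_pass(owners, live_pids={100})
# Only the openowner for pid 100 is left.
```

The model also shows why the count grows without bounds when opens are
never closed: an openowner whose opens count never reaches zero is
carried over on every pass, and each pass is a linear scan over all of
them.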