Date:      Thu, 5 May 2022 14:49:32 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Alan Somers <asomers@freebsd.org>
Cc:        FreeBSD Stable ML <stable@freebsd.org>
Subject:   Re: nfs client's OpenOwner count increases without bounds
Message-ID:  <YT2PR01MB9730CC1008ED2B5450AC02A9DDC29@YT2PR01MB9730.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAOtMX2gf-qxJkLCdfvXgLnNE_8jQU2-JwZxb-meDYVm0WKFH-A@mail.gmail.com>
References:  <CAOtMX2jX8gC8xEr%2BfsQjZz8YmWX6haQxRe_-Jr5RSTdw14jkFQ@mail.gmail.com> <YT3PR01MB97376472A2BAF2FA0643F4F2DDC39@YT3PR01MB9737.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2hNp3%2B0Zs1jvpVAW07KLxStX0z-khZ4Y_-GaPnO%2BYkM5g@mail.gmail.com> <YT2PR01MB9730E95FC8997CC2A3FE5AEBDDC29@YT2PR01MB9730.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2gf-qxJkLCdfvXgLnNE_8jQU2-JwZxb-meDYVm0WKFH-A@mail.gmail.com>

Alan Somers <asomers@freebsd.org> wrote:
> On Wed, May 4, 2022 at 6:56 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >
> > Alan Somers <asomers@freebsd.org> wrote:
> > > On Wed, May 4, 2022 at 5:23 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
> > > >
> > > > Alan Somers <asomers@freebsd.org> wrote:
> > > > > I have a FreeBSD 13 (tested on both 13.0-RELEASE and 13.1-RC5) desktop
> > > > > mounting /usr/home over NFS 4.2 from a 13.0-RELEASE server.  It
> > > > > worked fine until a few weeks ago.  Now, the desktop's performance
> > > > > slowly degrades.  It becomes less and less responsive until I restart
> > > > > X after 2-3 days.  /var/log/Xorg.0.log shows plenty of entries like
> > > > > "AT keyboard: client bug: event processing lagging behind by 112ms,
> > > > > your system is too slow".  "top -S" shows that the busiest process is
> > > > > nfscl.  A dtrace profile shows that nfscl is spending most of its time
> > > > > in nfscl_cleanup_common, in the loop over all nfsclowner objects.
> > > > > Running "nfsdumpstate" on the server shows thousands of OpenOwners for
> > > > > that client, and < 10 for any other NFS client.  The OpenOwner count
> > > > > increases by about 3000 per day.  And yet, "fstat" shows only a couple
> > > > > hundred open files on the NFS file system.  Why are OpenOwners so
> > > > > high?  Killing most of my desktop processes doesn't seem to make a
> > > > > difference.  Restarting X does improve the perceived responsiveness,
> > > > > though it does not change the number of OpenOwners.
> > > > >
> > > > > How can I figure out which process(es) are responsible for the
> > > > > excessive OpenOwners?
> > > > An OpenOwner represents a process on the client. The OpenOwner
> > > > name is an encoding of pid + process startup time.
> > > > However, I can't think of an easy way to get at the OpenOwner name.
> > > >
> > > > Now, why aren't they going away, hmm..
> > > >
> > > > I'm assuming the # of Opens is not large?
> > > > (Openowners cannot go away until all associated opens
> > > >  are closed.)
> > >
> > > Oh, I didn't mention that, yes, the number of Opens is large.  Right
> > > now, for example, I have 7950 OpenOwners and 8277 Opens.
> > Well, the openowners cannot go away until the opens go away,
> > so the problem is that the opens are not getting closed.
> >
> > Close happens when the v_usecount on the vnode goes to zero.
> > Something is retaining the v_usecount. One possibility is that most
> > of the opens are for the same file, but with different openowners.
> > If that is the case, the "oneopenown" mount option will deal with it.
> >
> > Another possibility is that something is retaining a v_usecount
> > reference on a lot of the vnodes. (This used to happen when a nullfs
> > mount with caching enabled was on top of the nfs mount.)
> > I don't know what other things might do that.
>
> Yeah, I remember the nullfs problem.  But I'm not using nullfs on this
> computer anymore.  Is there any debugging facility that can list
> vnodes?  All I know of is "fstat", and that doesn't show anywhere near
> the number of NFS Opens.
Don't ask me. My debugging technology consists of printf()s.
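[Editor's note: as a middle ground between printf()s and a full dtrace
profile, the cleanup path can be watched with a DTrace one-liner along
these lines. This is a sketch: it assumes the fbt provider exposes these
kernel functions on your kernel, which should be verified first.]

```shell
# List the available probes first (names are kernel-version dependent):
#   dtrace -l -n 'fbt::nfscl_cleanup*:entry'
# Then count how often the cleanup functions run, every 10 seconds:
dtrace -n '
fbt::nfscl_cleanupkext:entry    { @kext = count(); }
fbt::nfscl_cleanup_common:entry { @common = count(); }
tick-10s
{
        printa("cleanupkext: %@d  cleanup_common: %@d\n", @kext, @common);
        trunc(@kext);
        trunc(@common);
}'
```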

An NFSv4 Open is for a <clientid, openowner (represents a process on the
client), file> tuple. It is probably the same file being opened by many
different processes. The "oneopenown" option makes the client use the
same openowner for all opens, so that there is only one open per file.

> >
> > > >
> > > > Commit 1cedb4ea1a79 in main changed the semantics of this
> > > > a little, to avoid a use-after-free bug. However, it is dated
> > > > Feb. 25, 2022 and is not in 13.0, so I don't think it could
> > > > be the culprit.
> > > >
> > > > Essentially, the function called nfscl_cleanupkext() should call
> > > > nfscl_procdoesntexist(), which returns true after the process has
> > > > exited and, when that is the case, calls nfscl_cleanup_common().
> > > > --> nfscl_cleanup_common() will either get rid of the openowner or,
> > > >       if there are still children with open file descriptors, mark it "defunct"
> > > >       so it can be free'd once the children close the file.
> > > >
> > > > It could be that X is now somehow creating a long chain of processes
> > > > where the children inherit a file descriptor, and that delays the cleanup
> > > > indefinitely?
> > > > Even then, everything should get cleaned up once you kill off X?
> > > > (It might take a couple of seconds after killing all the processes off.)
> > > >
> > > > Another possibility is that the "nfscl" thread is wedged somehow.
> > > > It is the one that will call nfscl_cleanupkext() once/sec. If it never
> > > > gets called, the openowners will never go away.
> > > >
> > > > Being old-fashioned, I'd probably try to figure this out by adding
> > > > some printf()s to nfscl_cleanupkext() and nfscl_cleanup_common().
> > >
> > > dtrace shows that nfscl_cleanupkext() is getting called at about 0.6 Hz.
> > That sounds ok. Since there are a lot of opens/openowners, it probably
> > is getting behind.
> >
> > > >
> > > > To avoid the problem, you can probably just use the "oneopenown"
> > > > mount option. With that option, only one openowner is used for
> > > > all opens. (Having separate openowners for each process was needed
> > > > for NFSv4.0, but not NFSv4.1/4.2.)
> > > >
> > > > > Or is it just a red herring and I shouldn't
> > > > > worry?
> > > > Well, you can probably avoid the problem by using the "oneopenown"
> > > > mount option.
> > >
> > > Ok, I'm trying that now.  After unmounting and remounting NFS,
> > > "nfsstat -cE" reports 1 OpenOwner and 11 Opens.  But on the server,
> > > "nfsdumpstate" still reports thousands.  Will those go away
> > > eventually?
> > If the opens are gone then, yes, they will go away. They are retained for
> > a little while so that another Open against the openowner does not need
> > to recreate the openowner (which also implied an extra RPC to confirm
> > the openowner in NFSv4.0).
> >
> > I think they go away after a few minutes, if I recall correctly.
> > If the server thinks there are still Opens, then they will not go away.
>
> Uh, they aren't going away.  It's been a few hours now, and the NFS
> server still reports the same number of opens and openowners.
Yes, the openowners won't go away until the opens go away, and the
opens don't go away until the client closes them. (Once the opens are
closed, the openowners go away after something like 5 minutes.)

For NFSv4.0, the unmount does a SetclientID/SetclientIDconfirm, which
gets rid of all opens at the server. However, NFSv4.1/4.2 does not have
this. It has a DestroyClient, but it is required to return NFSERR_CLIENTBUSY
if there are outstanding opens. (Servers are not supposed to "forget" opens,
except when they crash. Even then, if they have something like non-volatile
RAM, they can remember opens through a reboot. FreeBSD does forget them
upon reboot.)
Maybe for 4.1/4.2 the client should try to close any outstanding opens.
(Normally, they should all be closed once all files are POSIX closed. I
 suspect that it didn't happen because the "nfscl" thread was killed off
 during unmount before it got around to doing all of them.)
I'll look at this.

How to get rid of them now...
- I think an nfsrevoke(8) on the clientid will do so. However, if the same
   clientid is in use for your current mount, you'll need to unmount before
   doing so.
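[Editor's note: the sequence might look like the following sketch; the
clientid value is a placeholder that would be read out of the server's
state dump, and the mount point is hypothetical.]

```shell
# On the client: unmount first, so the clientid is no longer in use.
umount /usr/home

# On the server: find the client's clientid in the state dump,
# then revoke it (0x1a2b3c4d is a placeholder value).
nfsdumpstate | less
nfsrevoke 0x1a2b3c4d

# Verify that the stale opens/openowners are gone.
nfsdumpstate | grep -c Open
```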

Otherwise, I think they'll be there until a server reboot (or a
kldunload/kldload of the nfsd, if it is not built into the kernel).
Even a restart of the nfsd daemon does not get rid of them, since the
"server should never forget opens" rule is applied.

rick

>
> rick
>
> >
> > Thanks for reporting this, rick
> > ps: And, yes, large numbers of openowners will slow things down,
> >       since the code ends up doing linear scans of them all in a linked
> >       list in various places.
> >
> > -Alan
> >


