From: Alan Somers
Date: Wed, 4 May 2022 20:51:23 -0600
Subject: Re: nfs client's OpenOwner count increases without bounds
To: Rick Macklem
Cc: FreeBSD Stable ML
List-Archive: https://lists.freebsd.org/archives/freebsd-stable
On Wed, May 4, 2022 at 6:56 PM Rick Macklem wrote:
>
> Alan Somers wrote:
> > On Wed, May 4, 2022 at 5:23 PM Rick Macklem wrote:
> > >
> > > Alan Somers wrote:
> > > > I have a FreeBSD 13 (tested on both 13.0-RELEASE and 13.1-RC5) desktop
> > > > mounting /usr/home over NFS 4.2 from a 13.0-RELEASE server. It
> > > > worked fine until a few weeks ago. Now, the desktop's performance
> > > > slowly degrades. It becomes less and less responsive until I restart
> > > > X after 2-3 days. /var/log/Xorg.0.log shows plenty of entries like
> > > > "AT keyboard: client bug: event processing lagging behind by 112ms,
> > > > your system is too slow". "top -S" shows that the busiest process is
> > > > nfscl. A dtrace profile shows that nfscl is spending most of its time
> > > > in nfscl_cleanup_common, in the loop over all nfsclowner objects.
> > > > Running "nfsdumpstate" on the server shows thousands of OpenOwners for
> > > > that client, and < 10 for any other NFS client. The number of
> > > > OpenOwners increases by about 3000 per day. And yet, "fstat" shows
> > > > only a couple hundred open files on the NFS file system. Why are
> > > > OpenOwners so high? Killing most of my desktop processes doesn't
> > > > seem to make a difference. Restarting X does improve the perceived
> > > > responsiveness, though it does not change the number of OpenOwners.
> > > >
> > > > How can I figure out which process(es) are responsible for the
> > > > excessive OpenOwners?
> > > An OpenOwner represents a process on the client. The OpenOwner
> > > name is an encoding of pid + process startup time.
> > > However, I can't think of an easy way to get at the OpenOwner name.
> > >
> > > Now, why aren't they going away, hmm..
> > >
> > > I'm assuming the # of Opens is not large?
> > > (Openowners cannot go away until all associated opens
> > > are closed.)
> >
> > Oh, I didn't mention that yes, the number of Opens is large. Right
> > now, for example, I have 7950 OpenOwners and 8277 Opens.
> Well, the openowners cannot go away until the opens go away,
> so the problem is that the opens are not getting closed.
>
> Close happens when the v_usecount on the vnode goes to zero.
> Something is retaining the v_usecount. One possibility is that most
> of the opens are for the same file, but with different openowners.
> If that is the case, the "oneopenown" mount option will deal with it.
>
> Another possibility is that something is retaining a v_usecount
> reference on a lot of the vnodes. (This used to happen when a nullfs
> mount with caching enabled was on top of the nfs mount.)
> I don't know what other things might do that?

Yeah, I remember the nullfs problem. But I'm not using nullfs on this
computer anymore. Is there any debugging facility that can list
vnodes? All I know of is "fstat", and that doesn't show anywhere near
the number of NFS Opens.

> > >
> > > Commit 1cedb4ea1a79 in main changed the semantics of this
> > > a little, to avoid a use-after-free bug. However, it is dated
> > > Feb. 25, 2022 and is not in 13.0, so I don't think it could
> > > be the culprit.
> > >
> > > Essentially, the function called nfscl_cleanupkext() should call
> > > nfscl_procdoesntexist(), which returns true after the process has
> > > exited, and when that is the case, calls nfscl_cleanup_common().
> > > --> nfscl_cleanup_common() will either get rid of the openowner or,
> > > if there are still children with open file descriptors, mark it "defunct"
> > > so it can be free'd once the children close the file.
> > >
> > > It could be that X is now somehow creating a long chain of processes
> > > where the children inherit a file descriptor and that delays the cleanup
> > > indefinitely?
> > > Even then, everything should get cleaned up once you kill off X?
> > > (It might take a couple of seconds after killing all the processes off.)
> > >
> > > Another possibility is that the "nfscl" thread is wedged somehow.
> > > It is the one that will call nfscl_cleanupkext() once/sec. If it never
> > > gets called, the openowners will never go away.
> > >
> > > Being old-fashioned, I'd probably try to figure this out by adding
> > > some printf()s to nfscl_cleanupkext() and nfscl_cleanup_common().
> >
> > dtrace shows that nfscl_cleanupkext() is getting called at about 0.6 Hz.
> That sounds ok. Since there are a lot of opens/openowners, it probably
> is getting behind.
>
> > >
> > > To avoid the problem, you can probably just use the "oneopenown"
> > > mount option. With that option, only one openowner is used for
> > > all opens. (Having separate openowners for each process was needed
> > > for NFSv4.0, but not NFSv4.1/4.2.)
> > >
> > > > Or is it just a red herring and I shouldn't
> > > > worry?
> > > Well, you can probably avoid the problem by using the "oneopenown"
> > > mount option.
> >
> > Ok, I'm trying that now. After unmounting and remounting NFS,
> > "nfsstat -cE" reports 1 OpenOwner and 11 Opens. But on the server,
> > "nfsdumpstate" still reports thousands. Will those go away
> > eventually?
> If the opens are gone then, yes, they will go away. They are retained for
> a little while so that another Open against the openowner does not need
> to recreate the openowner (which also implied an extra RPC to confirm
> the openowner in NFSv4.0).
>
> I think they go away after a few minutes, if I recall correctly.
> If the server thinks there are still Opens, then they will not go away.

Uh, they aren't going away. It's been a few hours now, and the NFS
server still reports the same number of opens and openowners.

>
> rick
>
> >
> > Thanks for reporting this, rick
> > ps: And, yes, large numbers of openowners will slow things down,
> > since the code ends up doing linear scans of them all in a linked
> > list in various places.
> >
> > -Alan
> >
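
For reference, a minimal sketch of remounting with the "oneopenown" option
discussed above, assuming an NFSv4.2 mount of /usr/home and a hypothetical
server name "nfs-server" (the option itself is documented in mount_nfs(8)):

  # remount with a single openowner shared by all opens
  umount /usr/home
  mount -t nfs -o nfsv4,minorversion=2,oneopenown nfs-server:/usr/home /usr/home

  # or as an /etc/fstab entry:
  # nfs-server:/usr/home  /usr/home  nfs  rw,nfsv4,minorversion=2,oneopenown  0  0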
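
The roughly 0.6 Hz call rate mentioned above can be re-measured with a dtrace
one-liner along these lines; this is only a sketch, and it assumes fbt probes
are available for these functions (i.e. they have not been inlined):

  # count cleanup-related calls in the NFS client over 60 seconds
  dtrace -n '
      fbt::nfscl_cleanupkext:entry,
      fbt::nfscl_cleanup_common:entry,
      fbt::nfscl_procdoesntexist:entry { @calls[probefunc] = count(); }
      tick-60s { exit(0); }'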
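
To watch whether the server-side state actually drains after remounting, the
counts discussed in the thread ("nfsstat -cE" on the client, "nfsdumpstate" on
the server) can be sampled periodically with a loop like the one below; the
grep pattern and the "nfs-server" hostname are assumptions, and the exact
nfsstat field layout varies between releases:

  # rough periodic snapshot of client and server NFSv4 state counts
  while :; do
      date
      nfsstat -cE | grep -A 1 OpenOwner     # client-side OpenOwner/Opens counters
      ssh nfs-server nfsdumpstate | wc -l   # approximate number of server-side state entries
      sleep 600
  done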