From nobody Thu May 5 15:23:00 2022
List-Id: Production branch of FreeBSD source code
List-Archive:
https://lists.freebsd.org/archives/freebsd-stable
Sender: owner-freebsd-stable@freebsd.org
From: Alan Somers
Date: Thu, 5 May 2022 09:23:00 -0600
Subject: Re: nfs client's OpenOwner count increases without bounds
To: Rick Macklem
Cc: FreeBSD Stable ML

On Thu, May 5, 2022 at 8:49 AM Rick Macklem wrote:
>
> Alan Somers wrote:
> > On Wed, May 4, 2022 at 6:56 PM Rick Macklem wrote:
> > >
> > > Alan Somers wrote:
> > > > On Wed, May 4, 2022 at 5:23 PM Rick Macklem wrote:
> > > > >
> > > > > Alan Somers wrote:
> > > > > > I have a FreeBSD 13 (tested on both 13.0-RELEASE and 13.1-RC5) desktop
> > > > > > mounting /usr/home over NFS 4.2 from a
> > > > > > 13.0-RELEASE server. It
> > > > > > worked fine until a few weeks ago. Now, the desktop's performance
> > > > > > slowly degrades. It becomes less and less responsive until I restart
> > > > > > X after 2-3 days. /var/log/Xorg.0.log shows plenty of entries like
> > > > > > "AT keyboard: client bug: event processing lagging behind by 112ms,
> > > > > > your system is too slow". "top -S" shows that the busiest process is
> > > > > > nfscl. A dtrace profile shows that nfscl is spending most of its time
> > > > > > in nfscl_cleanup_common, in the loop over all nfsclowner objects.
> > > > > > Running "nfsdumpstate" on the server shows thousands of OpenOwners for
> > > > > > that client, and < 10 for any other NFS client. The OpenOwners
> > > > > > increase by about 3000 per day. And yet, "fstat" shows only a couple
> > > > > > hundred open files on the NFS file system. Why are OpenOwners so
> > > > > > high? Killing most of my desktop processes doesn't seem to make a
> > > > > > difference. Restarting X does improve the perceived responsiveness,
> > > > > > though it does not change the number of OpenOwners.
> > > > > >
> > > > > > How can I figure out which process(es) are responsible for the
> > > > > > excessive OpenOwners?
> > > > > An OpenOwner represents a process on the client. The OpenOwner
> > > > > name is an encoding of pid + process startup time.
> > > > > However, I can't think of an easy way to get at the OpenOwner name.
> > > > >
> > > > > Now, why aren't they going away, hmm..
> > > > >
> > > > > I'm assuming the # of Opens is not large?
> > > > > (Openowners cannot go away until all associated opens
> > > > > are closed.)
> > > >
> > > > Oh, I didn't mention that yes, the number of Opens is large. Right
> > > > now, for example, I have 7950 OpenOwner and 8277 Open.
> > > Well, the openowners cannot go away until the opens go away,
> > > so the problem is that the opens are not getting closed.
> > >
> > > Close happens when the v_usecount on the vnode goes to zero.
> > > Something is retaining the v_usecount. One possibility is that most
> > > of the opens are for the same file, but with different openowners.
> > > If that is the case, the "oneopenown" mount option will deal with it.
> > >
> > > Another possibility is that something is retaining a v_usecount
> > > reference on a lot of the vnodes. (This used to happen when a nullfs
> > > mount with caching enabled was on top of the nfs mount.)
> > > I don't know what other things might do that?
> >
> > Yeah, I remember the nullfs problem. But I'm not using nullfs on this
> > computer anymore. Is there any debugging facility that can list
> > vnodes? All I know of is "fstat", and that doesn't show anywhere near
> > the number of NFS Opens.
> Don't ask me. My debugging technology consists of printf()s.
>
> An NFSv4 Open is for a <clientid, openowner (which represents a process
> on the client), file>. It is probably opening the same file by many different
> processes. The "oneopenown" option makes the client use the same
> openowner for all opens, so that there is one open per file.
>
> > > > >
> > > > > Commit 1cedb4ea1a79 in main changed the semantics of this
> > > > > a little, to avoid a use-after-free bug. However, it is dated
> > > > > Feb. 25, 2022 and is not in 13.0, so I don't think it could
> > > > > be the culprit.
> > > > >
> > > > > Essentially, the function called nfscl_cleanupkext() should call
> > > > > nfscl_procdoesntexist(), which returns true after the process has
> > > > > exited and, when that is the case, calls nfscl_cleanup_common().
> > > > > --> nfscl_cleanup_common() will either get rid of the openowner or,
> > > > > if there are still children with open file descriptors, mark it "defunct"
> > > > > so it can be free'd once the children close the file.
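[For readers finding this thread in the archive: the "oneopenown" option Rick describes is set at mount time. A sketch, with hedged assumptions: the server name and paths below are placeholders, not taken from this thread.]

```shell
# Placeholder names: "server" and /usr/home are illustrative only.
# Mount NFSv4.2 with a single openowner shared by all client processes,
# as suggested in this thread for NFSv4.1/4.2 mounts.
mount -t nfs -o nfsv4,minorversion=2,oneopenown server:/usr/home /usr/home

# Or make it persistent with an /etc/fstab entry:
# server:/usr/home  /usr/home  nfs  rw,nfsv4,minorversion=2,oneopenown  0  0

# Check the client-side state counts afterwards (extended NFSv4 stats):
nfsstat -cE
```

With this option in effect, the thread reports the client dropping from thousands of OpenOwners to exactly one.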
> > > > >
> > > > > It could be that X is now somehow creating a long chain of processes
> > > > > where the children inherit a file descriptor and that delays the cleanup
> > > > > indefinitely?
> > > > > Even then, everything should get cleaned up once you kill off X?
> > > > > (It might take a couple of seconds after killing all the processes off.)
> > > > >
> > > > > Another possibility is that the "nfscl" thread is wedged somehow.
> > > > > It is the one that will call nfscl_cleanupkext() once/sec. If it never
> > > > > gets called, the openowners will never go away.
> > > > >
> > > > > Being old fashioned, I'd probably try to figure this out by adding
> > > > > some printf()s to nfscl_cleanupkext() and nfscl_cleanup_common().
> > > >
> > > > dtrace shows that nfscl_cleanupkext() is getting called at about 0.6 hz.
> > > That sounds ok. Since there are a lot of opens/openowners, it probably
> > > is getting behind.
> > >
> > > > > To avoid the problem, you can probably just use the "oneopenown"
> > > > > mount option. With that option, only one openowner is used for
> > > > > all opens. (Having separate openowners for each process was needed
> > > > > for NFSv4.0, but not NFSv4.1/4.2.)
> > > > >
> > > > > > Or is it just a red herring and I shouldn't
> > > > > > worry?
> > > > > Well, you can probably avoid the problem by using the "oneopenown"
> > > > > mount option.
> > > >
> > > > Ok, I'm trying that now. After unmounting and remounting NFS,
> > > > "nfsstat -cE" reports 1 OpenOwner and 11 Opens. But on the server,
> > > > "nfsdumpstate" still reports thousands. Will those go away
> > > > eventually?
> > > If the opens are gone then, yes, they will go away. They are retained for
> > > a little while so that another Open against the openowner does not need
> > > to recreate the openowner (which also implied an extra RPC to confirm
> > > the openowner in NFSv4.0).
> > >
> > > I think they go away after a few minutes, if I recall correctly.
> > > If the server thinks there are still Opens, then they will not go away.
> >
> > Uh, they aren't going away. It's been a few hours now, and the NFS
> > server still reports the same number of opens and openowners.
> Yes, the openowners won't go away until the opens go away, and the
> opens don't go away until the client closes them. (Once the opens are
> closed, the openowners go away after something like 5 minutes.)
>
> For NFSv4.0, the unmount does a SetclientID/SetclientIDconfirm, which
> gets rid of all opens at the server. However, NFSv4.1/4.2 does not have
> this. It has a DestroyClient, but it is required to return NFSERR_CLIENTBUSY
> if there are outstanding opens (servers are not supposed to "forget" opens,
> except when they crash). Even then, if they have something like non-volatile
> ram, they can remember opens through a reboot. (FreeBSD does forget them
> upon reboot.)
> Maybe for 4.1/4.2 the client should try and close any outstanding opens.
> (Normally, they should all be closed once all files are POSIX closed. I
> suspect that it didn't happen because the "nfscl" thread was killed off
> during unmount before it got around to doing all of them.)
> I'll look at this.
>
> How to get rid of them now...
> - I think an nfsrevoke(8) on the clientid will do so. However, if the same
>   clientid is in use for your current mount, you'll need to unmount before
>   doing so.
>
> Otherwise, I think they'll be there until a server reboot (or kldunload/kldload
> of the nfsd, if it is not built into the kernel). Even a restart of the nfsd daemon
> does not get rid of them, since the "server should never forget opens" rule
> is applied.

As it turns out, the excessive opens disappeared from the server sometime
overnight. They disappeared eventually, but it took hours rather than
minutes. And using oneopenown on the client, there are now only a modest
number of opens (133) and exactly one openowner. So I think it will
certainly work for my use case.
-Alan
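[Archive note: the manual recovery path Rick outlines above, sketched as commands. This is a hedged illustration only: the mount point and the clientid value are placeholders, and the exact nfsdumpstate output layout should be checked against nfsdumpstate(8) on the server.]

```shell
# On the client: unmount first, so the clientid being revoked is no
# longer in use by an active mount.
umount /usr/home

# On the server: dump NFSv4 state and locate the stale client's
# ClientID (consult nfsdumpstate(8) for the column layout).
nfsdumpstate

# Revoke that client's state. The hex value below is a placeholder,
# not a real ClientID from this thread.
nfsrevoke 0x1234abcd5678ef90
```

Per the end of the thread this turned out to be unnecessary here, since the server expired the stale opens on its own after several hours.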