Date: Sat, 28 May 2022 00:12:41 +0200
From: Andreas Kempe
To: Rick Macklem
Cc: "freebsd-fs@freebsd.org"
Subject: Re: FreeBSD 12.3/13.1 NFS client hang
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

On Fri, May 27, 2022 at 08:59:57PM +0000, Rick Macklem wrote:
> Andreas Kempe wrote:
> > Hello everyone!
> >
> > I'm having issues with the NFS clients on FreeBSD 12.3 and 13.1
> > systems hanging when using a CentOS 7 server.
> First, make sure you are using hard mounts. "soft" or "intr" mounts won't
> work and will mess up the session sooner or later. (A messed up session could
> result in no free slots on the session and that will wedge threads in
> nfsv4_sequencelookup() as you describe.)
> (This is briefly described in the BUGS section of "man mount_nfs".)
>

I had totally missed that soft and interruptible mounts have these
issues. I switched the FreeBSD machines to soft and intr on purpose, to
be able to fix hung mounts without having to restart the machine on
NFS hangs. Since they are shared machines, it is an inconvenience for
other users if one user causes a hang.

Switching our test machine back to hard mounts did prevent recursive
grep from immediately causing the slot type hang again.

> Do a:
> # nfsstat -m
> on the clients and look for "hard".
>
> Next, is there anything logged on the console for the 13.1 client(s)?
> (13.1 has some diagnostics for things like a server replying with the
> wrong session slot#.)
>

The only thing we have seen logged is messages along the lines of:

  kernel: newnfs: server 'mail' error: fileid changed. fsid 4240eca6003a052a:0: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)

> Also, maybe I'm old fashioned, but I find "ps axHl" useful, since it shows
> where all the processes are sleeping.
> And "procstat -kk" covers all of the locks.
>

I don't know if it is a matter of being old fashioned as much as one
of taste. :) In future dumps, I can provide both ps axHl and procstat -kk.

> > Below are procstat kstack $PID invocations showing where the processes
> > have hung. In the nfsv4_sequencelookup it seems hung waiting for
> > nfsess_slots to have an available slot. In the second nfs_lock case,
> > it seems the processes are stuck waiting on vnode locks.
> >
> > These issues appear seemingly at random, but also if
> > operations that open a lot of files or create a lot of file locks are
> > used. An example that can often provoke a hang is performing a
> > recursive grep through a large file hierarchy like the FreeBSD
> > codebase.
> >
> > The NFS code is large and complicated so any advice is appreciated!
> Yea. I'm the author and I don't know exactly what it all does;-)
>
> > Cordially,
> > Andreas Kempe
> >
> > [...]
> >
> Not very useful unless you have all the processes and their locks to try and figure out what is holding
> the vnode locks.
>

Yes, I sent this mostly in the hope that it might be something that
someone has seen before. I understand that more verbose information is
needed to track down the lock contention.

I'll switch our machines back to using hard mounts and try to get as
much diagnostic information as possible when the next lockup happens.
Do you have any good suggestions for tracking down the issue?
I've been contemplating enabling WITNESS or building with debug
information to be able to hook in the kernel debugger.

Thank you very much for your reply!

Cordially,
Andreas Kempe

> rick
>
>
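
As a rough aid for the hard-mount check discussed above (run "nfsstat -m" on the clients and look for "hard"), here is a minimal sketch that scans saved "nfsstat -m" output for mounts that still use "soft" or "intr". It assumes the output comes as pairs of lines, a "server:/path on /mountpoint" line followed by one comma-separated option line; that layout, the function name risky_mounts and the sample text are illustrative assumptions, not something taken from this thread.

```python
def risky_mounts(nfsstat_output):
    """Map each mount line to the soft/intr options found on its option line.

    Assumption: non-empty lines come in pairs, a mount description line
    followed by a comma-separated option line.
    """
    risky = {}
    mount_line = None
    for raw in nfsstat_output.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank separators between mounts
        if mount_line is None:
            mount_line = line  # first line of a pair: the mount description
            continue
        # second line of a pair: the option list for that mount
        flagged = [opt for opt in line.split(",") if opt in ("soft", "intr")]
        if flagged:
            risky[mount_line] = flagged
        mount_line = None
    return risky


# Hypothetical sample, not real output from the machines in this thread.
sample = (
    "server:/export/home on /home\n"
    "nfsv4,minorversion=1,tcp,resvport,soft,intr,cto,sec=sys\n"
    "server:/export/src on /usr/src\n"
    "nfsv4,minorversion=1,tcp,resvport,hard,cto,sec=sys\n"
)

for mount, opts in risky_mounts(sample).items():
    print(mount, "uses", ", ".join(opts))
```

Mounts that show up in the result would be candidates for switching back to hard mounts before collecting further "ps axHl" and "procstat -kk" dumps.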