From owner-freebsd-net@freebsd.org  Fri Jan 13 23:39:27 2017
Return-Path:
Delivered-To: freebsd-net@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
	by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5D6FCCAE417
	for ; Fri, 13 Jan 2017 23:39:27 +0000 (UTC)
	(envelope-from slw@zxy.spb.ru)
Received: from zxy.spb.ru (zxy.spb.ru [195.70.199.98])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client did not present a certificate)
	by mx1.freebsd.org (Postfix) with ESMTPS id 1B16E1059
	for ; Fri, 13 Jan 2017 23:39:27 +0000 (UTC)
	(envelope-from slw@zxy.spb.ru)
Received: from slw by zxy.spb.ru with local (Exim 4.86 (FreeBSD))
	(envelope-from ) id 1cSBR8-0006ye-EJ;
	Sat, 14 Jan 2017 02:39:10 +0300
Date: Sat, 14 Jan 2017 02:39:10 +0300
From: Slawa Olhovchenkov
To: Rick Macklem
Cc: Eugene Grosbein, Michael Sinatra, "freebsd-net@freebsd.org"
Subject: Re: NFSv4 stuck
Message-ID: <20170113233910.GQ30374@zxy.spb.ru>
References: <20170111220818.GD30374@zxy.spb.ru>
	<20170111225922.GE30374@zxy.spb.ru>
	<20170111235020.GF30374@zxy.spb.ru>
	<58771EA6.1020104@grosbein.net>
	<20170112131504.GG30374@zxy.spb.ru>
	<20170112232016.GM30374@zxy.spb.ru>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.24 (2015-08-30)
X-SA-Exim-Connect-IP:
X-SA-Exim-Mail-From: slw@zxy.spb.ru
X-SA-Exim-Scanned: No (on zxy.spb.ru); SAEximRunCond expanded to false
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
X-List-Received-Date: Fri, 13 Jan 2017 23:39:27 -0000

On Fri, Jan 13, 2017 at 11:02:22PM +0000, Rick Macklem wrote:
> Slawa Olhovchenkov wrote:
> [stuff snipped]
> >> >
> >> > What data? In my case, no data.
> You have a file system with no files in it. (It is file data I am referring to.)
> Admittedly a read-only file system won't get corrupted, but you will still have trouble
> reading files, since NFSv4 requires that they be Open'd before
> reading.

That is no problem, if the application can be interrupted and the fs can be unmounted.
(Ideally, disk I/O should also be interruptible. But POSIX and K&R...)

> >> Certain NFSv4 operations (such as open and byte range locking) are strictly ordered using a
> >> seqid#. If you fail an RPC in progress (via a soft timeout or intr via a signal) then this seqid gets
> >> out of sync between client and server and your mount is badly broken.
> >
> > Can the mount be dropped? An automatic forced unmount?
> > Or can the application be killed manually, for a manual unmount?
> > That would be perfect for me. It would be better than the current behavior.
> Well, since recently written data could be lost, I can't see this ever being automatic.

The data is lost in any case.

> The manual "umount -f " should work, but only if a "umount " has
> not already been done. (The latter gets stuck in the kernel, usually after locking the mounted-on
> vnode, and that blocks the subsequent "umount -f ".)

I think "umount " may be called by automountd; in the other cases "umount -f " doesn't work either.

> Someday, I plan on adding a new option to "umount" that goes directly to NFS (via the nfssvc(2)
> syscall) to force a dismount, but I haven't gotten around to doing it.
>
> Until then, it's "umount -f" or reboot.

Reboot doesn't work either, only a power reset. This is sad.

> And please don't use "soft,intr" options, they won't usually
> help and will break the mount for opening files sooner or later.

Broken file opens are better than a locked-up system.
For NFSv3 this does help: it allows killing the application and recovering.

> >> I do not believe this caused your hang though, since processes were sleeping on rpccon, which
> >> means they were trying to do a new TCP connection to the server unsuccessfully.
> >> - Which normally indicates a problem with your underlying network fabric.
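[For readers of the archive: the following example is mine, not from the thread; the server name and paths are hypothetical. The mount options under discussion would appear in /etc/fstab roughly like this:]

```
# Hypothetical /etc/fstab entry for an NFSv4 mount.
# Per Rick's advice, "soft" and "intr" are deliberately absent:
# interrupting an in-progress RPC can put the open/lock seqid out of
# sync between client and server and break the mount for later opens.
nfssrv:/export  /backup  nfs  rw,nfsv4,tcp  0  0
```

The recovery path discussed above is then the forced dismount, "umount -f /backup", which only helps if a plain "umount /backup" has not already gotten stuck on the mounted-on vnode.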
> >
> > The network can fail at any time.
> > This should not cause a blockage of the system.
> Would you expect a local filesystem to keep working when the JBOD interface to a drive is broken?
> For NFS, a broken network means "can't talk to the file system" just like a broken JBOD to a file
> system's drive would mean this.

A broken JBOD doesn't cause a kernel panic nowadays.
On RT-11, a broken JBOD didn't cause any problem, and RT-11 allowed the application to cancel disk I/O.

> For NFS to work well, you want the most reliable network fabric
> possible.

Any reliable network can fail.

> Once the network is fixed, it should again be possible for the mount to work.
> (The processes in "rpccon" are trying to create a new TCP connection and when they succeed
> the mount point should again start working.)

Hm. This may be a misunderstanding.
At the time of any unmount/interrupt/etc. attempt, the network is stable and there are no issues.
The issues may have happened in the past, and after that NFS got stuck.
(Or maybe there were no issues; I am not sure.)
And there is no way to recover.
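[Diagnostic aside, again my addition rather than part of the thread: the wait channel Rick refers to can be inspected from the shell, since ps(1) can print each process's kernel wait channel:]

```shell
# Print PID, kernel wait channel, and command name for all processes.
# Threads stuck re-establishing the NFS TCP connection to the server
# sleep on the "rpccon" wait channel named in the thread above.
ps -ax -o pid,wchan,comm | awk 'NR == 1 || $2 == "rpccon"'
```

On a healthy client this prints only the header line; any entries sleeping on rpccon indicate the client is still failing to reconnect to the server.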