Date: Tue, 23 Feb 1999 19:32:47 -0600 (CST)
From: Kevin Day <toasty@home.dragondata.com>
To: tlambert@primenet.com (Terry Lambert)
Cc: hackers@FreeBSD.ORG
Subject: Re: ESTALE the best approach?
Message-ID: <199902240132.TAA25796@home.dragondata.com>
In-Reply-To: <199902240055.RAA22403@usr09.primenet.com> from Terry Lambert at "Feb 24, 1999 0:55:51 am"
> > Forgetting standards and past practices, is ESTALE a good approach to
> > dealing with an NFS outage/reboot/whatever?
> >
> > Very few programs know how to deal with ESTALE, and I really have yet
> > to see one that knows how to recover from it.
>
> Programs have to be able to deal with errors.  They either have
> specific per-error strategies, or they have an "all other errors"
> strategy.

The problem I've seen is that some poorly designed software enumerates
the errors it thinks are fatal, and retries everything else.

> > 28591 eggdrop  0.000037 CALL  read(0x6,0xd6800,0x400)
> > 28591 eggdrop  0.000379 RET   read -1 errno 70 Stale NFS file handle
> > 28591 eggdrop  0.000037 CALL  read(0x6,0xd6800,0x400)
> > 28591 eggdrop  0.000379 RET   read -1 errno 70 Stale NFS file handle
>
> Whatever is calling this is not checking the return value for read.

Here, for example, the executable checks for a few errors; if an error
doesn't match what it thinks is fatal, it retries.  I'd think that good
programming practice should be the reverse: retry the errors you know to
be temporary, and treat everything else (including errors you don't even
know about) as fatal.

> This is acceptable for tty's, since the tty is supposed to guarantee
> a signal to the process group leader, which is then sent to each
> process in the group as the group leader goes away, leaving each
> process a group leader (and thus subject to signal requirements).
>
> Some people will argue against this, using POSIX as evidence, but
> POSIX is a documentation of existing System V derived UNIX practice,
> so those people are arguably wrong because That's What SVR4 Does.

I'm not even going to touch this argument waiting to happen. :)

> > 28591 eggdrop  0.000381 RET   read -1 errno 70 Stale NFS file handle
> >
> > Because they don't realize that ESTALE is a fatal condition, or lots
> > of programs tend to just go berserk at having an FD closed on them...
>
> It's not so much that they don't know how to handle ESTALE, it's
> more that they are (evilly) ignoring the -1 return from read.

Looking at the read(2) man page, the only error that can happen on a
regular file that isn't the result of a programming error is EIO, which
would probably mean a hardware failure.  (And once the hardware fails, I
consider any application's behavior 'undefined', at least on the PC
architecture.)  I can see why people aren't checking for errors on
read(2), really...  Even lots of FreeBSD's own utilities don't. :)
(However, most don't retry ad nauseam in response to an error.)
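To make the distinction concrete, here's a minimal sketch (mine, not
taken from eggdrop or any other program mentioned here) of the policy
I'd want: retry only errors known to be transient, and treat anything
else -- including errno values the program has never heard of, like
ESTALE -- as fatal:

	#include <sys/types.h>
	#include <errno.h>
	#include <unistd.h>

	/*
	 * Read up to 'len' bytes, retrying only errors known to be
	 * transient.  Anything else (EIO, ESTALE, errors that don't
	 * even exist yet) is passed up to the caller as fatal.
	 */
	ssize_t
	careful_read(int fd, void *buf, size_t len)
	{
		ssize_t n;

		for (;;) {
			n = read(fd, buf, len);
			if (n >= 0)
				return (n);	/* success, or EOF if 0 */
			if (errno == EINTR)	/* interrupted by a signal */
				continue;	/* known transient: retry */
			return (-1);		/* unknown: fatal, give up */
		}
	}

(On a non-blocking descriptor you'd hand EAGAIN back to the caller's
select() loop rather than spin on it, but the principle is the same: the
retry list is explicit, and everything else fails.)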
> > I've been experimenting here with making any ESTALE return something
> > other than ESTALE, to see what happens.
> >
> > EBADF was nearly as bad, as most programs that couldn't deal with
> > ESTALE probably didn't expect a fd that they had already opened to be
> > suddenly closed.
>
> Plus it was never closed.  The vnode is still hanging out.  The
> EBADF'ed descriptor is unrecoverable, since only a stupid program
> would close an already closed fd.  So the struct file * pointing to
> the vnode is also still hanging out.

Eeep, I hadn't thought of that.  I guess I realize that has to be true;
otherwise you'd be getting EBADF.

> > My next step is going to be to make nfsrv_fhtovp(?) actually kill the
> > process instead of returning anything, in a final attempt to fix this
> > locally.  Is there some justification for treating ESTALE like a
> > transient error anyway?  Did some implementation somewhere eventually
> > restore things?
>
> Yes.
>
> The main problem is that the ESTALE does not trigger a remount
> attempt, like it should, and a subsequent cleanup of the mount-specific
> portions of the outstanding NFS nodes.
>
> Generally, the remount attempt is signalled by statd.
>
> The ESTALE is actually a result of the mount going south, and not
> getting reset like it's supposed to be.  In theory, it should never
> reach user space, unless the remount attempts are time- or
> attempt-count-constrained by mount options.
>
> The only way you get a stale node is a server reboot, and that's a
> result of the "security" cruft that was not implemented at the link
> layer like it should have been (packet sequence randomization, etc.).

I'm actually seeing this on a network that has a 100Mbit Ethernet
segment exclusively for NFS.  I'll see this:

nfs server home.internal:/home: not responding
nfs server home.internal:/home: is alive again

in my syslog, with the timestamps being identical, and chances are that
one or two processes go berserk afterwards.  I've mucked with the
dynamic retry estimator and other settings, and it still happens.  I'm
pretty sure the NFS server is responding just fine, because other NFS
clients don't report the error at the same time.

I realize I'm fixing the symptom, not the problem, but fixing NFS is not
something I have the time for. :)

(For reference, the server is 2.2.5 and the clients are 2.2.8 and 4.0...
If I could get the 4.0 server to stay up long enough without getting
deadlocked, leaving all the clients stuck in 'inode', I could tell you
whether the problem is in 4.0 as well.)

Kevin
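P.S. Having been down this road, the only user-space "recovery" from
ESTALE I can think of is to close the descriptor and re-open the file by
its pathname, since it's the cached NFS file handle (not the name) that
went stale.  A hedged sketch -- assuming the program still knows the
path and can live with losing its file offset:

	#include <sys/types.h>
	#include <errno.h>
	#include <fcntl.h>
	#include <unistd.h>

	/*
	 * If a read fails with ESTALE, re-open the file by path (which
	 * fetches a fresh file handle from the server) and retry once.
	 * The descriptor in *fdp may be replaced.
	 */
	ssize_t
	read_reopen(int *fdp, const char *path, void *buf, size_t len)
	{
		ssize_t n;
		int newfd;

		n = read(*fdp, buf, len);
		if (n >= 0 || errno != ESTALE)
			return (n);

		newfd = open(path, O_RDONLY);
		if (newfd == -1)
			return (-1);		/* gone for good: fatal */
		(void)close(*fdp);		/* discard the stale fd */
		*fdp = newfd;
		return (read(*fdp, buf, len));	/* offset restarts at 0 */
	}

Of course, that only works for programs that keep the pathname around,
which is exactly what most of them don't do.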