From owner-freebsd-hackers  Sat Feb 20 23:37:47 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from home.dragondata.com (home.dragondata.com [204.137.237.2])
	by hub.freebsd.org (Postfix) with ESMTP id DDA9D10E5A
	for <hackers@freebsd.org>; Sat, 20 Feb 1999 23:37:44 -0800 (PST)
	(envelope-from toasty@home.dragondata.com)
Received: (from toasty@localhost)
	by home.dragondata.com (8.9.2/8.9.2) id BAA21850
	for hackers@freebsd.org; Sun, 21 Feb 1999 01:37:43 -0600 (CST)
From: Kevin Day <toasty@home.dragondata.com>
Message-Id: <199902210737.BAA21850@home.dragondata.com>
Subject: ESTALE the best approach?
To: hackers@freebsd.org
Date: Sun, 21 Feb 1999 01:37:42 -0600 (CST)
X-Mailer: ELM [version 2.4ME+ PL43 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


Forgetting standards and past practices, is ESTALE a good approach to
dealing with an NFS outage/reboot/whatever?

Very few programs know how to deal with ESTALE, and I really have yet to see
one that knows how to recover from it.

I've come up to a machine with a load average of 20+, with lots of processes
spinning, doing something like:

 28591 eggdrop  0.000037 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.000379 RET   read -1 errno 70 Stale NFS file handle
 28591 eggdrop  0.000036 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.000378 RET   read -1 errno 70 Stale NFS file handle
 28591 eggdrop  0.000038 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.000781 RET   read -1 errno 70 Stale NFS file handle
 28591 eggdrop  0.000044 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.000363 RET   read -1 errno 70 Stale NFS file handle
 28591 eggdrop  0.000035 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.000383 RET   read -1 errno 70 Stale NFS file handle
 28591 eggdrop  0.000039 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.000432 RET   read -1 errno 70 Stale NFS file handle
 28591 eggdrop  0.000034 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.000510 RET   read -1 errno 70 Stale NFS file handle
 28591 eggdrop  0.000029 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.001082 RET   read -1 errno 70 Stale NFS file handle
 28591 eggdrop  0.000037 CALL  read(0x6,0xd6800,0x400)
 28591 eggdrop  0.000381 RET   read -1 errno 70 Stale NFS file handle


This seems to happen either because they do not realize that ESTALE is a
fatal condition, or because lots of programs just go berserk at having an
fd closed on them...
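
For illustration, here is a rough sketch of what I'd expect a well-behaved
read loop to look like. It retries only on EINTR and hands everything else
(ESTALE included) back to the caller as fatal. The helper names
read_error_is_transient() and read_or_give_up() are made up for this
example; none of the programs listed in this message necessarily work
this way.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: 1 if a failed read() is worth retrying, else 0.
 * Only EINTR is treated as transient here; ESTALE, EIO, EBADF are not. */
static int read_error_is_transient(int err)
{
    return err == EINTR;
}

/*
 * Hypothetical helper: read up to len bytes from fd.
 * Returns the byte count, 0 on EOF, or -1 with errno set once a
 * non-transient error shows up, so the caller can close the fd and
 * give up instead of spinning on the same failing read().
 */
static ssize_t read_or_give_up(int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;               /* data or EOF */
        if (!read_error_is_transient(errno))
            return -1;              /* fatal: errno is left for the caller */
        /* interrupted by a signal: try the read again */
    }
}
```

A caller would check for -1, log strerror(errno), close the descriptor and
either reopen the file by path or exit.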

I've been experimenting here with making any ESTALE return something other
than ESTALE, to see what happens.

EBADF was nearly as bad, as most programs that couldn't deal with ESTALE
probably didn't expect a fd that they had already opened to be suddenly
closed.

EINVAL seemed to make most programs die on their own, but not all. Some also
left some very cryptic/wrong diagnostics behind.

ENOSPC mostly stopped programs that were stuck in write(), but obviously not
the ones that were read()'ing like above.

EIO was probably the best of the bunch, for just getting the program to stop
freaking out, but even then, some programs didn't check for it.
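
To make the experiment concrete, here is a user-space sketch of the same
idea: a wrapper that swaps ESTALE for some other errno before the program
sees it. This is only an analogy, assuming the substitution is done around
read(); my actual change is in the kernel, and the names
remap_stale_errno() and read_remapped() are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace ESTALE with substitute, pass others through. */
static int remap_stale_errno(int err, int substitute)
{
    return (err == ESTALE) ? substitute : err;
}

/*
 * Hypothetical wrapper: behaves like read(), except that a failure with
 * ESTALE is reported to the caller as `substitute` (for example EIO).
 */
static ssize_t read_remapped(int fd, void *buf, size_t len, int substitute)
{
    assert(buf != NULL);
    ssize_t n = read(fd, buf, len);
    if (n < 0)
        errno = remap_stale_errno(errno, substitute);
    return n;
}
```

Calling read_remapped(fd, buf, len, EIO) corresponds to the EIO variant
above; passing EBADF, EINVAL or ENOSPC corresponds to the other ones.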


My next step is going to be to make nfsrv_fhtovp(?) actually kill the
process instead of returning anything, in a final attempt to fix this,
locally. Is there some justification for treating ESTALE like a transient
error anyway? Did some implementation somewhere eventually restore things?


Anyone have any comments on this? 


I've found that at least these programs fail to handle ESTALE in one way or
another:

cron
apache httpd
eggdrop
afio
bnc
ircii
bitchx


Kevin

