From owner-freebsd-hackers Sat Feb 20 23:37:47 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from home.dragondata.com (home.dragondata.com [204.137.237.2]) by hub.freebsd.org (Postfix) with ESMTP id DDA9D10E5A for ; Sat, 20 Feb 1999 23:37:44 -0800 (PST) (envelope-from toasty@home.dragondata.com)
Received: (from toasty@localhost) by home.dragondata.com (8.9.2/8.9.2) id BAA21850 for hackers@freebsd.org; Sun, 21 Feb 1999 01:37:43 -0600 (CST)
From: Kevin Day
Message-Id: <199902210737.BAA21850@home.dragondata.com>
Subject: ESTALE the best approach?
To: hackers@freebsd.org
Date: Sun, 21 Feb 1999 01:37:42 -0600 (CST)
X-Mailer: ELM [version 2.4ME+ PL43 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Forgetting standards and past practices, is ESTALE a good approach to dealing with an NFS outage/reboot/whatever? Very few programs know how to deal with ESTALE, and I really have yet to see one that knows how to recover from it.
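
For illustration only, here is a minimal sketch of what "dealing with ESTALE" could look like in a read loop: retry only when the call was interrupted, and give up on everything else, ESTALE included. The helper names read_error_is_transient and read_or_give_up are hypothetical and are not taken from any real program; the choice of EINTR as the only retryable error is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: decide whether a failed read() is worth retrying.
 * Assumption: only EINTR is transient. ESTALE, EBADF, EIO and the rest
 * are treated as fatal for this descriptor.
 */
static bool read_error_is_transient(int err)
{
    return err == EINTR;
}

/*
 * Hypothetical helper: read up to len bytes from fd.
 * Returns the byte count (0 at end of file), or -1 with errno left
 * untouched so the caller can see exactly why the read failed.
 * A non-transient error such as ESTALE ends the loop immediately
 * instead of being retried forever.
 */
static ssize_t read_or_give_up(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);

        if (n >= 0)
            return n;               /* data or EOF */
        if (!read_error_is_transient(errno))
            return -1;              /* fatal: stop, do not spin */
        /* EINTR: the call was interrupted, so try again */
    }
}
```

A caller that gets -1 with errno == ESTALE would then close the descriptor and either exit or try to reopen the file by path. Whether a reopen succeeds depends on whether the file still exists on the server.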

I've come up to a machine with a load average of 20+, with lots of processes spinning, doing something like:

 28591 eggdrop 0.000037 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.000379 RET read -1 errno 70 Stale NFS file handle
 28591 eggdrop 0.000036 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.000378 RET read -1 errno 70 Stale NFS file handle
 28591 eggdrop 0.000038 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.000781 RET read -1 errno 70 Stale NFS file handle
 28591 eggdrop 0.000044 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.000363 RET read -1 errno 70 Stale NFS file handle
 28591 eggdrop 0.000035 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.000383 RET read -1 errno 70 Stale NFS file handle
 28591 eggdrop 0.000039 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.000432 RET read -1 errno 70 Stale NFS file handle
 28591 eggdrop 0.000034 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.000510 RET read -1 errno 70 Stale NFS file handle
 28591 eggdrop 0.000029 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.001082 RET read -1 errno 70 Stale NFS file handle
 28591 eggdrop 0.000037 CALL read(0x6,0xd6800,0x400)
 28591 eggdrop 0.000381 RET read -1 errno 70 Stale NFS file handle

Either they do not realize that ESTALE is a fatal condition, or lots of programs just tend to go berserk at having an FD closed on them...

I've been experimenting here with making any ESTALE return something other than ESTALE, to see what happens.

EBADF was nearly as bad, as most programs that couldn't deal with ESTALE probably didn't expect an fd that they had already opened to be suddenly closed.

EINVAL seemed to make most programs die on their own, but not all. Some also left some very cryptic/wrong diagnostics behind.

ENOSPC mostly stopped programs that were stuck in write(), but obviously not if they were read()'ing like above.

EIO was probably the best of the bunch for just getting the program to stop freaking out, but even then, some programs didn't check for it.

My next step is going to be to make nfsrv_fhtovp(?)
actually kill the process instead of returning anything, in a final attempt to fix this, locally.

Is there some justification for treating ESTALE like a transient error anyway? Did some implementation somewhere eventually restore things?

Anyone have any comments on this?

I've found that at least these programs cannot deal with ESTALE in some manner:

 cron apache httpd eggdrop afio bnc ircii bitchx

Kevin