From owner-freebsd-hackers  Sun Mar  1 15:43:37 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id PAA25121
          for freebsd-hackers-outgoing; Sun, 1 Mar 1998 15:43:37 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id PAA25114;
          Sun, 1 Mar 1998 15:43:30 -0800 (PST)
          (envelope-from tlambert@usr08.primenet.com)
Received: (from daemon@localhost)
	by smtp02.primenet.com (8.8.8/8.8.8) id QAA11182;
	Sun, 1 Mar 1998 16:27:10 -0700 (MST)
Received: from usr08.primenet.com(206.165.6.208)
 via SMTP by smtp02.primenet.com, id smtpd011125; Sun Mar  1 16:26:59 1998
Received: (from tlambert@localhost)
	by usr08.primenet.com (8.8.5/8.8.5) id QAA04854;
	Sun, 1 Mar 1998 16:26:53 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199803012326.QAA04854@usr08.primenet.com>
Subject: Re: help - make world fails
To: dyson@FreeBSD.ORG
Date: Sun, 1 Mar 1998 23:26:53 +0000 (GMT)
Cc: nrice@emu.sourcee.com, karl@mcs.net, jb@cimlogic.com.au, joe@via.net,
        hackers@FreeBSD.ORG
In-Reply-To: <199803011531.KAA02458@dyson.iquest.net> from "John S. Dyson" at Mar 1, 98 10:31:21 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > > I think that the system is very close to stable again, with the
> > > NFS caveat.  Once I can solve the (very reproduceable) problem,
> > > I will be much happier with NFS.  There are also some outstanding
> > > bugfixes for NFS, which I am working with in my local tree...
> > 
> > Would any of those outstanding ``bug fixes'' resolve the issue with
> > NFS client freezing the system when the server is non-responsive?
>
> Not yet.  I am working on things that are *more* severe than that
> right now.  Not discounting the above problem though as not being
> severe.

IMO, this is a problem in the RPC state machine not being sensitive
to remote resets in the middle of an operation.

Basically, an RPC call is made, your request is ack'ed or nak'ed,
and if it was ack'ed, you go into a state from which you can only
emerge with more data from the server.
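
A minimal sketch of the state machine as described above, to make the
stuck state concrete.  The state and event names are hypothetical and
are not taken from the actual RPC code; the only point is that nothing
but "data" leads out of the ack'ed state.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()        # no call outstanding
    SENT = auto()        # request sent, waiting for ack or nak
    WAIT_REPLY = auto()  # ack'ed, waiting for data from the server
    DONE = auto()        # reply received


# Hypothetical transition table.  Note that WAIT_REPLY has exactly one
# way out: more data from the server.
TRANSITIONS = {
    (State.IDLE, "send"): State.SENT,
    (State.SENT, "nak"): State.IDLE,
    (State.SENT, "ack"): State.WAIT_REPLY,
    (State.WAIT_REPLY, "data"): State.DONE,
}


def step(state, event):
    """Advance one event; events with no transition are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.IDLE):
    """Feed a sequence of events through the state machine."""
    for event in events:
        state = step(state, event)
    return state
```

Feeding this "send", "ack", and then a server reset or a timer tick
leaves the call in WAIT_REPLY, which is the freeze.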

Probably this needs to time out back to a retry, as if you had not
been ack'ed.  I have not looked very deeply into what this would
mean in terms of needing to unwind state, in the case that the
original request could no longer be validly served (e.g.: open/unlink
an NFS file (results in a rename) and continue to do I/O).
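
A rough sketch of that timeout, under the assumption that the call is
a small table-driven state machine.  All names here are made up for
illustration, and the sketch deliberately ignores the state-unwinding
question raised above.

```python
def next_state(state, event):
    """Return the next state of a hypothetical RPC call."""
    table = {
        ("IDLE", "send"): "SENT",
        ("SENT", "nak"): "IDLE",
        ("SENT", "ack"): "WAIT_REPLY",
        ("SENT", "timeout"): "IDLE",
        ("WAIT_REPLY", "data"): "DONE",
        # Proposed addition: a timeout after the ack is treated as if
        # the request had never been ack'ed, so the call is retried.
        ("WAIT_REPLY", "timeout"): "IDLE",
    }
    return table.get((state, event), state)


def drive(events):
    """Run a call to completion, (re)sending whenever it is idle.

    Returns the final state and the number of times the request was
    sent.
    """
    state, sends = "IDLE", 0
    for event in events:
        if state == "IDLE":
            state = next_state(state, "send")
            sends += 1
        state = next_state(state, event)
    return state, sends
```

With the events "ack", "timeout", "ack", "data" the request goes out
twice and the call completes instead of hanging.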

One thing that would help is server signalling.  This is basically
the job of rpc.statd.  With it, the operation could be retried before
the timeout expires.

One real pain is that for a long-delay link (satellite, Sprint (;-)),
etc.), if you were to restart the call that was ACK'ed and wait for
another ACK, you would have to accept a response-without-ACK to
make yourself robust.  If the op was a "delete file" or whatever,
it's not idempotent: unlike a block write, you can't replay the
event with no ill effect.
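
A toy illustration of both halves of that, with everything in it
hypothetical: a pretend server on which a replayed block write is
harmless but a replayed remove is not, and a client receive routine
that accepts a reply even while it is still waiting for the new ACK.
It assumes the restarted call reuses the original transaction id
(xid), which is an assumption of the sketch and not a statement about
the existing code.

```python
class ToyServer:
    """Stand-in for an NFS server; not a real API."""

    def __init__(self):
        self.files = {"a": bytearray(8)}

    def write_block(self, name, offset, data):
        # Idempotent: replaying it leaves the file in the same state.
        self.files[name][offset:offset + len(data)] = data
        return "ok"

    def remove(self, name):
        # Not idempotent: the replay sees the effect of the first try.
        if name not in self.files:
            return "ENOENT"
        del self.files[name]
        return "ok"


def accept(state, kind, msg_xid, call_xid):
    """Client-side receive for a restarted call (hypothetical states)."""
    if msg_xid != call_xid:
        return state              # belongs to some other call
    if kind == "reply" and state in ("SENT", "WAIT_REPLY"):
        return "DONE"             # response-without-ACK is accepted
    if kind == "ack" and state == "SENT":
        return "WAIT_REPLY"
    return state
```

Replaying write_block() twice gives "ok" both times and the same file
contents; replaying remove() gives "ok" and then "ENOENT", which is
why the client would rather take the late reply to the original call
than trust the answer to the replay.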


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
