From: Terry Lambert
Message-Id: <199803012326.QAA04854@usr08.primenet.com>
Subject: Re: help - make world fails
To: dyson@FreeBSD.ORG
Date: Sun, 1 Mar 1998 23:26:53 +0000 (GMT)
Cc: nrice@emu.sourcee.com, karl@mcs.net, jb@cimlogic.com.au, joe@via.net, hackers@FreeBSD.ORG
In-Reply-To: <199803011531.KAA02458@dyson.iquest.net> from "John S. Dyson" at Mar 1, 98 10:31:21 am

> > > I think that the system is very close to stable again, with the
> > > NFS caveat. Once I can solve the (very reproduceable) problem,
> > > I will be much happier with NFS. There are also some outstanding
> > > bugfixes for NFS, which I am working with in my local tree...
> >
> > Would any of those outstanding ``bug fixes'' resolve the issue with
> > NFS client freezing the system when the server is non-responsive?
>
> Not yet. I am working on things that are *more* severe than that
> right now. Not discounting the above problem though as not being
> severe.
IMO, this is a problem in the RPC state machine not being sensitive to
remote resets in the middle of an operation.

Basically, an RPC call is made, your request is ack'ed or nak'ed, and
if it was ack'ed, you go into a state from which you can only emerge
with more data from the server.

Probably this needs to time out back to a retry, as if you had not been
ack'ed. I have not looked very deeply into what this would mean in
terms of needing to unwind state in the case that the original request
could no longer be validly served (eg: open/unlink an NFS file (results
in a rename) and continue to do I/O).

One thing that would help is server-signalling. This is basically the
job of rpc.statd. If the client were told that the server had reset,
the operation could be retried before the timeout expired.

One real pain is that for a long delay link (eg: satellite, Sprint (;-)),
etc.), if you were to restart the call that was ACK'ed and wait for
another ACK, you would have to accept a response-without-ACK to make
yourself robust. If the OP was a "delete file" or whatever, it's not
idempotent: unlike a block write, you can't replay the event with no
ill effect.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
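
The timeout-back-to-retry idea in the message above could be sketched
roughly as follows. This is only a toy model of the state machine as
described in the post, not the FreeBSD NFS client or the real Sun RPC
code. The names RpcCall, State, on_event, the event strings, and the
max_retries limit are all hypothetical and chosen for illustration.

```python
from enum import Enum, auto


class State(Enum):
    SENT = auto()    # request transmitted, waiting for ack/nak
    ACKED = auto()   # request ack'ed, waiting for reply data from the server
    DONE = auto()    # reply received
    FAILED = auto()  # request nak'ed, or retries exhausted


class RpcCall:
    """Toy client-side state for one RPC call (hypothetical model)."""

    def __init__(self, op, max_retries=3):
        self.op = op                    # operation name, informational only
        self.max_retries = max_retries  # assumed retry limit
        self.retries = 0
        self.state = State.SENT
        self.reply = None

    def _retry(self):
        # Fall back to SENT as if the request had never been ack'ed.
        if self.retries >= self.max_retries:
            self.state = State.FAILED
        else:
            self.retries += 1
            self.state = State.SENT

    def on_event(self, event, data=None):
        if self.state in (State.DONE, State.FAILED):
            return self.state  # terminal states ignore late events
        if event == "ack" and self.state is State.SENT:
            self.state = State.ACKED
        elif event == "nak" and self.state is State.SENT:
            self.state = State.FAILED
        elif event == "reply" and (self.state is State.ACKED or self.retries > 0):
            # After a retry, accept a response-without-ACK: the server may
            # be answering the original, already ack'ed request.
            self.reply = data
            self.state = State.DONE
        elif event in ("timeout", "server_reset"):
            # A reset notification lets us retry before the timer fires.
            self._retry()
        return self.state
```

Feeding it "ack", then "timeout", then "reply" walks the call through
ACKED, back to SENT, and on to DONE without a second ack. The sketch
deliberately does nothing about non-idempotent operations; replaying a
"delete file" safely would need something more on the server side, which
is the hard part the post points at.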