From: Ollie Cook
To: freebsd-stable@freebsd.org
Date: Tue, 13 Jan 2004 15:49:32 +0000
Subject: nfs send errors 32 and 35 on RELENG_4
Message-ID: <20040113154932.GE354@mutare.noc.clara.net>
X-Operating-System: FreeBSD 4.9-STABLE i386

Hi,

For a while I have been seeing errors of this nature on a cluster of
i386 FreeBSD RELENG_4 hosts which mount a volume from a NetApp F825
filer using NFSv3 over a mixture of UDP and TCP, depending on whether
the host is on the same LAN as the filer or not:

Jan 13 14:02:02 mese /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: not responding
Jan 13 14:02:03 mese /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: is alive again

The messages are logged with alarming regularity, but don't seem to
actually have any bearing on the performance or availability of the
volume.

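To put a number on "alarming regularity", the two kernel messages could be paired up and the gap between them measured. The following is only a rough sketch: it assumes syslog lines in exactly the format quoted above, assumes a fixed year (syslog timestamps carry none), and the helper names `parse_ts` and `outages` are made up for illustration.

```python
import re
from datetime import datetime

# Matches the two kernel messages quoted above, one per syslog line.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ /kernel: "
    r"nfs server (?P<mount>\S+): (?P<event>not responding|is alive again)$"
)


def parse_ts(text, year=2004):
    # Syslog timestamps have no year; one is assumed so that
    # differences between timestamps can be computed.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def outages(lines, year=2004):
    """Return (mount, start, seconds) for each not-responding/alive pair."""
    down_since = {}  # mount -> time of the first unanswered "not responding"
    result = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not one of the two messages of interest
        when = parse_ts(m["ts"], year)
        mount = m["mount"]
        if m["event"] == "not responding":
            # Keep the earliest timestamp if the message repeats.
            down_since.setdefault(mount, when)
        elif mount in down_since:
            start = down_since.pop(mount)
            result.append((mount, start, (when - start).total_seconds()))
    return result
```

Feeding it the two lines quoted above yields a single entry of one second for 192.168.1.1:/vol/vol1/claramail. Run over a full day of /var/log/messages, the count and durations would show whether the flapping lines up with the send errors.
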
My full findings are in my initial post to freebsd-net, which has been
archived here:

http://www.freebsd.org/cgi/getmsg.cgi?fetch=178585+184466+/usr/local/www/db/text/2004/freebsd-net/20040111.freebsd-net

However, more recently, and especially today, I am seeing errors which
*are* affecting the availability of the mount point on one of the hosts
in question:

Jan 13 14:09:37 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:42 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:47 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:52 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:53 mese /kernel: nfs send error 32 for server 192.168.1.1:/vol/vol1/claramail

We are running version 1.60.2.6 of nfs_socket.c, which is the file that
generates this message. Looking at the CVS Web Repository, that seems
to be the latest version for RELENG_4.

A quick Google search suggests that error 32 is 'OK' in the sense that
the TCP connection should be reestablished and things can pick up where
they left off[1], but I can't find what causes error 35. 35 seems to be
the more abundant error, in any case.

The symptoms on the hosts when these errors occur are:

- processes accessing files on the remote volume get stuck in disk
  wait; specifically, their state is 'nfsrcv'.
- even when all processes accessing the volume are killed, and lsof
  shows no open files on the volume, "umount /vol" claims the device
  is busy.
- a "umount -f" hangs and the umount process can't be killed.
- however, after a "umount -f", /vol is not listed in "mount" or "df".
- similarly, on then trying to mount the volume, "mount" hangs and
  can't be killed, and the volume does not appear in "mount" or "df"
  (in fact, df hangs too, presumably as it's trying to work out
  available space etc.)
- a tcpdump between client and server doesn't show any NFS traffic at
  all being emitted by the client, although IP connectivity to the
  server is maintained, and other hosts are still able to talk NFS to
  it happily.

I tried to reboot the host in question to restore service, but it
stayed multi-user. The host was in a remote data centre so in the end
it had to be power cycled. The host wasn't on console so I wasn't able
to determine why it stayed multi-user.

I'm at a loss as to how to further debug this. It occurs to me that
determining what error 35 is would be helpful. :) I've looked in a book
that I have available[2], but it lists neither error 32 nor 35. Is
there an up-to-date list of NFSv3 errors anywhere?

At this stage, any and all advice on where to look and what data I can
usefully retrieve that would help analyse this problem would be
gratefully received.

Cheers,

Ollie

1: http://lists.freebsd.org/pipermail/freebsd-hackers/2003-July/001988.html
2: NFS Illustrated, Brent Callaghan, First Printing, ISBN 0-201-32570-5

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie@uk.clara.net               +44 20 7903 3065
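
A note on the error numbers: one possibility worth checking is that the N in "nfs send error N" is not an NFSv3 status code at all, but the errno returned by the kernel's socket send call. If that assumption holds, the values can be looked up in FreeBSD's <sys/errno.h>. The sketch below hard-codes the two relevant entries from that header, because other systems number errno values differently. The function name `explain` is made up for illustration.

```python
import re

# Values copied from FreeBSD's <sys/errno.h>. The local errno module is
# deliberately not used: numbering differs between systems (35 is EDEADLK
# on Linux, for instance).
FREEBSD_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    35: ("EAGAIN", "Resource temporarily unavailable"),
}

SEND_ERR_RE = re.compile(r"nfs send error (\d+) for server (\S+)")


def explain(line):
    """Annotate an 'nfs send error' syslog line, assuming N is an errno."""
    m = SEND_ERR_RE.search(line)
    if not m:
        return None  # not an nfs send error line
    code = int(m.group(1))
    name, text = FREEBSD_ERRNO.get(code, ("?", "not in this table"))
    return f"{m.group(2)}: errno {code} = {name} ({text})"
```

Under that reading, 32 would be a broken pipe on the TCP connection, and 35 would mean the send could not proceed without blocking. That at least suggests looking at socket buffer and mbuf usage on the client (for example with "netstat -m") the next time the errors appear.
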