From: Ollie Cook
To: freebsd-net@freebsd.org
Date: Thu, 8 Jan 2004 17:51:09 +0000
Subject: NFS server not responding / alive again
Message-ID: <20040108175109.GE70042@mutare.noc.clara.net>
X-Operating-System: FreeBSD 4.9-STABLE i386

Good evening,

I am seeking advice on some errors I am seeing in the logs of the
machines in a mail cluster I am responsible for. The errors do not seem
to be having any operational impact, but I would like to investigate
the source of the warnings in any case.

The log messages in question are of the form:

    Jan 8 17:04:51 metis /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: not responding
    Jan 8 17:04:53 metis /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: is alive again

These messages are logged fairly frequently, with a new pair appearing
every few seconds or so.
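To put numbers on how often each client reports a stall and how long it lasts, the "not responding" / "is alive again" pairs could be matched up along the following lines. This is only a rough sketch: the function names and the SAMPLE constant are hypothetical, the year is assumed (syslog timestamps do not carry one), and month names are parsed as in an English/C locale.

```python
import re
from datetime import datetime

# Matches the two kernel messages quoted above, in classic syslog layout.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+/kernel:\s+"
    r"nfs server (?P<server>\S+): (?P<event>not responding|is alive again)\s*$"
)


def parse_ts(text, year=2004):
    """Parse a syslog timestamp; the year is an assumption, not in the log."""
    normalised = " ".join(text.split())  # collapse the padding before the day
    return datetime.strptime(f"{year} {normalised}", "%Y %b %d %H:%M:%S")


def stall_durations(lines, year=2004):
    """Return (host, server, seconds) for each not-responding/alive pair."""
    pending = {}  # (host, server) -> time of first unanswered warning
    stalls = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # unrelated log line
        key = (m["host"], m["server"])
        when = parse_ts(m["ts"], year)
        if m["event"] == "not responding":
            pending.setdefault(key, when)
        elif key in pending:
            start = pending.pop(key)
            stalls.append((key[0], key[1], (when - start).total_seconds()))
    return stalls


# The two lines quoted in this message, used as a small self-check.
SAMPLE = [
    "Jan  8 17:04:51 metis /kernel: nfs server "
    "192.168.1.1:/vol/vol1/claramail: not responding",
    "Jan  8 17:04:53 metis /kernel: nfs server "
    "192.168.1.1:/vol/vol1/claramail: is alive again",
]
```

Feeding it the lines of /var/log/messages from each client (for example `stall_durations(open("/var/log/messages"))`) would give a per-host list of stall lengths that can be compared across the UDP and TCP clients.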

The mail cluster consists of ten i386 hosts running a variety of
FreeBSD versions from 4.5-STABLE to 4.9-STABLE. The NFS server is a
Network Appliance F825 filer running Data ONTAP 6.4.1. The remote
volume is just shy of 1TB in size.

Four of the hosts run message delivery software and perform mostly
writes to the remotely mounted volume. The remaining six run POP or
webmail software and perform mostly reads from the volume.

Seven of the hosts are on the same local LAN and mount the volume as
NFSv3 over UDP. The remaining three hosts are in a remote datacentre
and mount the volume over TCP.

All but two of the hosts log these error conditions. The two that do
not are both local hosts which mount the volume over UDP.

The four delivery hosts each do up to 250 NFS operations per second
(avg 120), while the POP hosts each do up to 750 NFS operations per
second (avg 500). The total number of NFS operations the filer handles
is up to 7000 per second (avg 3500).

As far as I can tell, the appearance of these log lines in
/var/log/messages does not correlate with the type of NFS activity, the
OS revision on the individual host, the number of NFS operations per
client, or the NFS transport.

If this were an NFS server performance issue, I'd expect it to affect
all the NFS clients, but this isn't the case.

We also run a second, smaller cluster for another vISP, with the same
architecture and software but fewer hosts, which doesn't exhibit this
problem. It has two delivery hosts and two POP/webmail hosts, which
generate a maximum of around 1200 NFS operations altogether.

Other posts I have seen on this subject have suggested checking for
local network problems, exhausted mbufs etc., but I don't believe this
to be the cause.

From one client (one of the TCP ones):

ollie@mese:[ollie] (1) # netstat -i
Name  Mtu   Network      Address                Ipkts Ierrs      Opkts Oerrs  Coll
fxp0  1500               00:50:8b:e0:5b:85 1427046704     0 1406319153     1     0
fxp0  1500  192.168.1/24 mese              1236173208     - 1406332198     -     -
fxp0  1500  pop1.mail/32 pop1.mail          188200155     -         58     -     -
fxp1* 1500               00:50:8b:e0:5b:3e          0     0          0     0     0
lo0   16384                                      3589     0       3589     0     0
lo0   16384 your-net     localhost               1129     -       1129     -     -

ollie@mese:[ollie] (2) # netstat -m
544/1632/34816 mbufs in use (current/peak/max):
        439 mbufs allocated to data
        105 mbufs allocated to packet headers
370/878/8704 mbuf clusters in use (current/peak/max)
2164 Kbytes allocated to network (8% of mb_map in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

The network interfaces on the clients and servers all operate at Fast
Ethernet speeds in full-duplex, and none is close to being saturated.
The NetApp filer does about 25Mbit/s at peak.

Should these log lines concern me or am I worrying unnecessarily?

Has anyone else experienced any similar behaviour between FreeBSD
clients and NetApp filers?

I am at a loss for how to further investigate this NFS issue, and would
be glad to receive any advice in that direction.

Yours,

Ollie

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie@uk.clara.net               020 7903 3065