From owner-freebsd-net@FreeBSD.ORG  Thu Jan  8 09:51:46 2004
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A284016A504
	for <freebsd-net@freebsd.org>; Thu,  8 Jan 2004 09:51:46 -0800 (PST)
Received: from mutare.noc.clara.net (mutare.noc.clara.net [195.8.70.95])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7C54043D6D
	for <freebsd-net@freebsd.org>; Thu,  8 Jan 2004 09:51:10 -0800 (PST)
	(envelope-from ollie@mutare.noc.clara.net)
Received: from ollie by mutare.noc.clara.net with local (Exim 4.24)
	id 1AeeJJ-000IoN-Cd
	for freebsd-net@freebsd.org; Thu, 08 Jan 2004 17:51:09 +0000
Date: Thu, 8 Jan 2004 17:51:09 +0000
From: Ollie Cook <ollie@uk.clara.net>
To: freebsd-net@freebsd.org
Message-ID: <20040108175109.GE70042@mutare.noc.clara.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.1i
X-Operating-System: FreeBSD 4.9-STABLE i386
X-NCC-RegID: uk.claranet
Sender: Ollie Cook <ollie@mutare.noc.clara.net>
X-Envelope-To: freebsd-net@freebsd.org
X-Clara-Scan: content scanned according to recipient preferences
Subject: NFS server not responding / alive again
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 08 Jan 2004 17:51:46 -0000

Good evening,

I am seeking advice on some messages I am seeing in the logs of the machines
in a mail cluster I am responsible for. They do not seem to be causing any
operational impact, but I would still like to track down their source.

The log messages in question are of the form:

Jan  8 17:04:51 metis /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: not responding
Jan  8 17:04:53 metis /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: is alive again

These messages are logged fairly frequently, with a new pair appearing every
few seconds or so.
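
To quantify how long each "not responding" episode lasts, the log pairs can be
matched up mechanically. The following is only a minimal sketch: the function
name outage_durations and the year argument are illustrative assumptions
(syslog timestamps carry no year), and the regular expression is written
against the two sample lines above rather than against every possible kernel
message format.

```python
import re
from datetime import datetime

# Matches the two kernel messages quoted above; other formats are ignored.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) /kernel: "
    r"nfs server (?P<server>\S+): (?P<event>not responding|is alive again)$"
)


def outage_durations(lines, year=2004):
    """Pair each 'not responding' line with the next 'is alive again' line
    for the same (host, server) and return (host, server, seconds) tuples.

    The year is an assumption supplied by the caller, because syslog
    timestamps do not include one.
    """
    pending = {}   # (host, server) -> time the outage was first logged
    results = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        when = datetime.strptime(
            f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S"
        )
        key = (match["host"], match["server"])
        if match["event"] == "not responding":
            # Keep the earliest unanswered report for this mount.
            pending.setdefault(key, when)
        elif key in pending:
            start = pending.pop(key)
            results.append((key[0], key[1], (when - start).total_seconds()))
    return results


sample = [
    "Jan  8 17:04:51 metis /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: not responding",
    "Jan  8 17:04:53 metis /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: is alive again",
]

for host, server, seconds in outage_durations(sample):
    print(f"{host} {server} unresponsive for {seconds:.0f}s")
```

Run over /var/log/messages from each client, this would give a per-host list of
episode lengths that could then be compared across hosts and transports.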

The mail cluster consists of ten i386 hosts running a variety of FreeBSD
versions from 4.5-STABLE to 4.9-STABLE. The NFS server is a Network Appliance
F825 filer running Data ONTAP 6.4.1. The remote volume is just under 1TB in
size.

Four of the hosts run message delivery software and perform mostly writes to
the remotely mounted volume. The remaining six run POP or webmail software and
perform mostly reads from the volume.

Seven of the hosts are on the same LAN as the filer and mount the volume as
NFSv3 over UDP. The remaining three hosts are in a remote datacentre and mount
the volume over TCP. All but two of the hosts log these messages. The two
exceptions are both local hosts which mount the volume over UDP.

The four delivery hosts each do up to 250 NFS operations per second (avg 120)
while the POP hosts each do up to 750 NFS operations per second (avg 500). The
total number of NFS operations the filer handles is up to 7000 per second (avg
3500).
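
As a quick consistency check on these figures, the per-host rates can be summed
and compared with the filer totals. This is only a sketch: it assumes that all
six POP/webmail hosts are covered by the "POP hosts" figures, and the function
name cluster_total is purely illustrative.

```python
DELIVERY_HOSTS = 4
POP_HOSTS = 6  # assumption: all six POP/webmail hosts share the quoted rates


def cluster_total(delivery_rate, pop_rate):
    """Sum per-host NFS operations per second across the ten clients."""
    return DELIVERY_HOSTS * delivery_rate + POP_HOSTS * pop_rate


avg_total = cluster_total(120, 500)    # per-host averages
peak_total = cluster_total(250, 750)   # per-host peaks

print(f"sum of per-host averages: {avg_total} ops/s")
print(f"sum of per-host peaks:    {peak_total} ops/s")
```

The summed averages come to 3480 operations per second, in line with the filer
average of about 3500. The summed per-host peaks come to 5500, which is below
the 7000 per second peak quoted for the filer; the figures above do not
account for that difference.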

As far as I can tell there is no correlation between the type of NFS activity,
the OS revisions on the individual hosts, the number of NFS operations per
client or the NFS transport and the appearance of these log lines in
/var/log/messages.

If this were an NFS server performance issue, I'd expect it to affect all the
NFS clients, but this isn't the case.

We also run a second, similar but smaller cluster, with the same architecture
and software but fewer hosts for another vISP, which doesn't exhibit this
problem. There are two delivery hosts and two POP/webmail hosts. They generate
a maximum of around 1200 NFS operations per second in total.

Other posts I have seen on this subject have suggested checking for local
network problems, exhausted mbufs etc., but I don't believe either to be the
cause. From one client (one of the TCP ones):

ollie@mese:[ollie] (1) # netstat -i
Name  Mtu   Network       Address            Ipkts Ierrs    Opkts Oerrs  Coll
fxp0  1500  <Link#1>    00:50:8b:e0:5b:85 1427046704     0 1406319153     1     0
fxp0  1500  192.168.1/24  mese            1236173208     - 1406332198     -     -
fxp0  1500  pop1.mail/32  pop1.mail       188200155     -       58     -     -
fxp1* 1500  <Link#2>    00:50:8b:e0:5b:3e        0     0        0     0     0
lo0   16384 <Link#3>                          3589     0     3589     0     0
lo0   16384 your-net      localhost           1129     -     1129     -     -

ollie@mese:[ollie] (2) # netstat -m
544/1632/34816 mbufs in use (current/peak/max):
        439 mbufs allocated to data
        105 mbufs allocated to packet headers
370/878/8704 mbuf clusters in use (current/peak/max)
2164 Kbytes allocated to network (8% of mb_map in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

The network interfaces on the clients and servers all operate at Fast Ethernet
speeds in full-duplex, and none is close to being saturated. The NetApp filer
does about 25Mbit/s at peak.

Should these log lines concern me or am I worrying unnecessarily? Has anyone
else experienced any similar behaviour between FreeBSD clients and NetApp
filers?

I am at a loss as to how to investigate this NFS issue further, and would be
glad to receive any advice in that direction.

Yours,

Ollie

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie@uk.clara.net                  020 7903 3065