From owner-freebsd-hackers@FreeBSD.ORG  Sun Aug  1 01:26:18 2010
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6F1AF106564A;
	Sun,  1 Aug 2010 01:26:18 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca
	[131.104.91.36])
	by mx1.freebsd.org (Postfix) with ESMTP id 156898FC1F;
	Sun,  1 Aug 2010 01:26:17 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ApwEALJjVEyDaFvO/2dsb2JhbACDE54FrRmQSYEmgyBzBIh/
X-IronPort-AV: E=Sophos;i="4.55,296,1278302400"; d="scan'208";a="86942942"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
	([131.104.91.206])
	by esa-annu-pri.mail.uoguelph.ca with ESMTP; 31 Jul 2010 20:57:42 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
	by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id DD135B3EA2;
	Sat, 31 Jul 2010 20:57:42 -0400 (EDT)
Date: Sat, 31 Jul 2010 20:57:42 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: krad <kraduk@googlemail.com>
Message-ID: <995936398.215591.1280624262743.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <AANLkTinUVKByfTX+f9DOQ97jh43VPVSug_=BDpJ9PB0z@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [24.65.230.102]
X-Mailer: Zimbra 6.0.7_GA_2476.RHEL4 (ZimbraWebClient - FF3.0
	(Mac)/6.0.7_GA_2473.RHEL4_64)
Cc: freebsd-hackers@freebsd.org,
	FreeBSD Questions <freebsd-questions@freebsd.org>
Subject: Re: possible NFS lockups
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 01 Aug 2010 01:26:18 -0000

> From: "krad" <kraduk@googlemail.com>
> To: freebsd-hackers@freebsd.org, "FreeBSD Questions" <freebsd-questions@freebsd.org>
> Sent: Tuesday, July 27, 2010 11:29:20 AM
> Subject: possible NFS lockups
> I have a production mail system with an nfs backend. Every now and
> again we
> see the nfs die on a particular head end. However it doesn't die
> across all
> the nodes. This suggests to me there isnt an issue with the filer
> itself and
> the stats from the filer concur with that.
> 
> The symptoms are lines like this appearing in dmesg
> 
> nfs server 10.44.17.138:/vol/vol1/mail: not responding
> nfs server 10.44.17.138:/vol/vol1/mail: is alive again
> 
> trussing df it seems to hang on getfsstat, this is presumably when it
> tries
> the nfs mounts
> 
> eg
> 
> __sysctl(0xbfbfe224,0x2,0xbfbfe22c,0xbfbfe230,0x0,0x0) = 0 (0x0)
> mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 1746583552 (0x681ac000)
> mmap(0x682ac000,344064,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 1747632128 (0x682ac000)
> munmap(0x681ac000,344064) = 0 (0x0)
> getfsstat(0x68201000,0x1270,0x2,0xbfbfe960,0xbfbfe95c,0x1) = 9 (0x9)
> 
> 
> I have played with mount options a fair bit but they dont make much
> difference. This is what they are set to at present
> 
> 10.44.17.138:/vol/vol1/mail /mail/0 nfs rw,noatime,tcp,acdirmax=320,acdirmin=180,acregmax=320,acregmin=180 0 0
> 
> When this locking is occuring I find that if I do a show mount or
> mount
> 10.44.17.138:/vol/vol1/mail again under another mount point I can
> access it
> fine.
> 
> One thing I have just noticed is that lockd and statd always seem to
> have
> died when this happens. Restarting does not help
> 
> 
lockd and statd implement separate protocols (NLM and NSM) that do
locking. The protocols were poorly designed and fundamentally
broken imho. (That refers to the protocols and not the implementation.)

I am not familiar with the lockd and statd implementations, but if you
don't need file locking to work for the same file when it is accessed
concurrently from multiple clients (heads), you can use the "nolockd"
mount option to avoid using them. (I have no idea whether the mail
system you are using will work without lockd. It should be ok to use
"nolockd" if file locking is only done on a given file in one client
node.)
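
For illustration only, here is the fstab entry quoted above with
"nolockd" added to the option list. Everything else is copied from the
quoted entry unchanged; whether the mail system tolerates locks that are
only local to each client is an assumption that has to be checked first.

```
# /etc/fstab: same entry as quoted above, plus nolockd (locks stay local to this client)
10.44.17.138:/vol/vol1/mail /mail/0 nfs rw,noatime,tcp,nolockd,acdirmax=320,acdirmin=180,acregmax=320,acregmin=180 0 0
```

The file system has to be unmounted and mounted again for the changed
option list to take effect.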

I suspect that some interaction between your server and the
lockd/statd client causes them to crash and then the client is
stuck trying to talk to them, but I don't really know. Looking
at where all the processes and threads are sleeping via "ps axlH"
may tell you what is stuck and where.
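
If the "ps axlH" output is long, a small script can group the threads
by the channel they are sleeping on. The following is only a sketch:
group_by_wchan() is a made-up helper name, it assumes the listing has a
header row with PID, MWCHAN and COMMAND columns and that every column
before COMMAND is a single non-empty token, and the sample listing
(including the wait channel names chan_a and chan_b) is invented
placeholder data, not real output.

```python
import sys
from collections import defaultdict


def group_by_wchan(ps_output):
    """Group (pid, command) pairs by wait channel from `ps axlH`-style text.

    Assumes a header row naming PID, MWCHAN and COMMAND, with COMMAND last.
    """
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    pid_col = header.index("PID")
    wchan_col = header.index("MWCHAN")
    cmd_col = header.index("COMMAND")
    groups = defaultdict(list)
    for line in lines[1:]:
        # Split into as many fields as the header has; the command keeps its spaces.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # skip rows that do not match the assumed layout
        pid = int(fields[pid_col])
        groups[fields[wchan_col]].append((pid, fields[cmd_col].rstrip()))
    return dict(groups)


# Invented sample listing; chan_a and chan_b are placeholder wait channel names.
SAMPLE = """\
UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT TT     TIME COMMAND
  0   812     1   0  44  0  5844  1204 chan_a Ss   ??  0:00.10 /usr/sbin/rpc.statd
 26  4711  4700   0  -4  0  9120  3312 chan_b D    ??  0:01.20 imapd: user@example
 26  4712  4700   0  -4  0  9120  3300 chan_b D    ??  0:00.90 imapd: other@example
"""

if __name__ == "__main__":
    # Reads a listing from a pipe, e.g. `ps axlH | python thisfile.py`;
    # falls back to the invented sample when run without piped input.
    text = SAMPLE if sys.stdin.isatty() else sys.stdin.read()
    for wchan, procs in sorted(group_by_wchan(text).items()):
        print(wchan, len(procs), [pid for pid, _ in procs])
```

A wait channel shared by many processes that never make progress is the
one worth looking at first.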

As others noted, intermittent "server not responding...server ok"
messages just indicate slow response from the server and don't
mean much. However, if a given process is hung and doesn't
recover, knowing what it is sleeping on can help w.r.t. diagnosis.

rick