From: Rick Macklem
To: krad
Cc: freebsd-hackers@freebsd.org, FreeBSD Questions
Date: Sat, 31 Jul 2010 20:57:42 -0400 (EDT)
Subject: Re: possible NFS lockups
Message-ID: <995936398.215591.1280624262743.JavaMail.root@erie.cs.uoguelph.ca>

> From: "krad"
> To: freebsd-hackers@freebsd.org, "FreeBSD Questions"
> Sent: Tuesday, July 27, 2010 11:29:20 AM
> Subject: possible NFS lockups
>
> I have a production mail system with an nfs backend.
> Every now and again we see the nfs die on a particular head end.
> However it doesn't die across all the nodes. This suggests to me
> there isn't an issue with the filer itself, and the stats from the
> filer concur with that.
>
> The symptoms are lines like this appearing in dmesg
>
> nfs server 10.44.17.138:/vol/vol1/mail: not responding
> nfs server 10.44.17.138:/vol/vol1/mail: is alive again
>
> trussing df it seems to hang on getfsstat, this is presumably when
> it tries the nfs mounts
>
> eg
>
> __sysctl(0xbfbfe224,0x2,0xbfbfe22c,0xbfbfe230,0x0,0x0) = 0 (0x0)
> mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 1746583552 (0x681ac000)
> mmap(0x682ac000,344064,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 1747632128 (0x682ac000)
> munmap(0x681ac000,344064) = 0 (0x0)
> getfsstat(0x68201000,0x1270,0x2,0xbfbfe960,0xbfbfe95c,0x1) = 9 (0x9)
>
> I have played with mount options a fair bit but they don't make much
> difference. This is what they are set to at present
>
> 10.44.17.138:/vol/vol1/mail /mail/0 nfs rw,noatime,tcp,acdirmax=320,acdirmin=180,acregmax=320,acregmin=180 0 0
>
> When this locking is occurring I find that if I do a showmount or
> mount 10.44.17.138:/vol/vol1/mail again under another mount point I
> can access it fine.
>
> One thing I have just noticed is that lockd and statd always seem to
> have died when this happens. Restarting does not help
>

lockd and statd implement separate protocols (NLM and NSM) that do
locking. The protocols were poorly designed and fundamentally broken
imho. (That refers to the protocols and not the implementation.)

I am not familiar with the lockd and statd implementations, but if you
don't need file locking to work for the same file when it is accessed
concurrently from multiple clients (heads), you can use the "nolockd"
mount option to avoid using them. (I have no idea whether the mail
system you are using will work without lockd.
It should be ok to use "nolockd" if file locking is only done on a
given file in one client node.)

I suspect that some interaction between your server and the lockd/statd
client causes them to crash and then the client is stuck trying to talk
to them, but I don't really know. Looking at where all the processes
and threads are sleeping via "ps axlH" may tell you what is stuck and
where.

As others noted, intermittent "server not responding...server ok"
messages just indicate slow response from the server and don't mean
much. However, if a given process is hung and doesn't recover, knowing
what it is sleeping on can help with diagnosis.

rick
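
A rough sketch of the "ps axlH" suggestion above, in case it helps. It
is not part of any FreeBSD tool: the helper name sleep_summary is made
up, and it assumes the ps header row labels the wait-channel column
MWCHAN (or WCHAN) and that "-" means the thread is not sleeping. It
counts how many threads are sleeping on each wait channel, so a pile-up
on one channel stands out. The commented mount line shows one way to
try "nolockd" on a scratch mount; the mount point /mnt/nolockd-test is
hypothetical.

```shell
#!/bin/sh
# Sketch only: summarise where threads are sleeping, from `ps axlH` output.
# Assumption: the header row names the wait-channel column MWCHAN or WCHAN;
# the column is located by name rather than by fixed position.
sleep_summary() {
    awk '
        NR == 1 {
            # Find the wait-channel column in the header row.
            for (i = 1; i <= NF; i++)
                if ($i == "MWCHAN" || $i == "WCHAN") col = i
            next
        }
        # Count threads per wait channel, skipping "-" (not sleeping).
        col && $col != "-" { count[$col]++ }
        END { for (w in count) print count[w], w }
    ' | sort -k1,1nr -k2,2
}

# Usage on the affected client while a process is hung:
#   ps axlH | sleep_summary
#
# Hypothetical scratch mount that bypasses lockd/statd (made-up mount point):
#   mount -t nfs -o rw,tcp,nolockd 10.44.17.138:/vol/vol1/mail /mnt/nolockd-test
```

The output is one line per wait channel, most common first. It only
narrows down where to look; the full "ps axlH" listing is still needed
to see which processes are the ones sleeping there.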