From owner-freebsd-fs@FreeBSD.ORG Fri Mar 30 23:51:48 2012
Message-ID: <4F764712.2010407@signalboxes.net>
Date: Fri, 30 Mar 2012 17:51:46 -0600
From: Josh Beard <josh@signalboxes.net>
To: freebsd-fs@freebsd.org
Subject: NFS: rpc.statd/lockd becomes unresponsive
List-Id: Filesystems

Originally sent to freebsd-net, but I realized this is probably a more
appropriate list. Sorry!

Hello,

We've recently set up a FreeBSD 9.0-RELEASE (x64) system to test as an NFS
server hosting "live" network home directories for Mac clients (mostly 10.5
and 10.6). We're a public school district and normally have around 150-200
users logged in at a time with network homes. Currently we're using netatalk
(AFP) on a Linux box, after migrating from an aging Mac OS X server.
Unfortunately, netatalk has serious performance problems under our load, and
we'd like to migrate to NFS.
We've tried several Linux distributions with various kernels, and we're now
testing FreeBSD (we also tested FreeNAS) with similar setups. Unfortunately,
they all suffer from the same issue.

As a test, I have a series of scripts that simulate user activity on the
clients (e.g. opening Word, opening a browser, doing some reads/writes with
dd, etc.). After a while, NFS on the server runs into a state where (what I
think happens) rpc.statd can't talk to rpc.lockd. Being Mac clients, they
all get a rather ugly dialog box stating that their connection to the server
has been lost.

It's worth mentioning that this server is a KVM guest on a Linux host. I'm
aware of some I/O issues there, but I don't have a decent piece of hardware
to really test this on. I allocated 4 CPUs and 10 GB of RAM, and I've tested
both with and without the virtio net drivers. Considering I've seen the same
symptoms on around six Linux distributions with various kernels, on FreeNAS,
and on FreeBSD, I wouldn't be surprised to get the same results if I weren't
virtualized.

I haven't really done any tuning on the FreeBSD server; it's fairly vanilla.
We have around 2600 machines throughout our campus, with limited remote
management capabilities (that's on the big agenda to tackle), so changing
NFS mount options on the clients would be rather difficult. These are LDAP
accounts, with the NFS mounts defined in LDAP as well, for what it's worth.
The clients mount it pretty vanilla (output of 'mount' on a client):

freenas.dsdk12.schoollocal:/mnt/homes on /net/freenas.dsdk12.schoollocal/mnt/homes (nfs, nodev, nosuid, automounted, nobrowse)

On the server, my /etc/exports looks like this:

/srv/homes -alldirs -network 172.30.0.0/16

This export doesn't hold a lot of data: 150 small home directories for test
accounts. No other activity is happening on this server. The filesystem is
UFS.
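For reference, the per-client activity simulation is roughly of this shape.
This is only a minimal sketch: the target directory, file sizes, and loop
count here are placeholders, and on a real client the target would point
into the automounted NFS home rather than a local temp directory.

```shell
#!/bin/sh
# Sketch of the read/write portion of the client stress loop.
# TARGET is an assumption; defaults to a throwaway local directory so the
# script can be dry-run anywhere.
TARGET="${1:-$(mktemp -d)}"

i=0
while [ "$i" -lt 3 ]; do
    f="$TARGET/stress.$$.$i"
    dd if=/dev/zero of="$f" bs=65536 count=16 2>/dev/null   # burst of writes
    dd if="$f" of=/dev/null bs=65536 2>/dev/null            # read it back
    rm -f "$f"
    i=$((i + 1))
done
echo "completed stress pass in $TARGET"
```

The real scripts additionally drive applications (Word, a browser), which is
where the lock traffic to rpc.lockd comes from; plain dd alone doesn't take
NLM locks.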
/etc/rc.conf on the server:

rpcbind_enable="YES"
nfs_server_enable="YES"
mountd_flags="-r -l"
nfsd_enable="YES"
mountd_enable="YES"
rpc_lockd_enable="YES"
rpc_statd_enable="YES"
nfs_server_flags="-t -n 128"

When this occurs, /var/log/messages starts to fill up with this:

Mar 30 16:35:18 freefs kernel: Failed to contact local NSM - rpc error 5
Mar 30 16:35:20 freefs rpc.statd: unmon request from localhost, no matching monitor
Mar 30 16:35:44 freefs rpc.statd: unmon request from localhost, no matching monitor
-- repeated a few times every few seconds --
Mar 30 16:54:50 freefs rpc.statd: Unsolicited notification from host hs00508s4434.dsdk12.schoollocal
Mar 30 16:55:01 freefs rpc.statd: Unsolicited notification from host hs00520s4539.dsdk12.schoollocal
Mar 30 16:55:10 freefs rpc.statd: Failed to call rpc.statd client at host localhost

nfsstat shortly after a failure:

Rpc Info:
 TimedOut   Invalid X Replies   Retries  Requests
        0         0         0         0      1208
Cache Info:
Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
      177       951       226        28         3         6         0         2
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
       49         3        13         5         9         0       148         9

Server Info:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
   262698    101012   1575347        29   1924761   2172712         0     43792
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
    27447         0        21      5596      1691    118073         0   2596146
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0     83638       108       108    183632
Server Ret-Failed
        0
Server Faults
        0
Server Cache Stats:
   Inprog      Idem  Non-idem    Misses
        0         0         0   9172982
Server Write Gathering:
 WriteOps  WriteRPC   Opsaved
  2172712   2172712         0

rpcinfo shortly after a failure:

   program version netid   address                service    owner
    100000    4    tcp     0.0.0.0.0.111          rpcbind    superuser
    100000    3    tcp     0.0.0.0.0.111          rpcbind    superuser
    100000    2    tcp     0.0.0.0.0.111          rpcbind    superuser
    100000    4    udp     0.0.0.0.0.111          rpcbind    superuser
    100000    3    udp     0.0.0.0.0.111          rpcbind    superuser
    100000    2    udp     0.0.0.0.0.111          rpcbind    superuser
    100000    4    tcp6    ::.0.111               rpcbind    superuser
    100000    3    tcp6    ::.0.111               rpcbind    superuser
    100000    4    udp6    ::.0.111               rpcbind    superuser
    100000    3    udp6    ::.0.111               rpcbind    superuser
    100000    4    local   /var/run/rpcbind.sock  rpcbind    superuser
    100000    3    local   /var/run/rpcbind.sock  rpcbind    superuser
    100000    2    local   /var/run/rpcbind.sock  rpcbind    superuser
    100005    1    udp6    ::.2.119               mountd     superuser
    100005    3    udp6    ::.2.119               mountd     superuser
    100005    1    tcp6    ::.2.119               mountd     superuser
    100005    3    tcp6    ::.2.119               mountd     superuser
    100005    1    udp     0.0.0.0.2.119          mountd     superuser
    100005    3    udp     0.0.0.0.2.119          mountd     superuser
    100005    1    tcp     0.0.0.0.2.119          mountd     superuser
    100005    3    tcp     0.0.0.0.2.119          mountd     superuser
    100024    1    udp6    ::.3.191               status     superuser
    100024    1    tcp6    ::.3.191               status     superuser
    100024    1    udp     0.0.0.0.3.191          status     superuser
    100024    1    tcp     0.0.0.0.3.191          status     superuser
    100003    2    tcp     0.0.0.0.8.1            nfs        superuser
    100003    3    tcp     0.0.0.0.8.1            nfs        superuser
    100003    2    tcp6    ::.8.1                 nfs        superuser
    100003    3    tcp6    ::.8.1                 nfs        superuser
    100021    0    udp6    ::.3.248               nlockmgr   superuser
    100021    0    tcp6    ::.2.220               nlockmgr   superuser
    100021    0    udp     0.0.0.0.3.202          nlockmgr   superuser
    100021    0    tcp     0.0.0.0.2.255          nlockmgr   superuser
    100021    1    udp6    ::.3.248               nlockmgr   superuser
    100021    1    tcp6    ::.2.220               nlockmgr   superuser
    100021    1    udp     0.0.0.0.3.202          nlockmgr   superuser
    100021    1    tcp     0.0.0.0.2.255          nlockmgr   superuser
    100021    3    udp6    ::.3.248               nlockmgr   superuser
    100021    3    tcp6    ::.2.220               nlockmgr   superuser
    100021    3    udp     0.0.0.0.3.202          nlockmgr   superuser
    100021    3    tcp     0.0.0.0.2.255          nlockmgr   superuser
    100021    4    udp6    ::.3.248               nlockmgr   superuser
    100021    4    tcp6    ::.2.220               nlockmgr   superuser
    100021    4    udp     0.0.0.0.3.202          nlockmgr   superuser
    100021    4    tcp     0.0.0.0.2.255          nlockmgr   superuser
    300019    1    tcp     0.0.0.0.2.185          amd        superuser
    300019    1    udp     0.0.0.0.2.162          amd        superuser

The load can get fairly high during my 'stress' tests, but not *that* high.
I'm surprised to see symptoms that hit every connected user at the same
time; I would expect slowdowns rather than the outright failure I'm seeing.

Any ideas or nudges in the right direction are most welcome.
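To see whether the onset of the statd errors lines up with the load spikes,
the /var/log/messages excerpt above can be tallied per minute. This helper
is hypothetical (not part of our setup), shown here running on two of the
quoted log lines:

```shell
#!/bin/sh
# Count rpc.statd "no matching monitor" errors per minute in a syslog file.
statd_errors_per_minute() {
    grep 'no matching monitor' "$1" \
        | awk '{ print $1, $2, substr($3, 1, 5) }' \
        | sort | uniq -c
}

# Demonstrate on the two log lines quoted above.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
Mar 30 16:35:20 freefs rpc.statd: unmon request from localhost, no matching monitor
Mar 30 16:35:44 freefs rpc.statd: unmon request from localhost, no matching monitor
EOF
statd_errors_per_minute "$SAMPLE"   # both fall in minute 16:35
```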
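A note on reading the rpcinfo dump above: rpcbind's dotted "universal
addresses" encode the port number in the last two fields (port = hi*256 +
lo), so e.g. nlockmgr's 0.0.0.0.2.255 is TCP port 767. A small decoding
helper:

```shell
#!/bin/sh
# Decode the port from an rpcbind universal address (last two dotted fields).
uaddr_port() {
    hi=${1%.*}; hi=${hi##*.}    # second-to-last dotted field
    lo=${1##*.}                 # last dotted field
    echo $((hi * 256 + lo))
}

uaddr_port 0.0.0.0.0.111   # rpcbind      -> 111
uaddr_port 0.0.0.0.2.255   # nlockmgr/tcp -> 767
uaddr_port 0.0.0.0.3.191   # status       -> 959
```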
This is severely plaguing us and our students :\

Thanks,
Josh