Date: Fri, 30 Mar 2012 17:36:50 -0600
From: Josh Beard <josh@signalboxes.net>
To: freebsd-net@freebsd.org
Subject: NFS: rpc.statd/lockd becomes unresponsive
Message-ID: <4F764392.6090400@signalboxes.net>

Hello,

We've recently set up a FreeBSD 9.0-RELEASE (x64) system to test as an NFS server for "live" network homes for Mac clients (mostly 10.5 and 10.6 clients). We're a public school district and normally have around 150-200 users logged in at a time with network homes. Currently, we're using netatalk (AFP) on a Linux box, after migrating from an aging Mac OS X server. Unfortunately, netatalk has some serious performance issues under the load we're putting on it, and we'd like to migrate to NFS.

We've tried several Linux distributions and various kernels, and we're now testing FreeBSD (and tested FreeNAS) with similar setups. Unfortunately, they all suffer from the same issue.

As a test, I have a series of scripts to simulate user activity on the clients (e.g. opening Word, opening a browser, doing some reads/writes with dd, etc). After a while, NFS on the server runs into an issue where (what I think happens) rpc.statd can't talk to rpc.lockd. Being Mac clients, they all get a rather ugly dialog box stating that their connection to the server has been lost.

It's worth mentioning that this server is a KVM 'guest' on a Linux server. I'm aware of some I/O issues there, but I don't have a decent piece of hardware to really test this on. I allocated 4 CPUs and 10GB of RAM to it. I've tested with the virtio net drivers and without. Considering I've seen the same symptoms on around 6 Linux distributions (with various kernels), FreeNAS, and FreeBSD, I wouldn't be surprised to get the same results if I weren't virtualized.

I haven't really done any tuning on the FreeBSD server; it's fairly vanilla.

We have around 2600 machines throughout our campus, with limited remote management capabilities (that's on the big agenda to tackle), so changing NFS mount options there would be rather difficult. These are LDAP accounts with the NFS mounts in LDAP as well, for what it's worth.

The clients mount it pretty vanilla (output of 'mount' on client):

    freenas.dsdk12.schoollocal:/mnt/homes on /net/freenas.dsdk12.schoollocal/mnt/homes (nfs, nodev, nosuid, automounted, nobrowse)

On the server, my /etc/exports looks like this:

    /srv/homes -alldirs -network 172.30.0.0/16

This export doesn't have a lot of data - it's 150 small home directories of test accounts. No other activity is being done on this server. The filesystem is UFS.

/etc/rc.conf on the server:

    rpcbind_enable="YES"
    nfs_server_enable="YES"
    mountd_flags="-r -l"
    nfsd_enable="YES"
    mountd_enable="YES"
    rpc_lockd_enable="YES"
    rpc_statd_enable="YES"
    nfs_server_flags="-t -n 128"

When this occurs, /var/log/messages starts to fill up with this:

    Mar 30 16:35:18 freefs kernel: Failed to contact local NSM - rpc error 5
    Mar 30 16:35:20 freefs rpc.statd: unmon request from localhost, no matching monitor
    Mar 30 16:35:44 freefs rpc.statd: unmon request from localhost, no matching monitor
    -- repeated a few times every few seconds --
    Mar 30 16:54:50 freefs rpc.statd: Unsolicited notification from host hs00508s4434.dsdk12.schoollocal
    Mar 30 16:55:01 freefs rpc.statd: Unsolicited notification from host hs00520s4539.dsdk12.schoollocal
    Mar 30 16:55:10 freefs rpc.statd: Failed to call rpc.statd client at host localhost

nfsstat shortly after a failure:

    Rpc Info:
     TimedOut   Invalid X Replies   Retries  Requests
            0         0         0         0      1208
    Cache Info:
    Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
          177       951       226        28         3         6         0         2
    BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
           49         3        13         5         9         0       148         9

    Server Info:
      Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
       262698    101012   1575347        29   1924761   2172712         0     43792
       Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
        27447         0        21      5596      1691    118073         0   2596146
        Mknod    Fsstat    Fsinfo  PathConf    Commit
            0     83638       108       108    183632
    Server Ret-Failed
                    0
    Server Faults
                0
    Server Cache Stats:
       Inprog      Idem  Non-idem    Misses
            0         0         0   9172982
    Server Write Gathering:
     WriteOps  WriteRPC   Opsaved
      2172712   2172712         0

rpcinfo shortly after a failure:

    program version netid     address                service    owner
    100000    4    tcp       0.0.0.0.0.111          rpcbind    superuser
    100000    3    tcp       0.0.0.0.0.111          rpcbind    superuser
    100000    2    tcp       0.0.0.0.0.111          rpcbind    superuser
    100000    4    udp       0.0.0.0.0.111          rpcbind    superuser
    100000    3    udp       0.0.0.0.0.111          rpcbind    superuser
    100000    2    udp       0.0.0.0.0.111          rpcbind    superuser
    100000    4    tcp6      ::.0.111               rpcbind    superuser
    100000    3    tcp6      ::.0.111               rpcbind    superuser
    100000    4    udp6      ::.0.111               rpcbind    superuser
    100000    3    udp6      ::.0.111               rpcbind    superuser
    100000    4    local     /var/run/rpcbind.sock  rpcbind    superuser
    100000    3    local     /var/run/rpcbind.sock  rpcbind    superuser
    100000    2    local     /var/run/rpcbind.sock  rpcbind    superuser
    100005    1    udp6      ::.2.119               mountd     superuser
    100005    3    udp6      ::.2.119               mountd     superuser
    100005    1    tcp6      ::.2.119               mountd     superuser
    100005    3    tcp6      ::.2.119               mountd     superuser
    100005    1    udp       0.0.0.0.2.119          mountd     superuser
    100005    3    udp       0.0.0.0.2.119          mountd     superuser
    100005    1    tcp       0.0.0.0.2.119          mountd     superuser
    100005    3    tcp       0.0.0.0.2.119          mountd     superuser
    100024    1    udp6      ::.3.191               status     superuser
    100024    1    tcp6      ::.3.191               status     superuser
    100024    1    udp       0.0.0.0.3.191          status     superuser
    100024    1    tcp       0.0.0.0.3.191          status     superuser
    100003    2    tcp       0.0.0.0.8.1            nfs        superuser
    100003    3    tcp       0.0.0.0.8.1            nfs        superuser
    100003    2    tcp6      ::.8.1                 nfs        superuser
    100003    3    tcp6      ::.8.1                 nfs        superuser
    100021    0    udp6      ::.3.248               nlockmgr   superuser
    100021    0    tcp6      ::.2.220               nlockmgr   superuser
    100021    0    udp       0.0.0.0.3.202          nlockmgr   superuser
    100021    0    tcp       0.0.0.0.2.255          nlockmgr   superuser
    100021    1    udp6      ::.3.248               nlockmgr   superuser
    100021    1    tcp6      ::.2.220               nlockmgr   superuser
    100021    1    udp       0.0.0.0.3.202          nlockmgr   superuser
    100021    1    tcp       0.0.0.0.2.255          nlockmgr   superuser
    100021    3    udp6      ::.3.248               nlockmgr   superuser
    100021    3    tcp6      ::.2.220               nlockmgr   superuser
    100021    3    udp       0.0.0.0.3.202          nlockmgr   superuser
    100021    3    tcp       0.0.0.0.2.255          nlockmgr   superuser
    100021    4    udp6      ::.3.248               nlockmgr   superuser
    100021    4    tcp6      ::.2.220               nlockmgr   superuser
    100021    4    udp       0.0.0.0.3.202          nlockmgr   superuser
    100021    4    tcp       0.0.0.0.2.255          nlockmgr   superuser
    300019    1    tcp       0.0.0.0.2.185          amd        superuser
    300019    1    udp       0.0.0.0.2.162          amd        superuser

The load can get fairly high during my 'stress' tests, but not *that* high. I'm surprised to see these particular symptoms, which affect every connected user at the same time; I would expect slowdowns rather than the issue I'm seeing.

Any ideas or nudges in the right direction are most welcome. This is severely plaguing us and our students :\

Thanks,
Josh
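
A note for anyone reading the rpcinfo addresses above: rpcbind reports each endpoint as a "universal address", in which the last two dot-separated fields are the high and low bytes of the port number. The short Python sketch below decodes a few of the entries from that listing, which may help when matching these services against firewall rules or packet captures. The function name uaddr_to_port is purely illustrative and not part of any existing tool.

```python
def uaddr_to_port(uaddr: str) -> int:
    """Return the port encoded in the last two fields of a universal address.

    Works for both the IPv4 form (h1.h2.h3.h4.p1.p2) and the IPv6 form
    (e.g. "::.p1.p2"), since only the final two fields are inspected.
    """
    high, low = uaddr.rsplit(".", 2)[-2:]
    return int(high) * 256 + int(low)


# Addresses copied from the rpcinfo listing in this message.
samples = [
    ("rpcbind", "0.0.0.0.0.111"),
    ("nfs", "0.0.0.0.8.1"),
    ("status", "::.3.191"),
    ("nlockmgr", "0.0.0.0.3.202"),
]

for service, uaddr in samples:
    print(f"{service:9s} {uaddr:15s} port {uaddr_to_port(uaddr)}")
```

Read this way, the nfs entries decode to port 2049, and the nlockmgr UDP endpoint to port 970.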