Date: Wed, 27 Jul 2016 16:04:21 -0700
From: Marc Goroff <marc.goroff@quorum.net>
To: <freebsd-fs@freebsd.org>
Subject: Hanging/stalling mountd on heavily loaded NFS server
Message-ID: <98b4db11-8b41-608c-c714-f704a78914b7@quorum.net>
We have a large, busy production NFS server running 10.2 that serves approximately 200 ZFS file systems to production VMs. The system had been very stable until last night, when we attempted to mount new ZFS filesystems on NFS clients. The mountd process hung and client mount requests timed out; the NFS server continued to serve traffic to existing clients during this time. mountd was hung in state nfsv4lck:

[root@zfs-west1 ~]# ps -axgl | grep mount
0 38043 1 0 20 0 63672 17644 nfsv4lck Ds - 0:00.30 /usr/sbin/mountd -r -S /etc/exports /etc/zfs/exports

It remains in this state for an indeterminate amount of time. I once saw it continue after several minutes, but most of the time it stays stuck for 15+ minutes. While stuck, it does not respond to kill -9, although it eventually exits after many minutes. Restarting mountd allows the existing NFS clients to continue (they hang when mountd exits), but any attempt to perform additional NFS mounts pushes mountd back into the bad state.

The problem seems to be related to the number of NFS mounts off the server. If we unmount some of the clients, we can successfully mount the new ZFS filesystems. However, when we attempt to mount all of the production NFS mounts, mountd hangs as above. All clients are using NFS v3 only.

dmesg and /var/log/messages show no errors, and the server seems to be operating normally other than mountd. The NFS server is configured with 256 nfsd threads, 128 GB of RAM, 280 TB of disk split into two zpools, and 12 CPU cores.
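For readers decoding the ps line above: the ninth column is the kernel wait channel (nfsv4lck, the NFS server state lock) and the tenth is the process state; "D" means an uninterruptible kernel sleep, which is why kill -9 cannot take effect until the sleep ends. A small illustrative sketch (the sample line is verbatim from the output above; the awk invocation is mine, not from the post):

```shell
# Sample line captured from `ps -axgl | grep mount` on the server.
line='0 38043 1 0 20 0 63672 17644 nfsv4lck Ds - 0:00.30 /usr/sbin/mountd -r -S /etc/exports /etc/zfs/exports'

# Pull out the wait channel (field 9) and process state (field 10).
echo "$line" | awk '{print "wchan=" $9, "state=" $10}'
# prints: wchan=nfsv4lck state=Ds
```

The "Ds" state reads as D (uninterruptible sleep) plus s (session leader), consistent with mountd ignoring signals while blocked in the kernel.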
Below is the output of 'sysctl -a | grep nfsd' during one of these mountd events:

vfs.nfsd.fha.fhe_stats: hash 13: {
vfs.nfsd.fha.max_reqs_per_nfsd: 0
vfs.nfsd.fha.max_nfsds_per_fh: 8
vfs.nfsd.fha.bin_shift: 22
vfs.nfsd.fha.enable: 1
vfs.nfsd.request_space_throttle_count: 80875
vfs.nfsd.request_space_throttled: 0
vfs.nfsd.request_space_low: 31457280
vfs.nfsd.request_space_high: 47185920
vfs.nfsd.request_space_used_highest: 47841972
vfs.nfsd.request_space_used: 11074576
vfs.nfsd.groups: 2
vfs.nfsd.threads: 256
vfs.nfsd.maxthreads: 256
vfs.nfsd.minthreads: 256
vfs.nfsd.cachetcp: 1
vfs.nfsd.tcpcachetimeo: 43200
vfs.nfsd.udphighwater: 500
vfs.nfsd.tcphighwater: 0
vfs.nfsd.enable_stringtouid: 0
vfs.nfsd.debuglevel: 0
vfs.nfsd.enable_locallocks: 0
vfs.nfsd.issue_delegations: 0
vfs.nfsd.commit_miss: 0
vfs.nfsd.commit_blks: 0
vfs.nfsd.mirrormnt: 1
vfs.nfsd.async: 0
vfs.nfsd.server_max_nfsvers: 3
vfs.nfsd.server_min_nfsvers: 2
vfs.nfsd.nfs_privport: 0
vfs.nfsd.v4statelimit: 500000
vfs.nfsd.sessionhashsize: 20
vfs.nfsd.fhhashsize: 20
vfs.nfsd.clienthashsize: 20
vfs.nfsd.statehashsize: 10
vfs.nfsd.enable_nogroupcheck: 1
vfs.nfsd.enable_nobodycheck: 1
vfs.nfsd.disable_checkutf8: 0

Any suggestions on how to resolve this issue? Since this is a production server, my options for intrusive debugging are very limited.

Thanks,
Marc