From: Graham Allan
Organization: Physics, University of Minnesota
Date: Wed, 01 Jul 2015 17:55:58 -0500
To: freebsd-fs@freebsd.org
Subject: Strange NFS problem implicating nfsuserd?
Message-ID: <55946FFE.8070402@physics.umn.edu>

I spent a few days digging into a strange NFSv4 problem at our site. I think I have finally resolved it, but I don't really understand why the fix works.

We have a bunch of large-ish NFS servers running FreeBSD 9.3, exporting ZFS filesystems to mostly "RHEL-clone" Linux clients.

Over the last few weeks I started getting reports that people's jobs would fail erratically with I/O errors, and it became apparent that the failures pointed to all our FreeBSD NFS servers rather than just one. Ultimately I could trivially reproduce the problem by running

    find . -type f -exec cat {} > /dev/null \;

on one of the NFS-mounted filesystems. Linux clients would eventually error with "Input/output error"; FreeBSD clients would eventually error with "Permission denied" on files or directories which should be readable.

Reverting to earlier patch releases didn't make any difference, though it seemed like the problem started roughly when I updated p8->p13. Finally I seem to have pinpointed it to one change made in rc.conf for nfsuserd, which I committed at around the right date:

    nfsuserd_flags="-usermax 500 -usertimeout 600 16"

became:

    nfsuserd_flags="-domain xxx.yyy.zzz -usermax 500 -usertimeout 600 16"

probably because I saw a user mapping failure somewhere previously, and decided to make the domain explicit.

Undoing this change appears to eliminate the problem, but this makes no sense to me. Starting nfsuserd with either set of options (adding -verbose) prints the same output:

    Starting nfsuserd.
    nfsuserd: domain=xxx.yyy.zzz usermax=500 usertimeout=36000

So the domain chosen by default is the same as the one explicitly specified (as I would expect).

I've reproduced this across 4-5 different servers and a similar number of different client systems. Does any plausible explanation suggest itself?

Graham
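
For anyone trying to reproduce this: the find/cat one-liner above only shows errors on the terminal, so it is hard to tell afterwards which paths failed and when. Below is a minimal sketch of the same read test that logs each failed read with a timestamp, which may help to line the failures up with nfsuserd restarts or cache timeouts. The function name read_test and the log file argument are placeholders chosen for illustration, not existing tools, and the log file is assumed to live outside the tree being read.

```shell
#!/bin/sh
# Sketch only: read every regular file under a directory and log failures.
# read_test is a hypothetical helper name, not part of FreeBSD or any NFS tool.
read_test() {
    dir=$1      # directory to walk, e.g. an NFS mount point
    log=$2      # log file; keep it outside "$dir"
    fails=0

    # Collect the file list first so the loop below runs in this shell
    # (a "find | while" pipeline would lose the counter in a subshell).
    # Paths containing newlines are not handled.
    find "$dir" -type f -print > "$log.list" 2>> "$log"

    while IFS= read -r f; do
        # Same read as the original one-liner; cat's error text goes to the log.
        if ! cat "$f" > /dev/null 2>> "$log"; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') read failed: $f" >> "$log"
            fails=$((fails + 1))
        fi
    done < "$log.list"

    rm -f "$log.list"
    # Print the number of files that could not be read.
    echo "$fails"
}
```

After defining the function, something like read_test /path/to/nfs/mount /tmp/nfs-read-test.log (both paths are examples) prints the failure count and leaves the failed paths, with times, in the log.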