From: Paul Mather
To: Rick Macklem
Cc: Ronald Klop, freebsd-fs@freebsd.org, freebsd-arm@freebsd.org
Subject: Re: Unstable NFS on recent CURRENT
Date: Thu, 10 Mar 2016 09:29:25 -0500

On Mar 9, 2016, at 8:59 PM, Rick Macklem wrote:

> Paul Mather wrote:
>> On Mar 8, 2016, at 7:49 PM, Rick Macklem wrote:
>>
>>> Paul Mather wrote:
>>>> On Mar 7, 2016, at 9:55 PM, Rick Macklem wrote:
>>>>
>>>>> Paul Mather (forwarded by Ronald Klop) wrote:
>>>>>> On Sun, 06 Mar 2016 02:57:03 +0100, Paul Mather wrote:
>>>>>>
>>>>>>> On my BeagleBone Black running 11-CURRENT (r296162) lately I have
>>>>>>> been having trouble with NFS.  I have been doing a buildworld and
>>>>>>> buildkernel with /usr/src and /usr/obj mounted via NFS.  Recently,
>>>>>>> this process has resulted in the buildworld failing at some point,
>>>>>>> with a variety of errors (Segmentation fault; Permission denied;
>>>>>>> etc.).  Even a "ls -alR" of /usr/src doesn't manage to complete.
>>>>>>> It errors out thus:
>>>>>>>
>>>>>>> =====
>>>>>>> [[...]]
>>>>>>> total 0
>>>>>>> ls: ./.svn/pristine/fe: Permission denied
>>>>>>>
>>>>>>> ./.svn/pristine/ff:
>>>>>>> total 0
>>>>>>> ls: ./.svn/pristine/ff: Permission denied
>>>>>>> ls: fts_read: Permission denied
>>>>>>> =====
>>>>>>>
>>>>>>> On the console, I get the following:
>>>>>>>
>>>>>>> newnfs: server 'chumby.chumby.lan' error: fileid changed. fsid
>>>>>>> 94790777:a4385de: expected fileid 0x4, got 0x2. (BROKEN NFS SERVER
>>>>>>> OR MIDDLEWARE)
>>>>>>>
>>> Oh, I had forgotten this. Here's the comment related to this error.
>>> (about line#445 in sys/fs/nfsclient/nfs_clport.c):
>>>  446   * BROKEN NFS SERVER OR MIDDLEWARE
>>>  447   *
>>>  448   * Certain NFS servers (certain old proprietary filers ca.
>>>  449   * 2006) or broken middleboxes (e.g. WAN accelerator products)
>>>  450   * will respond to GETATTR requests with results for a
>>>  451   * different fileid.
>>>  452   *
>>>  453   * The WAN accelerator we've observed not only serves stale
>>>  454   * cache results for a given file, it also occasionally serves
>>>  455   * results for wholly different files.  This causes surprising
>>>  456   * problems; for example the cached size attribute of a file
>>>  457   * may truncate down and then back up, resulting in zero
>>>  458   * regions in file contents read by applications.  We observed
>>>  459   * this reliably with Clang and .c files during parallel build.
>>>  460   * A pcap revealed packet fragmentation and GETATTR RPC
>>>  461   * responses with wholly wrong fileids.
>>>
>>> If you can connect the client->server with a simple switch (or just an
>>> RJ45 cable), it might be worth testing that way. (I don't recall the
>>> name of the middleware product, but I think it was shipped by one of
>>> the major switch vendors. I also don't know if the product supports
>>> NFSv4?)
>>>
>>> rick
>>
>>
>> Currently, the client is connected to the server via a dumb gigabit
>> switch, so it is already fairly direct.
>>
>> As for the above error, it appeared on the console only once.  (Sorry if
>> I made it sound like it appears every time.)
>>
>> I just tried another buildworld attempt via NFS and it failed again.
>> This time, I get this on the BeagleBone Black console:
>>
>> nfs_getpages: error 13
>> vm_fault: pager read error, pid 5401 (install)
>>
> 13 is EACCES and could be caused by what I mention below. (Any mount of a
> file system on the server unless "-S" is specified as a flag for mountd.)
>
>>
>> The other thing I have noticed is that if I induce heavy load on the NFS
>> server---e.g., by starting a Poudriere bulk build---then that provokes
>> the client to crash much more readily.  For example, I started a NFS
>> buildworld on the BeagleBone Black, and it seemed to be chugging along
>> nicely.
>> The moment I kicked off a Poudriere build update of my packages on the
>> NFS server, it crashed the buildworld on the NFS client.
>>
> Try adding "-S" to mountd_flags on the server. Any time file systems are
> mounted (and Poudriere likes to do that, I am told), mount sends a SIGHUP
> to mountd to reload /etc/exports. When /etc/exports are being reloaded,
> there will be access errors for mounts (that are temporarily not
> exported) unless you specify "-S" (which makes mountd suspend the nfsd
> threads during the reload of /etc/exports).
>
> rick

Bingo!  I think we may have a winner.  I added that flag to mountd_flags on
the server and the "instability" appears to have gone away.

It may be that all along the NFS problems on the client just coincided with
Poudriere runs on the server.  I build custom packages for my local
machines using Poudriere, so I use it quite a lot.  Maybe the Poudriere port
should come with a warning at install to those using NFS that it may
provoke disruption and suggest the addition of "-S"?  (Alternatively, maybe
"-S" could become a default for mountd_flags?  Is there a downside to
using it that would make it unsuitable as a default?)

Anyway, many, many thanks for all the help, Rick.  I'll keep monitoring my
BeagleBone Black, but it looks for now as though this has solved the NFS
"instability."

Cheers,

Paul.
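
For anyone finding this thread later, here is a minimal sketch of the
server-side change discussed above. It assumes mountd's flags are set in
/etc/rc.conf on the NFS server. The "-r" is only an assumed placeholder for
whatever flags were already there (the thread does not show the actual
line); the only part taken from the discussion is the added "-S".

```shell
# /etc/rc.conf on the NFS server (sketch, not the exact line used in this thread).
# "-r" is an assumed placeholder for any flags already set;
# "-S" is the flag suggested above, which makes mountd suspend the nfsd
# threads while /etc/exports is being reloaded.
mountd_flags="-r -S"
```

mountd has to be restarted before the new flag takes effect, for example
with "service mountd restart" on the server.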