From owner-freebsd-stable@FreeBSD.ORG Mon Jul 9 22:01:26 2012
From: "Arno J. Klaassen"
To: Vincent Hoffman
Cc: freebsd-stable@freebsd.org
Subject: Re: nfs-bug when server for 9-Stable becomes client as well ?
Date: Tue, 10 Jul 2012 00:00:23 +0200
References: <4FF7055D.9000507@unsane.co.uk> <4FF76066.1000401@unsane.co.uk>
In-Reply-To: <4FF76066.1000401@unsane.co.uk> (Vincent Hoffman's message of "Fri, 06 Jul 2012 23:02:14 +0100")
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (berkeley-unix)

Vincent Hoffman writes:

> On 06/07/2012 18:51, Arno J. Klaassen wrote:
>> Vincent Hoffman writes:
>>
>>> On 06/07/2012 14:19, Arno J. Klaassen wrote:
>>>> Hello,
>>>>
>>>> looks like I discovered a probable bug in the nfs-code, very
>>>> easy to reproduce in my setup :
>>>>
>>>>
>>>> Machine-1 : Today's 9-stable, exporting /files (ufs) and /z2 (zfs)
>>>>
>>>> Machine-2 : 8-stable as of April the 10th exporting /raid1
>>>>
>>>> On Machine-1 I mount /raid1 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
>>>> and start a script on this mount looping something like :
>>>>
>>>>   dd if=/dev/random of=BIG bs=1048576 count=${SIZE}
>>>>   cp -fp BIG BIG2
>>>>   cmp -x BIG BIG2
>>>>
>>>> I let this run for 24 hours (from time to time stressing Machine-1 with
>>>> other scripts, including provoking heavy swapping), no problem at all.
>>>>
>>>> However, then I mount /z2 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
>>>> on Machine-2, and *immediately* the above loop on Machine-1 fails :
>>>>
>>>>   Copying file ...cp: BIG: Permission denied
>>>>
>>>> No console messages this time, last time I got
>>>>
>>>>   kernel: nfs_getpages: error 13
>>>>   kernel: vm_fault: pager read error, pid 87803 (cmp)
>>>>
>>>> on Machine-1.
>>>>
>>>> I repeated this scenario by replacing Machine-2 with a good old
>>>> 6-4-stable one, same outcome.
>>>>
>>>> Please tell me what I could do to nail this down a bit more.
>>> It's possible (although not definite) that you have hit a mountd bug
>>> as documented in PRs
>>>
>>> kern/131342
>>> kern/136865
>> especially kern/131342 looks similar and quite old; funny I never hit
>> this before, I basically do the same tests since 'ages' on each new box.
>> Could be that faster network/cpu reveals some race condition; I notice
>> as well that this server is the first (IIRC) that uses 3 different IRQs
>> for network interrupts (em(4) Intel(R) PRO/1000).
> Certainly possible and seems reasonable enough.

Just my $0.02: I glanced at kern/131342, and it looks like the culprit
should be something like a 'non-atomic' operation in between invalidating
the old /etc/exports and validating the new /etc/exports.

I wonder whether simply verifying that /var/run/mountd.pid is newer than
/etc/exports, and skipping that operation if so, would be an acceptable
band-aid (if I understood correctly, a rewrite of mountd correcting this,
amongst other things, is close to hitting -current?).

>>> I've recently asked on -CURRENT about this and had a patch to try from
>>> Rick, I'm testing it now but it doesn't seem to fix it for me, just
>>> improve it, although I'm trying to get enough runs to be a valid sample.
>>> (see
>>> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=377627+0+archive/2012/freebsd-current/20120701.freebsd-current
>>> )
>>>
>>> What I did for my production nas was edit mount.c so it didn't send a
>>> SIGHUP to mountd as suggested by Rick, as it was easy to do and non
>>> intrusive.
>> hmm, this means I should patch each fbsd-client, no? May be easier to
>> patch mountd to ignore SIGHUP and use some non-standard signal to force
>> re-init?
> No just patch /sbin/mount on the nfs server so it doesn't send the SIGHUP
> to mountd.

[In my case] it's the mount on a client which causes the server to fail,
so I don't see how patching /sbin/mount on the nfs server should fix this?

I don't remember whether it's possible to distinguish a -1 signal sent
from a process from one sent from a terminal; if it is, another band-aid
would be to ignore the ones sent from a process altogether?

Merci

Arno

> you can manually HUP mountd if needed.
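
For reference, a minimal sketch of what "manually HUP mountd" could look like on the server. The helper name hup_from_pidfile is hypothetical (not part of FreeBSD), and the default pid file path is simply the /var/run/mountd.pid mentioned earlier in this thread; treat both as assumptions to adapt.

```shell
#!/bin/sh
# Hypothetical helper (not a FreeBSD tool): send SIGHUP to the process
# whose PID is stored in a pid file. The default path is an assumption
# taken from the /var/run/mountd.pid mentioned in this thread.
hup_from_pidfile() {
    pidfile="${1:-/var/run/mountd.pid}"

    # Refuse to guess when the pid file is missing or unreadable.
    if [ ! -r "$pidfile" ]; then
        echo "cannot read pid file: $pidfile" >&2
        return 1
    fi

    pid=$(cat "$pidfile")

    # Only accept a plain decimal PID.
    case "$pid" in
        ''|*[!0-9]*)
            echo "unexpected content in $pidfile" >&2
            return 1
            ;;
    esac

    # SIGHUP is signal 1, i.e. the "-1 signal" discussed above.
    kill -HUP "$pid"
}
```

Run as root on the NFS server, after editing /etc/exports, this would be called as `hup_from_pidfile` with no argument. It only wraps cat and kill, so it does nothing about the race itself.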
>>
>> Arno
>>
>>
>>> Vince
>>>
>>>> Thanx in advance,
>>>>
>>>> Best, Arno
>
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
>

-- 
Arno J. Klaassen

SCITO S.A.
8 rue des Haies
F-75020 Paris, France
http://scito.com
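
P.S. The timestamp band-aid floated in this message (skip the reload when /var/run/mountd.pid is newer than /etc/exports) can be sketched from the shell side as below. This is only an illustration of the comparison, not a patch to mountd: the function name reload_can_be_skipped is hypothetical, the default paths are the ones named in this thread, and the -nt operator is a common test(1) extension rather than strict POSIX.

```shell
#!/bin/sh
# Hypothetical check (not part of mountd): succeed only when the pid file
# is strictly newer than the exports file, i.e. the exports file has not
# been modified since the daemon wrote its pid file.
reload_can_be_skipped() {
    exports="${1:-/etc/exports}"
    pidfile="${2:-/var/run/mountd.pid}"

    # Be conservative: if either file is missing, never skip the reload.
    if [ ! -e "$exports" ] || [ ! -e "$pidfile" ]; then
        return 1
    fi

    # -nt compares modification times; equal timestamps do not count
    # as newer, so a same-second edit still triggers a reload.
    [ "$pidfile" -nt "$exports" ]
}

# Example use, left commented out so that sourcing this file has no effect:
# if reload_can_be_skipped; then
#     echo "exports unchanged since mountd started, skipping reload"
# fi
```

One limitation of this sketch: the pid file's timestamp presumably reflects when the daemon started, not when it last re-read the exports, so after the first edit of /etc/exports the check would always ask for a reload until mountd is restarted.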