From: "Arno J. Klaassen"
To: Jeremy Chadwick
Cc: Clayton Milos, Kris Kennaway, stable@freebsd.org, net@freebsd.org
Date: 21 Apr 2008 23:46:52 +0200
Subject: Re: nfs-server silent data corruption
References: <20080421094718.GY25623@hub.freebsd.org> <20080421154333.GA96237@eos.sc1.parodius.com>
In-Reply-To: <20080421154333.GA96237@eos.sc1.parodius.com>
List-Id: Production branch of FreeBSD source code

re,

Jeremy Chadwick writes:

> On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
> > Kris Kennaway writes:
> > > Uh, you're getting server-side data corruption, it could definitely be
> > > because of the memory you added.
> >
> > yop, though I'm still not convinced the memory is bad (the very same
> > Kingston ECC as the 2*1G in use for about half a year already) :
>
> Can you download and run memtest86 on this system, with the added 2G ECC
> installed?  memtest86 doesn't guarantee showing signs of memory problems,
> but in most cases it'll start spewing errors almost immediately.

it finished in a bit less than 3 hours without a single error/warning
I feel pretty confident all memory is fine

> One thing I did notice in the motherboard manual below is something
> called "Hammer Configuration".  It appears to default to 800MHz, but
> there's an "Auto" choice.  Does using Auto fix anything?

Nope

> > I added it directly to the 2nd CPU (diagram on page 9 of
> > http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
> > seems to be the interaction between nfe0 and powerd .... :
>
> That board is the weirdest thing I've seen in years.  ;)

I agree I lifted (?) my eye-brows the first time I saw that diagram

> Two separate CPUs using a single (shared) memory controller, two
> separate (and different!)
> nVidia chipsets, a SMSC I/O controller
> probably used for serial and parallel I/O, two separate nVidia NICs with
> Marvell PHYs (yet somehow you can bridge the two NICs and PHYs?), two
> separate PCI-e busses (each associated with a separate nVidia chipset),
> two separate PCI-X busses... the list continues.

some may say "it's just four wheels, an engine and a steering wheel",
she looks different compared to most others

> I know you don't need opinions at this point, but what a behemoth.  I
> can't imagine that thing running reliably.

though it does ;) (till the day I decided she deserved a -stable
upgrade and 2 more gigs ...)

> > - if I stop powerd, problems go away
>
> This would imply that clock frequency stepping is somehow contributing
> to the corruption.  I don't see any BIOS options for controlling
> things related to AMD's Cool-n-Quiet or PowerNow! feature, which is
> usually what handles this.

you can turn it on/off; anyway, the problem *seems* easy to reproduce
when freq drops quickly from 2600MHz to 1000MHz ....

I only inspected a few corrupted copies, but out of 10-200Mbytes
just 1 byte was 0 instead of \t

> > - I let powerd run but turn off txcsum and tso4 on the interface,
> >   the problem is a lot harder to produce (if ever this gives
> >   a hint to anyone)
>
> Possibly shared interrupts are causing problems?

don't think so; I first had two Promise TX4 cards in this box instead of
the Marvell 8port card; since I had problems with TX4 some time ago
I first suspected them. The board is still running memtest86, but from
the dmesg I posted I don't see a shared irq.

> MSI/MSI-X doing
> something odd?  Have you tried disabling MSI/MSI-X and see if it makes a
> difference?

MSI is disabled as is PCI-e Error reporting (or something like that)

> I think you mean "MAC LAN Bridge", according to the motherboard manual.
> I'm not even sure what that really does; somehow trunks the two NICs
> together to give you the equivalent of 2000mbit of traffic?  I don't
> know.

probably; I never tried ;) I need the second NIC for a separate subnet

> Does the corruption you see go away if you install a separate NIC (e.g.
> an Intel NIC) in a PCI or PCI-e slot, and disable the onboard NICs
> (should be "MAC LAN: Disable" on both the primary and slave)?

Don't have one available right now (for a 2U server). I will test
if I do not find another solution.

Thanx,

Arno
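
For anyone who wants to characterize the corruption the same way (comparing a corrupted copy against the original to see which bytes changed), here is a minimal sketch in Python. The function name diff_bytes, the chunk size and the limit parameter are illustrative choices, not tools used in this thread; cmp -l on the two files would give similar information.

```python
CHUNK = 1 << 20  # read 1 MiB at a time (arbitrary, illustrative choice)


def diff_bytes(path_a, path_b, limit=100):
    """Return up to `limit` tuples (offset, byte_a, byte_b) where the files differ.

    If one file is shorter than the other, a final (offset, None, None)
    entry marks the position where the shorter file ends.
    """
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(diffs) < limit:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if not a and not b:
                break  # both files fully read
            if a != b:
                # Scan only the chunks that actually differ.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((offset + i, x, y))
                        if len(diffs) >= limit:
                            break
                if len(a) != len(b):
                    # Length mismatch: record where the shorter file ends.
                    if len(diffs) < limit:
                        diffs.append((offset + min(len(a), len(b)), None, None))
                    break
            offset += len(a)
    return diffs
```

Calling diff_bytes on the local original and the copy written over NFS returns a list of differing offsets; a single entry with byte values 9 and 0 would correspond to the "0 instead of \t" case described above. Whether the offsets line up with packet or page boundaries might give a further hint, but that is speculation.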