From: Jeremy Chadwick
To: Stephane LAPIE
Cc: freebsd-fs@freebsd.org, freebsd-drivers@freebsd.org, freebsd-hardware@freebsd.org
Date: Sat, 18 Jun 2011 07:45:36 -0700
Subject: Re: Problem with a LSILogic SAS/SATA adapter on 8.2-STABLE/ZFSv28
Message-ID: <20110618144536.GA15627@icarus.home.lan>
In-Reply-To: <4DFCB12A.6030805@darkbsd.org>

On Sat, Jun 18, 2011 at 11:07:38PM +0900, Stephane LAPIE wrote:
> I have a problem with my 8.2-STABLE/ZFSv28 server. I am currently
> upgrading my disks from 1.5TB Seagate drives to 2TB Seagate drives, and
> therefore replacing devices within ZFS.
> (I have activated deduplication on a few file systems, for the record)
>
> I think this is more related to a hardware problem (flaky memory ? flaky
> controller/driver maybe ?), but I would appreciate any input.
>
> I experienced several kernel panics, all of which seem to point at mpt0
> mis-handling interrupts :
> www.darkbsd.org/~darksoul/kernel-panic-mpt1.txt (no target cmd ptrs)
> www.darkbsd.org/~darksoul/kernel-panic-mpt2.txt (mpt_intr index == ...)
> www.darkbsd.org/~darksoul/kernel-panic-mpt3.txt (NMI in kernel mode)
> www.darkbsd.org/~darksoul/kernel-panic-mpt4.txt (LAN CONTEXT REPLY)
> www.darkbsd.org/~darksoul/kernel-panic-mpt5.txt (LAN CONTEXT REPLY)
> www.darkbsd.org/~darksoul/kernel-panic-mpt6.txt (LAN CONTEXT REPLY)
> www.darkbsd.org/~darksoul/kernel-panic-mpt7.txt (LAN CONTEXT REPLY)
>
> I would appreciate any pointers to what on earth "LAN CONTEXT REPLY"
> means for an LSI controller (using driver mpt(4)), as I have no idea,
> and the source was not really helpful.
>
> The error message about an NMI and RAM parity error is what is scaring
> me the most here, and points me in the direction of flaky memory.
>
> This is a personal machine, so I can add debug options and try stuff if
> it can help figure out what is going on. Also, any critical data is
> replicated, backed up and accounted for.

For readers, the NMI and RAM parity error message in question is shown
here:

http://www.darkbsd.org/~darksoul/kernel-panic-mpt2.txt

It is difficult to decode, though, due to the well-established problem
with the FreeBSD kernel interspersing text output. (I imagine this gets
worse the more cores you have on your system, but that's not relevant to
this discussion.)

Anyway, to expand on the "RAM parity error" and NMI message: the
information I'm going to give you isn't specific to the LSI controller;
it applies generally. I've talked about this in the past.
Please read it and focus on the SERR/PERR and NMI details:

http://lists.freebsd.org/pipermail/freebsd-fs/2011-March/010938.html

If you want to rule out actual system RAM issues, I would recommend
running memtest86 for about 30 minutes, and then memtest86+ for the same
amount of time. This might sound crazy ("why can't I just run one?!"),
but you need to review the ChangeLog for memtest86 to see why. Its
support for detecting corrected ECC errors was removed in 4.0, but 4.0
also added multi-CPU support (which is good to have in this situation),
while memtest86+ may still have support for ECC.

Neither of these utilities is as good as a hardware RAM tester (which
does cool things like sending extreme amounts of voltage through each
DRAM module, looking for soft and hard errors, etc.), but those are
expensive. Usually system memory problems will show up in
memtest86/86+ pretty quickly though.

All that said: it may be possible that the NMIs you're seeing aren't
being induced by system RAM issues at all, but are somehow being
generated or caused by the LSI controller. I wasn't under the
impression that a PCIe MSI and/or MSI-X could generate an NMI, but I
could be completely wrong.

You may want to run the memtest86/86+ tests both with and without the
LSI controller plugged into the system to see if there's any
difference. So that's another hour of testing.

Anyway, hope this helps in some regard.

P.S. -- In the future, try to avoid cross-posting. :-)

--
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |