From owner-freebsd-fs@FreeBSD.ORG Fri Mar 11 00:18:00 2011
Date: Thu, 10 Mar 2011 16:17:55 -0800
From: Jeremy Chadwick
To: Stephen McKay
Cc: freebsd-fs@freebsd.org
Subject: Re: Constant minor ZFS corruption
Message-ID: <20110311001755.GB9136@icarus.home.lan>
References: <201103081425.p28EPQtM002115@dungeon.home> <201103091241.p29CfUM1003302@dungeon.home> <201103102319.p2ANJWxN002125@dungeon.home>
In-Reply-To: <201103102319.p2ANJWxN002125@dungeon.home>

On Fri, Mar 11, 2011 at 09:19:32AM +1000, Stephen McKay wrote:
> On Thursday, 10th March 2011, Chris Forgeron wrote:
> >Lastly, check what Mike Tancsa said about his hardware - All of my
> >gear is quality, 1000W dual redundant power supplies, LSI SAS
> >controllers, ECC registered ram, no overclocking, etc, etc. You may
> >have a software issue, but it's more likely that ZFS is just exposing
> >some instability in your system. Has your RAM checked out with a Memtest
> >run overnight? We're talking small, intermittent errors here, not big
> >red flags that will be obvious to spot.
>
> The ASUS PIKE2008 card is LSI based. Our RAM is ECC. We're not
> overclocking (in fact I disabled turbo-boost). We haven't run memtest
> but we have done a few "make buildworld" runs. All of these completed
> without error. And with ECC RAM, we should see log messages if anything
> is wrong there anyway.

Specifically with regards to your last sentence: you're making blind
assumptions here. Let me talk a bit about how ECC RAM errors are
reported to the motherboard and how all of that works. (Also -- calling
John Baldwin to come in here and correct me if I'm wrong, because over
the years I've had to piece all of this together myself, and I could
obviously have parts wrong. :-) )

When a correctable or uncorrectable error (of either the single-bit or
multi-bit type) is witnessed on ECC RAM, the memory controller can
(doesn't have to!) throw, on the PCI bus, what's called a PERR or SERR
signal. The BIOS controls this capability, and what PERR/SERR can get
turned into. Some BIOSes permit you to tie these signals to an
interrupt (usually some form of NMI). The operating system's kernel has
to be written to understand this NMI and handle it appropriately.

So you have the following pieces that are required for the OS to report
an ECC error:

1) Use of ECC RAM,
2) A memory controller on your motherboard (or possibly the MCH is
   within the CPU, such as on newer Core iX CPUs or some Xeons) that
   supports throwing PERR# and SERR# signals,
3) A BIOS that can set up NMI generation on PERR or SERR,
4) An operating system that knows how to handle that NMI.
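
To make the PERR/SERR part less abstract, here is a minimal sketch in
Python of what "enabled" and "signalled" look like at the PCI
configuration-space level. It assumes you already have the raw 16-bit
Command and Status register values for a device (how you obtain them is
out of scope here). The function name and the dictionary keys are made
up for illustration; the bit positions are the standard ones from the
PCI specification.

```python
# PCI Command register (config offset 0x04) bits relevant to PERR#/SERR#.
CMD_PARITY_ERROR_RESPONSE = 1 << 6   # device responds to detected parity errors
CMD_SERR_ENABLE = 1 << 8             # device is allowed to assert SERR#

# PCI Status register (config offset 0x06) bits relevant to PERR#/SERR#.
STAT_MASTER_DATA_PARITY_ERROR = 1 << 8
STAT_SIGNALED_SYSTEM_ERROR = 1 << 14  # device has asserted SERR#
STAT_DETECTED_PARITY_ERROR = 1 << 15  # device has seen a parity error


def describe_pci_error_bits(command, status):
    """Hypothetical helper: summarise PERR#/SERR# state from raw registers."""
    return {
        "perr_response_enabled": bool(command & CMD_PARITY_ERROR_RESPONSE),
        "serr_enabled": bool(command & CMD_SERR_ENABLE),
        "master_data_parity_error": bool(status & STAT_MASTER_DATA_PARITY_ERROR),
        "signaled_system_error": bool(status & STAT_SIGNALED_SYSTEM_ERROR),
        "detected_parity_error": bool(status & STAT_DETECTED_PARITY_ERROR),
    }


if __name__ == "__main__":
    # Example values only; not taken from any real device.
    for name, value in describe_pci_error_bits(0x0146, 0x4010).items():
        print(f"{name}: {value}")
```

If the enable bits in the Command register are clear, the device will
never raise the signal in the first place, no matter what the BIOS or
OS would do with it afterwards.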

There are a LOT of motherboards out there which "support ECC", but what
they mean to say is "our board works with ECC RAM, but if there are
uncorrected bit errors we didn't implement any mechanisms to tell the
underlying OS, lolz". Lots of consumer-grade boards that claim to work
with either ECC or non-ECC RAM do this. You won't find the BIOS tweaks
in there, and Technical Support will just tell you "yes board X works
with ECC". Lovely situation.

Does FreeBSD support the above? I have absolutely no idea. The only
systems I've used which can generate an NMI on PERR or SERR are Tyan
boards (we use them at work), and all those systems run Solaris.
Solaris also has really good MCA support -- more on that next.

Now, there's also another possibility/mechanism, which is MCA (Machine
Check Architecture). MCA is implemented by the processor itself and
covers quite a vast number of hardware events of all ranges (some
minor, some major). The CPU raises an MCE (Machine Check Exception)
when there's any sort of memory error, among other things. The OS has
to have support for handling MCA, and also has to provide decent
details of the MCE.

Decoding MCEs is tricky, especially on FreeBSD. John Baldwin has made
some patches for getting Linux's mcelog working -- well, the log
parsing part -- on FreeBSD (but they're slightly out of date; I can
provide more recent patches if need be). Don't expect direct DMI to
work on FreeBSD with mcelog, for example.

So with this situation we now have:

1) CPU has to support MCA,
2) OS has to support MCA and know how to decode MCEs properly,
3) Utilities to decode MCEs correctly.

FreeBSD 8.x does support MCA (it's enabled by default), and if you skim
the -stable list you'll find people occasionally trying to figure out
why their system is spewing these mysterious MCEs and what they mean.
MCA is only available, however, if your CPU supports it, and my gut
feeling says the motherboard has to have supporting pieces integrated
as well.
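
To give a feel for what "decoding an MCE" involves, here is a minimal
sketch in Python that splits a raw 64-bit machine-check bank status
value (IA32_MCi_STATUS) into its architectural fields. The function
name and dictionary keys are hypothetical, and the bit layout is the
architectural one as I understand it from Intel's documentation; the
model-specific parts, which are what make real decoding painful, are
deliberately left as an opaque number. It is not a replacement for
mcelog.

```python
def decode_mci_status(status):
    """Hypothetical helper: split a raw 64-bit MCi_STATUS value into fields."""

    def bit(n):
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF
    return {
        "valid": bit(63),             # VAL: the bank holds a logged error
        "overflow": bit(62),          # OVER: an earlier error was overwritten
        "uncorrected": bit(61),       # UC: hardware did not correct the error
        "error_enabled": bit(60),     # EN: reporting was enabled for this error
        "misc_valid": bit(59),        # MISCV: the MISC register has extra data
        "addr_valid": bit(58),        # ADDRV: the ADDR register has an address
        "context_corrupt": bit(57),   # PCC: processor state may be corrupt
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
        # Assumption: memory controller errors use the compound code
        # pattern 000F 0000 1MMM CCCC (F is a filtering bit, masked out).
        "memory_controller_error": (mca_code & 0xEF80) == 0x0080,
    }


if __name__ == "__main__":
    # Made-up status value for illustration, not from a real log.
    for name, value in decode_mci_status(0x8C00000000010090).items():
        print(f"{name}: {value}")
```

The two fields to look at first are "valid" and "uncorrected": a valid,
corrected memory controller error is the ECC "something got fixed"
case, while an uncorrected one is the case you really want to hear
about.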

So circling back to your very first post, you said you were using:

  Asus P7F-E (includes 6 3Gb/s SATA ports)

Oh dear, Asus. What kind of mission-critical environment uses this
hardware? :-) Let's see what the user manual has in it.

Section 4.4.2 has options related to the Northbridge (which I'm not
sure what it is in this case; the board supports Core iX CPUs which
have on-die MCH, so I'm not sure what this controls). All of the items
in this section of the manual are horribly documented, but ones that
catch my eye are:

* DRAM Margin Ranks (Enabled/Disabled)
* MRC Serial Debug Message Level (Disabled/Min/Max/Test)
* Memory ECC Function (Enabled/Disabled)
* Page Policy (Closed/Open)
* Adaptive Page (Disabled/Enabled)
* Data Scramble (Disabled/Enabled)
* Memory Thermal Throttling (Disabled/CLTT/OLTT)

I know what the 3rd and last items do, but not the rest.

There's also something on the Southbridge part of the manual which is
strange: something called "Energy Lake Feature". It defaults to
Disabled, with a comment "We do not recommend you enable this feature".
This is all I could find:

* Energy Lake technology introduces two main end-user features: the
  "Consumer Electronics" (CE)-like device power behavior, and
  maintaining system state and data integrity during power loss events.

* Allows you to configure Intel's Energy Lake power management
  technology. If you are running a Media Center you can install the
  Intel VIIV software to get the correct driver; otherwise disable the
  Energy Lake feature in BIOS (it relates purely to Intel's Quick
  Resume feature, which is generally useless).

Otherwise, I see no mention of MCA, PERR/SERR, or anything else that's
considered useful (by my standards). I see lots of server-esque options
like BIOS-level serial console, but the rest of the board is extremely
desktop-oriented, which is what Asus is known for.

> We have tried to buy quality hardware. At least, we didn't deliberately
> skimp (except to build our own box vs buy a big name brand pre-built zfs
> server).

No offence intended -- honestly -- but I question anyone who would buy
an Asus motherboard for a server. If I were sitting in a meeting room
with infrastructure engineers discussing what to buy and someone said
"We're considering Asus", I would say "This is a joke, right?" (Note
that for my home Windows workstations, I do use Asus motherboards.)

Sure, the motherboard might not even be the problem. But I'm just
saying, who knows what's going on here; I have to question everything.
You followed up with "we're starting to question the PIKE card", which
should in turn make you question exactly why you bought this hardware
to begin with.

My recommendation, while not wanting to spend zillions of bucks on
HP/Compaq or Dell hardware? Supermicro. I can't talk about their
storage HBAs, but many other people here can -- the results have been
hit-or-miss. I tend to stick solely with Intel ICHxx or ESBx on-board
controllers, which FreeBSD works wonderfully with.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |