From owner-freebsd-fs@FreeBSD.ORG Thu Mar 10 23:41:57 2011
Date: Thu, 10 Mar 2011 15:41:43 -0800
From: Jeremy Chadwick
To: Stephen McKay
Cc: freebsd-fs@freebsd.org
Subject: Re: Constant minor ZFS corruption
Message-ID: <20110310234143.GA9136@icarus.home.lan>
In-Reply-To: <201103102302.p2AN2hNB002016@dungeon.home>
References: <201103081425.p28EPQtM002115@dungeon.home>
 <201103091241.p29CfUM1003302@dungeon.home> <4D7788D9.50808@sentex.net>
 <201103102302.p2AN2hNB002016@dungeon.home>
User-Agent: Mutt/1.5.21 (2010-09-15)
List-Id: Filesystems

On Fri, Mar 11, 2011 at 09:02:43AM +1000, Stephen McKay wrote:
> On Wednesday, 9th March 2011, Mike Tancsa wrote:
>
> >On 3/9/2011 7:41 AM, Stephen McKay wrote:
> >
> >> Of the 12 disks, only 1 has been error-free.
> >> I've been doing this for about 10 days now and there is no pattern
> >> that I can see in the errors.
>
> >After adding a larger case for future expansion, we found the next day
> >we were seeing all sorts of random errors.
> >
> >Like:
> >
> >Mar 3 05:34:47 offsite kernel: ad1: FAILURE - WRITE_DMA48
> >status=51 error=10 LBA=2281852580
> >
> >and
> >
> >Mar 4 08:56:15 offsite kernel: siisch1: siis_timeout is 00040000 ss
> >04000000 rs 04000000 es 00000000 sts 801e2000 serr 00000000

Speaking strictly to Mike here:

I spent some time a while ago trying to figure out the NID_NOT_FOUND
error; it's something I wrote up back when I was contributing to the
Wiki. See the section "SATA disk troubleshooting":

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

So it could be that the LBA being accessed isn't within the permitted
valid range. I could be completely off my rocker, though; I'd need
someone much more familiar with the ATA-7 specification to state up
front what this bit actually defines.

Anyway, despite that, the controller is also reporting timeouts. What
you haven't shown is the exact model of Silicon Image controller you're
using. It matters: certain models of SI chipsets have very bad, nasty
bugs, while other models do not have these issues:

http://en.wikipedia.org/wiki/Silicon_Image#Product_alerts

> Our system does not report any driver errors or disk errors. We see
> checksum errors from ZFS (mostly in scrubs). It's like there's an
> invisible pixie sprinkling bad data on our disks while we sleep.

Speaking to Stephen:

With disk bit rot, your "system" (motherboard) won't report any errors.
The controller you're using won't report any errors. The disks also
won't report any errors. ZFS, however, *will* report checksum errors.

What if there's a bug in the FreeBSD driver you're using? What if, on
some rare occasion, it only writes 4095 bytes of the 4096 it needs to
write?
What if there's an off-by-one bug in the FreeBSD driver that randomly
corrupts a piece of data it intends to write to the disk? And what
about the firmware, which controls all the disk interaction?

There's also the possibility of some wonkiness in the memory controller
on your mainboard; maybe it's randomly corrupting something. ECC RAM
wouldn't necessarily detect this either. FreeBSD kmem/KVA, as I
understand it, is dedicated solely to the kernel and not to userland
(so a userland app might not sig11, for example). However, I would
expect the kernel to be freaking out randomly in other ways (e.g. the
system behaving oddly in ways not limited to ZFS or disk I/O).

You get the idea: the problem could be anywhere. Welcome to OS, system,
and hardware troubleshooting in 2011; glad to have you on the team. ;-)

You're going to need to spend a lot of time debugging this, and some of
it will absolutely involve downtime, unless you can afford to build a
complete, 100% identical replica system that can reproduce the problem.
If you can reproduce the problem on that system, awesome.

My advice would be to start (on the replica system) by replacing the
controller entirely. Use an on-board SATA controller, or invest in an
Areca or "something else". This will narrow the problem down to either
the controller, the controller firmware, or the FreeBSD driver. That
should help.

> >We narrowed it down to 2 problems. Failing / Marginal power supply and
> >bad SATA cables. After changing the power supply, we still had a few
> >disk errors.
>
> If either of these were the cause of our problem, we'd see errors
> logged, right? Not just invisible corruption?

Simple answer: no.

Long answer: I can't provide one, because I'm not an EE guy, so you'll
just have to trust me: problems caused by dirty power or "bad power"
are absolutely crazy.
Given how complex hardware is these days (numerous ASICs, circuitry
components, etc.), absolutely bizarre and weird things happen when a
device doesn't get the power it expects. That's about all I know, and
there's plenty of evidence on the net to back it up. I just wish I
could put more absolute faith into it, but since I don't understand
EE/power "stuff", it'll always be a mystery to me.

I could give you an example of a power-related problem I'm dealing with
at home that would probably blow your mind. Contact me off-list if you
want the story (every person I've given it to so far has gone "...what?
That makes absolutely no sense. Did you try...?" "Yes" "What about..."
"Yep" "...wow").

> We will probably swap the power supply and cables anyway soon, just to
> see what happens, but on other machines where cables or power was the
> problem I saw errors (just like yours) in the logs.

I imagine your controller uses some kind of multi-lane break-out cable.
It's possible that cable is bad. I hate to bring this up, because it's
really going out on a limb, but there's always the remote possibility
of interference/EMI causing "weird things" to happen with data flowing
across the cable. However, I STRONGLY doubt this: SMART attribute 199
(UDMA CRC Error Count) would absolutely be incrementing whenever that
occurred.

If you want to provide me with SMART stats (smartctl -a /dev/disk) for
each of your disks (please be sure to label them, and re-send "zpool
status" output so I can correlate the checksum errors with the disks),
I will be happy to review them for you.

> >After almost 5 days of uptime, no problems at all now. Not one error.
>
> Well, we've got something to aim for, eh? :-)

I sure hope so. :-)

Like you, I hate problems of this nature, and problems of this nature
are exactly why I started spending a *lot* of time, both at my job and
outside of work, studying disks, ATA/SATA, and storage a bit better.
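For what it's worth, here's roughly how I'd gather that in one go -- an
untested sh sketch, and the ada0..ada3 device names are just
placeholders for whatever your disks actually are:

```shell
#!/bin/sh
# Sketch: dump SMART attribute 199 (UDMA_CRC_Error_Count) for each disk
# so it can be lined up against the per-device CKSUM column in
# "zpool status". Device names below are placeholders -- substitute
# your own.
DISKS="ada0 ada1 ada2 ada3"

report_crc_errors() {
    # Parse "smartctl -A" output: attribute 199 counts data mangled on
    # the wire between controller and disk (bad/loose cables, EMI).
    awk '$1 == 199 { print "  UDMA CRC errors:", $NF }'
}

for disk in $DISKS; do
    echo "=== /dev/$disk ==="
    smartctl -A "/dev/$disk" 2>/dev/null | report_crc_errors
done

# Then correlate by hand with the checksum counters here:
zpool status -v 2>/dev/null || true
```

I'd run it once now and again after swapping cables; if attribute 199
is climbing on the same disks that show checksum errors, the cable/EMI
theory gets a lot more plausible.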
I honestly don't mean to sound like a braggart (despite how
direct/pompous I am, I honestly have a very small ego), but I've more
or less become the main guy at my workplace when it comes to
disk/storage problems. I just got done dealing with two separate cases
of desktop-grade SATA disks in our Citrix Netscaler products (which use
FreeBSD) spewing DMA errors right in the middle of OS upgrades (the
worst possible time for it to happen). I was able to work around the
problem with a combination of a sh script, dd, and smartmontools,
allowing the upgrades to complete and production traffic to resume. We
did RMAs on the disks/units later, since the turnaround time for
replacements was way outside the permitted maintenance window.
Networking owes me a case of beer.

-- 
| Jeremy Chadwick                                jdc@parodius.com |
| Parodius Networking                    http://www.parodius.com/ |
| UNIX Systems Administrator               Mountain View, CA, USA |
| Making life hard for others since 1977.          PGP 4BD6C0CB   |