Date: Tue, 8 Jan 2008 15:24:42 -0800
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: "Stephen M. Rumble" <stephen.rumble@utoronto.ca>
Cc: freebsd-stable@freebsd.org
Subject: Re: RELENG_7: zfs mirror causes ata timeout
Message-ID: <20080108232442.GA35068@eos.sc1.parodius.com>
In-Reply-To: <20080108172846.2lglrcvo0qsk88o0@webmail.utoronto.ca>
References: <20080108172846.2lglrcvo0qsk88o0@webmail.utoronto.ca>
On Tue, Jan 08, 2008 at 05:28:46PM -0500, Stephen M. Rumble wrote:
> I'm having a bit of trouble with a new machine running the latest
> RELENG_7 code. I have two 500GB WD Caviar GP disks on a mini-itx
> GM965-based board (MSI "fuzzy") running amd64 with 4GB of ram. The
> disks are:

Could be related to a PR that I submitted long ago, which was not specific
to ZFS -- instead, it appeared to be specific to the motherboard I was
using. There are also some tidbits posted by others which appeared to help
them, although performance was impacted:

http://www.freebsd.org/cgi/query-pr.cgi?pr=103435

Another related PR, which seems to indicate motherboard problems:

http://www.freebsd.org/cgi/query-pr.cgi?pr=93885

> ad4: 476940MB <WDC WD5000AACS-00ZUB0 01.01B01> at ata2-master SATA150
> ad6: 476940MB <WDC WD5000AACS-00ZUB0 01.01B01> at ata3-master SATA150
>
> I've tried different power supplies and cables. I've enabled and
> disabled spread spectrum clocking and tried both SATA300 and SATA150
> rates. I've also tried switching drives between ports so that what was
> ad4 is ad6 and what was ad6 is ad4. The problems persist, but seem to
> follow the same drive (ad6 originally, then ad4 when swapped). This
> seems to indicate a drive problem, but it works great on its own, even
> when exercising both disks simultaneously. SMART reports no problems
> and ZFS reports no issues when ad6 is used on its own outside of a zfs
> mirror. It seems like it's the drive, but it works fine when not in a
> mirror. I'm stumped. Any ideas?

Have you tried running long SMART tests (smartctl -t long) on both of
these drives, and likewise an offline test (smartctl -t offline)?
Statistics that are labelled "Offline" as their type won't get updated
until an offline test is performed. It's possible those statistics may
provide some answers, but no guarantees.

> The only interesting bit of evidence I could find is that when these
> errors do occur, smartctl reports an increase in the Start_Stop_Count
> field on ad6. ad4, which appears to work fine, doesn't demonstrate this
> and has a much lower value.

Start_Stop_Count indicates the drive is actually stopping and then
spinning back up (usually caused by a reset of some kind; the equivalent
of powering down and back up, but without the loss of power).

It's possible that your drive has actual problems -- this is supported by
the fact that the problem follows the disk (when moving the disk to
another SATA port). Tracking down the source of a problem like this
usually requires a lot of time, money, and trial-and-error. This is what
I'd go with:

1) See if there's a BIOS update. I know that at least in the case of
   Intel-manufactured boards, BIOS updates have solved weird problems
   like this in the past.

2) Try an Advanced RMA with Western Digital (which guarantees you get a
   brand new drive rather than chancing that they repair the one you
   send them) and see if a new drive helps.

3) Try replacing the motherboard with a different brand (non-MSI). I
   have nothing against MSI, but switching vendors usually rules out a
   vendor-specific h/w bug (e.g. something the vendor does in the BIOS
   or board engineering which is suspect). Try Asus or Gigabyte.
   Obviously this will cost money and will very likely set you back the
   cost of the motherboard you currently have, but it's a viable option
   since you've already tried replacing SATA cables.

I'm not sure why ZFS would cause something like this to happen vs. UFS.
I happen to run ZFS at home (the same machine as mentioned in PR 103435,
with the replaced motherboard, of course) doing very heavy disk I/O
across two disks, and I have never seen problems of this sort. That
doesn't mean there isn't a problem, just that I haven't encountered it
with ZFS.

My box at home is an Asus A8N-E w/ 2GB, running RELENG_7 i386. I don't
use any of the on-board "RAID" garbage; I use FreeBSD for it.
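For reference, the SMART checks suggested above might look something like
the following (a sketch, not from the original mail: the device names are
taken from this thread, and the attribute line in the snippet is invented
for illustration; smartctl comes from the sysutils/smartmontools port):

```shell
# Sketch only -- adjust device names for your system.
# Commands to kick off and read back the tests (run per suspect disk):
#   smartctl -t long /dev/ad6      # long self-test (can take hours)
#   smartctl -t offline /dev/ad6   # offline data collection; needed before
#                                  # attributes typed "Offline" will update
#   smartctl -l selftest /dev/ad6  # read back the self-test log

# To watch Start_Stop_Count climb after each timeout, pull attribute 4
# out of "smartctl -A" output. An invented sample line stands in here so
# the pipeline can be shown end to end:
sample='  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  57'
count=$(printf '%s\n' "$sample" | awk '$2 == "Start_Stop_Count" { print $NF }')
echo "Start_Stop_Count is now: $count"
```

Comparing that number before and after a timeout event would confirm
whether the drive really is spinning down and back up.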
Relevant SATA stuff:

atapci1: <nVidia nForce CK804 SATA300 controller> port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xd800-0xd80f mem 0xd3002000-0xd3002fff irq 23 at device 7.0 on pci0
atapci1: [ITHREAD]
ata2: <ATA channel 0> on atapci1
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci1
ata3: [ITHREAD]
atapci2: <nVidia nForce CK804 SATA300 controller> port 0x9e0-0x9e7,0xbe0-0xbe3,0x960-0x967,0xb60-0xb63,0xc400-0xc40f mem 0xd3001000-0xd3001fff irq 21 at device 8.0 on pci0
atapci2: [ITHREAD]
ata4: <ATA channel 0> on atapci2
ata4: [ITHREAD]
ata5: <ATA channel 1> on atapci2
ata5: [ITHREAD]
ad4: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata2-master SATA300
ad6: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata3-master SATA300
ad8: 190782MB <WDC WD2000JD-00HBB0 08.02D08> at ata4-master SATA150
ad10: 476940MB <Seagate ST3500630AS 3.AAE> at ata5-master SATA300

Disks ad4/ad6 are in a ZFS pool (RAID-0, not mirror), and ad8/ad10 are
UFS. All are on the same physical SATA controller, as you can see.

icarus# zpool status
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          ad4       ONLINE       0     0     0
          ad6       ONLINE       0     0     0

errors: No known data errors

icarus# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
storage                 928G    126G    802G    13%  ONLINE     -

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |