Date: Fri, 25 Jan 2008 13:30:55 -0800
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Joe Peterson <joe@skyrush.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Message-ID: <20080125213055.GA46500@eos.sc1.parodius.com>
In-Reply-To: <479A3764.6050800@skyrush.com>
References: <479A0731.6020405@skyrush.com> <20080125162940.GA38494@eos.sc1.parodius.com> <479A3764.6050800@skyrush.com>

On Fri, Jan 25, 2008 at 12:24:20PM -0700, Joe Peterson wrote:
> In my case, I am using only one disk (ad0) for FreeBSD, and I am only
> using one partition on this disk in my ZFS pool.  So, in this case,
> unfortunately, it's not possible to tell from the fact that only ad0 is
> listed that it is specific to this drive.

Ah ha.  Well, in your example below, you may only be using one drive for
FreeBSD (ad0), but you do have a 2nd drive (ad1) installed.  I would try
doing some I/O on /dev/ad1 to see if you can get the timeouts to occur
on that drive as well.  You don't have to do anything risky with ad1
either:

  dd if=/dev/ad1 of=/dev/null bs=64k

would probably suffice.

> Yep, I am also always skeptical of SMART reports.  That's one reason I
> am very interested in ZFS.  I don't trust the drive to be completely
> reliable, and the fact that ZFS does end-to-end data integrity is very
> intriguing.

I agree entirely -- and I also use ZFS myself (across two drives in a
RAID0-like fashion, with a completely separate drive which is used for
nightly backups of the ZFS pool).  I'm absolutely thrilled with it;
finally something clean, reliable, and simple -- something I've always
wanted in an LVM or LVM-like implementation.

> > * smartctl -a /dev/ad0
>
> OK, I've attached this to the end of this email.
>
> atapci0: <Intel ICH4 UDMA100 controller> port
> 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata0: [ITHREAD]
> ad0: 476940MB <Seagate ST3500630A 3.AAE> at ata0-master UDMA100

The smartctl output for /dev/ad0 looks good, minus the one uncorrected
sector.  I'm ignoring that since it's proof that the drive knew of it
and remapped it.  If that number starts incrementing over time, though,
replace the drive ASAP, of course.

The atacontrol cap output looks fine too; nothing wonky, and the LBA
capabilities look fine.

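Watching whether a SMART counter "starts incrementing over time" can be done by saving smartctl output at two points in time and comparing the raw values. A minimal Python sketch of that comparison follows. Assumptions: the function names, the list of watched attributes, and the later reading are all hypothetical; only the first attribute row is taken from this thread.

```python
# Sketch: compare raw SMART counters between two saved `smartctl -a` outputs.
# Function names, WATCHED, and LATER_ROW are illustrative assumptions.

# Attribute row as quoted in this thread (raw value 0).
ROW_FROM_THREAD = (
    "198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0"
)
# Hypothetical later reading, invented only to exercise the comparison.
LATER_ROW = (
    "198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 3"
)

# Counters to watch; this selection is an assumption, not from the thread.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID;
        # the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def grown_counters(before, after, watch=WATCHED):
    """Return {name: (old, new)} for watched counters whose raw value rose."""
    grown = {}
    for name in watch:
        old, new = before.get(name), after.get(name)
        if old is not None and new is not None and new > old:
            grown[name] = (old, new)
    return grown


if __name__ == "__main__":
    before = parse_smart_attributes(ROW_FROM_THREAD)
    after = parse_smart_attributes(LATER_ROW)
    for name, (old, new) in grown_counters(before, after).items():
        print(f"{name}: raw value rose from {old} to {new}")
```

In practice the two inputs would be full saved outputs (for example, one per day) rather than single rows; lines that are not attribute rows are skipped by the column check.
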
The controller is nothing out of the ordinary; it's reliable under
FreeBSD (I've had many a motherboard which used it).  Of course I
haven't used an ICH4 since FreeBSD 3.x, and the ATA layer has changed
substantially, numerous times.

> {regarding -t short and -t long}
> Also, none of the numbers that were zero incremented, esp:
>
> 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
>
> Also, no more errors were reported in the system log during the self-tests.

That seems to indicate that the drive considers itself healthy.

Another test I could recommend at this point would be one that would
require a few hours of downtime: download Seagate's SeaTools (will
require a CD burner or floppies) and consider doing both "quick" and
"long" scans.  "Quick" checks some of the stuff we've looked at here,
but it also looks at some vendor-specific stuff within the drive.
"Long" will scan every block on the disk for errors (and will not
destroy data).

> OK, I started a scrub, and it will take some more time to complete...
> But I get the following with status.  Could this be due to the timeouts
> and failures?  I suspect so, so maybe this is not surprising.

It depends on whether or not you saw more timeouts and cache errors spit
out by the kernel while "zpool scrub" ran.  If so, then yes, I would
definitely say they're related.

> I'd also guess that this doesn't necessarily point to the drive, but
> anything in the chain of events...  I do not have a mirror or RAID-Z,
> so I guess the reason there was "no data loss" (yet) is because the
> checksum passed, and maybe it just had to retry...?

I'm still new to ZFS myself, so I don't have an answer for you.  Your
conclusion is the same thing I'd conclude, though.

> I've been using this same motherboard/BIOS for a long time (as well as
> this drive), so no changes have happened to the HW recently.
> The BIOS is the newest available, I believe (it's a Tyan Trinity S2099,
> so it's a few years old).

I'd say the BIOS is probably not responsible at this point; I'd expect
other weird things to be going on with the system if the BIOS was broken
in some way (or possibly bit rot in the flash).  It's going to be
difficult to determine whether something on the mainboard has decided to
start failing (some transistor within the ICH4, etc...) though.  :-(

> I'm using regular 80-conductor ATA cables.  Also, these seem to have
> been working fine for quite a while now.  But, yes, I have also
> witnessed bad cable issues on older systems in the past.  I certainly
> could try a new cable and see if it helps.

I'd try that for sure.  It's just one more thing to rule out.

> > * Getting a larger power supply (usually when lots of disks are involved)
>
> I only have two drives, so I think the PS has enough capacity in my case.

Agreed; even a 350W PSU should handle 2 disks without a problem.

Here's something to ponder:

The LBAs being reported as having errors are scattered all over.  They
aren't lumped together (lumping is usually the sign of part of a platter
going bad); instead, they're all over the drive.  This would indicate
either cable problems, motherboard/southbridge problems, or possibly
something on the drive PCB itself going bad.

The drive PCB going bad is a sad reality -- but sometimes you can
replace it with the PCB from a spare drive that's known to be good; a
Torx screwdriver is all you need in most cases.  I've seen a lot of old
Seagate SCSI drives which started exhibiting random I/O errors and were
fixed simply by replacing the PCB.  Bad cache/RAM on the PCB is my
guess.

There's no sign of your drive actually spinning down or powering down in
any way (as you probably know, some drives will actually reset
themselves and re-spin up when encountering errors where the drive gets
"stuck" or is wedged in some way.
I don't know if this is a watchdog on the drive, or if an error
condition just causes the drive to reset), so that's ruled out too.

My recommendation would be to, in this order:

* Replace the 80-conductor ATA cable and see if the problem continues.
* Download SeaTools and let it do both quick and long scans.  If the
  problem happens during either scan, then it's safe to say it's either
  a drive or MB/controller problem and FreeBSD isn't the problem.
* Worst-case scenario: purchase an identical drive and see if the
  problem continues with the new drive.  If it does, that would rule
  out the disk being the problem.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
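
The observation that the failing LBAs are scattered rather than lumped together can be checked mechanically by dividing the disk into equal regions and counting how many of them contain an error. A rough Python sketch follows. Assumptions: the kernel messages carry an `LBA=<number>` field; the sample log lines and their LBA values are made up for illustration; the function names are hypothetical; the sector count is only an approximation derived from the 476940MB size in the dmesg output, assuming 512-byte sectors.

```python
# Sketch: decide whether logged error LBAs are clustered or spread out.
# SAMPLE_LOG is invented for illustration; it is not from the thread.
import re

# Approximate sector count: 476940 MB * 2048 sectors per MB (512-byte sectors).
APPROX_TOTAL_SECTORS = 476940 * 2048

SAMPLE_LOG = """\
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=12000000
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=310500000
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=640250000
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=955000000
"""

LBA_RE = re.compile(r"\bLBA=(\d+)")


def extract_lbas(log_text):
    """Return every LBA=<n> value found in the log text, in order."""
    return [int(m.group(1)) for m in LBA_RE.finditer(log_text)]


def regions_hit(lbas, total_sectors, regions=20):
    """Return sorted indices of the equal-sized regions that contain an LBA."""
    size = -(-total_sectors // regions)  # ceiling division
    return sorted({lba // size for lba in lbas if 0 <= lba < total_sectors})


if __name__ == "__main__":
    hit = regions_hit(extract_lbas(SAMPLE_LOG), APPROX_TOTAL_SECTORS)
    print(f"errors fall in {len(hit)} of 20 regions: {hit}")
```

Errors confined to one or two adjacent regions would fit a localized media defect; errors spread across many regions fit the cable, controller, or PCB explanations better. The choice of 20 regions is arbitrary.
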