Date: Sat, 18 Oct 2008 14:25:43 -0700
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Kristian Rooke <kristianr@gmail.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: SETFEATURES SET TRANSFER MODE taskqueue timeout.. Error occuring constantly.. Please help!!
Message-ID: <20081018212543.GA58536@icarus.home.lan>
In-Reply-To: <f9ccec500810180932k5fe192e1uc360afe41ae8581f@mail.gmail.com>
References: <f9ccec500810180100j7969b1eeucb6e974f37b05961@mail.gmail.com> <20081018102403.GA46124@icarus.home.lan> <f9ccec500810180932k5fe192e1uc360afe41ae8581f@mail.gmail.com>

On Sun, Oct 19, 2008 at 03:32:29AM +1100, Kristian Rooke wrote:
> Thanks for the quick response!
>
> Please see requested output below:

Cool, thanks.  One thing I forgot to ask for was "vmstat -i" output.
For now, let's break it down for ease of understanding:

FreeBSD 7.0-RELEASE i386, built February 2008.

atapci0: nVidia nForce MCP73 ATA133 controller -- IRQ 14
atapci1: Silicon Image 0680 ATA133 controller -- IRQ 16

ata0: attached to atapci0
ata1: attached to atapci0
ata2: attached to atapci1
ata3: attached to atapci1

ad0: <Seagate ST380011A 3.06> at ata0-master PIO4
ad4: <Seagate ST3320620A 3.AAF> at ata2-master PIO4
ad5: <Seagate ST3320620A 3.AAF> at ata2-slave PIO4
ad6: <Seagate ST3750640A 3.AAE> at ata3-master PIO4
ad7: <Seagate ST3320620A 3.AAD> at ata3-slave PIO4

ATA errors are reported for disks ad4, ad5, ad6, and ad7.  ad0 appears
to be error-free.

First and foremost: there are known problems with Silicon Image
controllers on all operating systems (Windows, Linux, and FreeBSD in
particular); they are known for causing data loss and other sporadic
issues.  This is at least confirmed on their SATA controllers, and I've
become quite the "pick something else" advocate when it comes to their
stuff.  However: I've no idea about their PATA controllers.

Secondly, so far there isn't any evidence that the ad0 disk, which uses
the nVidia controller, has any problem -- all the disks having problems
are on the Silicon Image controller.  That is a very key piece of
information here.  If, when you're writing data to, say, the ad4 disk,
you start to see errors on all disks (ad4 through ad7), then what this
probably means is the controller has locked up or is behaving badly.
This adds further evidence that the Silicon Image controller may be at
fault here.

Thirdly, you said the system requires a hard reset to get things back
in working order.
Sometimes this can be induced by a power supply that isn't providing
decent/proper voltages, or is being overloaded, particularly during
heavy disk I/O (drawing more power in some cases).  It might be good to
check your voltages inside of your system BIOS, write them down, and
type them in here.  FreeBSD does not provide a decent set of tools for
monitoring this stuff inside the OS (yet; I'm working on it, mainly for
server boards.  I do what I can...)  But keep in mind that a controller
locking up hard could also require a hard reset (pressing reset on the
front of the PC) -- a soft reset (Ctrl-Alt-Del) would probably work,
except much of the running kernel is spinning hard trying to deal with
ATA problems.

Fourthly, I see a "<some output omitted>" line in your original dmesg.
Can you provide that output?  It's important -- sometimes people have
seen issues where their ATA controller shows problems, but it turns out
to be an IRQ sharing or device compatibility problem with another
device (e.g. their board was showing ATA errors, but at the exact same
time, also showing NIC watchdog timeouts or other anomalies).  They
omitted the dmesg data thinking it had nothing to do with the problem,
when in fact it helps determine if the issue is truly with one piece or
the entire system.

Next, let's take a look at your SMART output, which tells a tale of
something very very bad:

Disk ad4 has a good temperature, and no sign of bad blocks/sectors.
The disk has been powered on for a total of 7799 hours.  There was a
CRC error detected when attempting to set specific capabilities on the
device.  The error occurred at LBA 0 on the disk, which is completely
bizarre, but the SMART error log might just say LBA 0 to indicate "no
LBA was being accessed" (e.g. the error was purely during the mode
setting attempts).
However, the SMART error log "wraps" its timestamps at 49.710 days
(roughly every 1193 hours), so it's going to be difficult to determine
if that SMART error log entry was from long ago, or was fairly recent.
Looking at other disks might help, so let's continue.

Disk ad5 has an excellent temperature, and no sign of bad blocks/sectors
either.  The disk has been powered on for a total of 11956 hours.  No
errors were found in the SMART log.

Disk ad6 has a good temperature, and no sign of bad blocks/sectors.  No
errors were found in the SMART log.

Disk ad7 has an excellent temperature, and no sign of bad blocks/sectors
either.  The disk has been powered on for a total of 12512 hours.
However, much like disk ad4, this disk also witnessed a CRC error when
attempting to either do a DMA read operation or when setting
capabilities on the device.  I'm prone to believe it's when setting
capabilities, because LBA 0 is also seen here, which isn't a likely
LBA.  This error happened at the 6310 hour mark, which was about half
of its lifetime ago.

All of this is somewhat of a mystery.  Disk ad4 is on a completely
different physical cable than disk ad7, so that *could* rule out
cabling problems.  The errors seen are only when setting device
capabilities (making an educated guess, but I'm not 100% positive), not
when actually accessing data on the disks.  Heck, I'm not even sure the
errors in the SMART log are accurate, as the disks have been powered on
for quite some time after the supposed errors occurred.  Power draw
could also explain this, ditto with the voltage possibility.

I would start by doing 3 easy things:

1) Re-enable DMA mode; it's obviously not the cause of your problems
   since PIO mode shows the same problem for you,
2) Replace both sets of PATA cables with brand new ones.  There's no
   evidence this is the problem, but changing these is easy and cheap.
   If it doesn't solve the problem, then you're one step closer to
   tracking it down,
3) Get voltages from the BIOS and provide them here.  Again, this won't
   be an accurate representation of the system under load, but it's the
   best we've got right now.

Assuming the problem continues after #2, and the voltages shown in #3
look good, this is what I'd do for the next step:

Buy a PCI, PCI-X (if so, make sure it's backwards-compatible with
32-bit 33MHz PCI slots, unless you actually have a PCI-X slot!) or PCI
Express PATA controller -- specifically, one that does not use a
Silicon Image chip.  This may be hard to accomplish since PATA is a
dying interface (and good riddance!).

I will also stress this in capitals, just to make it clear: DO NOT BUY
A SATA CONTROLLER THEN USE PATA-TO-SATA ADAPTERS.  Those adapters will
cause you even more problems.  If you go the SATA route, buy actual SATA
disks and recycle or sell your old PATA ones.

That said, Highpoint and Promise both make PATA controllers -- not to
mention, I even see that you've tried to load the hptrr(4) driver on
that system!  :-)  Additionally, DO NOT use the "RAID" features of these
cards (if you end up buying one that has such); just plug the disks in
and use them in a JBOD fashion.  You might find that the disk numbers
(e.g. ad4) change on you when doing this; that's to be expected.

Others might recommend that you should try replacing the PSU before
buying a new PATA controller, but I have doubts the problem is with the
PSU; I would expect more odd/awkward problems if the PSU was to blame.
If you do try a different PSU, go with one that does 450W or more.  You
DO NOT need a l33t-g4m3-d00dz-omgwtfbbq!! 850-1000W PSU; most of the
power draw for hard disks happens during power-on, when the disks have
to spin up, not once they're already spinning.

Hope this helps, and good luck!
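
P.S. A rough sketch of the timestamp wrap mentioned above, in case the
arithmetic helps.  It assumes the wrapped value is a 32-bit millisecond
counter, which is what a 49.710-day period implies (roughly 1193 hours).
The function names and the sample input are hypothetical and purely
illustrative; they are not taken from your smartctl output.

```python
# Sketch only: assumes the SMART error-log timestamp is a 32-bit
# millisecond counter, so it wraps after 2**32 ms.
WRAP_MS = 2 ** 32


def wrap_period_hours():
    """Length of one wrap period, in hours."""
    return WRAP_MS / 1000 / 3600


def wrap_period_days():
    """Length of one wrap period, in days."""
    return wrap_period_hours() / 24


def candidate_power_on_hours(wrapped_hours, total_power_on_hours):
    """List every power-on hour mark that would yield the same wrapped
    timestamp, up to the disk's total power-on time.  This is why a
    wrapped timestamp alone cannot say when the error happened."""
    period = wrap_period_hours()
    candidates = []
    t = wrapped_hours % period
    while t <= total_power_on_hours:
        candidates.append(round(t, 1))
        t += period
    return candidates


if __name__ == "__main__":
    print("wrap period: %.3f days, %.3f hours"
          % (wrap_period_days(), wrap_period_hours()))
    # Hypothetical wrapped timestamp of 100 hours on a disk with
    # 7799 power-on hours (the ad4 total).
    print(candidate_power_on_hours(100.0, 7799.0))
```

With a hypothetical wrapped value like the one above, several different
points in the disk's lifetime are equally plausible, which is the reason
the entry can't be dated from that field alone.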

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |