Date: Tue, 5 Aug 2008 20:17:51 -0700
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Sebastiaan van Erk <sebster@sebster.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: Stable SATA pci card for FreeBSD 6.x/7.0
Message-ID: <20080806031751.GA33798@eos.sc1.parodius.com>
In-Reply-To: <48988904.80509@sebster.com>
References: <48982B58.4000406@sebster.com> <20080805121632.GA88406@eos.sc1.parodius.com> <48984BF1.60805@sebster.com> <20080805150301.GA94198@eos.sc1.parodius.com> <48988904.80509@sebster.com>

On Tue, Aug 05, 2008 at 07:08:20PM +0200, Sebastiaan van Erk wrote:
> Jeremy Chadwick wrote:
>> On Tue, Aug 05, 2008 at 02:47:45PM +0200, Sebastiaan van Erk wrote:
>>> Hi,
>>>
>>> Thanks for the reply.
>>>
>>> Jeremy Chadwick wrote:
>>>> Yes, most of the Silicon Image ICs I've read about have odd driver
>>>> problems or general issues (even under Windows). The system rebooting
>>>> is an odd one; you sure your PSU can handle two disks?
>>> Well, I've got a 450W Asus PSU in there, but I've also got 6 hard
>>> disks and 1 dvd-rom drive (mostly inactive) in there. The hard disks
>>> are mostly 250/300GB but the two new ones are 1TB SATA drives. But
>>> the 450W should easily be enough, shouldn't it?
>>
>> Without getting into semantics, a 450W PSU may be on the light side for
>> 6 disks. I'm fairly amazed you're able to power up that machine without
>> disk errors or other problems during POST. You'll be having 6 disks
>> spin up all simultaneously -- and spin-up is when disks draw the most
>> power, and possibly during normal operation.
>>
>> If you have a different (or larger) PSU, I would recommend trying that
>> to see if it addresses your problem. A PSU which isn't providing enough
>> power will cause the disks to occasionally disconnect from the bus, or
>> the machine sporadically lock up, reboot (power-cycle), or other odd
>> things.
>
> Unfortunately I don't have a larger PSU lying around, but I could buy
> one; though I'd like to try some other stuff first because I've had 6
> disks in my PC before without any problems.

See the very bottom of my mail. I don't believe the PSU is the problem,
after reviewing your SMART statistics.

> <...parts of thread cut...>
> My other (on-board) SATA controller is a VIA controller; and I've never
> had any problems with it (although the hardware raid messed up once a
> year or 2 ago, and since then I've been using software raid without any
> issues).

Okay, so you've got an onboard VIA (VT6410) SATA controller, an onboard
VIA IDE controller, and a PCI SATA controller.

I'd still like to know which disks are attached to what controller, and
if any of the devices are sharing IRQs. Can you provide the output from
the following two commands?

  dmesg | egrep 'atapci|(ad|ata)[0-9]+'
  vmstat -i

I'm just trying to narrow stuff down.

>> Your recommended method of troubleshooting (swapping the 250G for the
>> 1TB) is a good idea. But hear me loud and clear: just because you
>> switch the disks and the problem disappears for a few hours doesn't mean
>> it's gone. There have been **many** people who have shown up on the
>> mailing lists stating "I did <X thing> and now it works!", only to find
>> that a week later it *didn't* fix the problem.
>
> Yes, I don't really expect it to solve the problem, but was thinking
> that at least I could try and stress test the known working disks on the
> controller and try to see if it's the controller that's the problem or
> the disks (or something else). I've been able to reproduce the crashes
> pretty well by just doing a lot of disk IO on the 1TB disks only (so the
> other disks were pretty idle during the tests).

It's interesting that the disks which are giving you trouble are Samsung
disks. There's some history here which you should be made aware of:

In July, Daniel Eriksson reported data corruption occurring with his
nVidia MCP55 chipset when 1TB Samsung disks were attached to it. The
same disks on another controller performed fine. The corruption was
being detected by ZFS as checksum errors. (UFS/UFS2 won't detect this
sort of thing, unless the corruption is occurring somewhere within the
filesystem tables.)

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043427.html

Soren Schmidt (ata(4) author) replied that there are some nVidia
chipset-related fixes for ATA in -CURRENT, and provided a patch.
Daniel reported that the patch made absolutely no difference:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043434.html

Daniel also tried using a firmware patch for his Samsung disks, which
limits the SATA speed to SATA150, but the speed was still negotiated as
SATA300 (indicating the vendor's own f/w patch is broken, or FreeBSD
does not play well with it). The f/w patch didn't fix his problem
either:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043432.html

zbeeble@gmail.com reported using his MCP55 controller without any
problem -- as long as he didn't use Samsung disks. He stated that he
believes Samsung disks are PATA disks that use a PATA-to-SATA adapter
inside of the drive, leading to problems (and yes, those adapters are
known to cause all sorts of mayhem):

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043485.html

I'm not sure what became of the thread; Daniel never provided a
post-mortem. I'm left to believe he probably took zbeeble@gmail.com's
advice and switched to another disk vendor.

> <...parts of thread cut...>
> <...smartctl output for both Samsung disks...>

Thanks for upgrading to 5.38. All the SMART statistics for these disks
look okay.

Can you run some SMART tests on the disks? You can run these tests
while the disks are in use (but I/O will make the test take longer to
complete):

  smartctl -t short /dev/ad4
  smartctl -t short /dev/ad6

Then you'll need to look at the SMART self-test log, as well as the
SMART error log, to see if anything is returned. Make sure the tests
have completed (the Status field should be "Completed without error",
unless an error was found of course):

  smartctl -a /dev/ad4
  smartctl -a /dev/ad6

If nothing is found, try a different test (also safe to run during
operation; don't let the word "offline" scare you), and look at the
logs once more.
This test may take some time, though:

  smartctl -t offline /dev/ad4
  smartctl -t offline /dev/ad6

At this point, I'm inclined to believe the issue is specific to those
Samsung disks. I do not believe your PSU is a problem; the SMART
statistics would be showing a higher number of power-cycles if the disks
were losing power.

Worth noting (about Samsung disks) is that smartctl has options to work
around 3 different firmware bugs. The bugs are SMART
statistics-related, but those kinds of mistakes don't give me "warm
fuzzies". Be wary. :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
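
The SMART test sequence described in this mail (start a self-test on each
disk, then read back the self-test log and the error log) could be wrapped
in a small sh helper along the following lines. This is only a sketch: the
function names and the SMARTCTL and DISKS variables are made up for
illustration and are not part of smartmontools, and the device names are
simply the ones used in the mail. It reads the two logs with
`smartctl -l selftest` and `smartctl -l error` rather than the full
`smartctl -a` output shown in the mail.

```shell
#!/bin/sh
# Hypothetical helper, not part of smartmontools.
# SMARTCTL and DISKS are illustrative variables; override them as needed.
SMARTCTL="${SMARTCTL:-smartctl}"
DISKS="${DISKS:-/dev/ad4 /dev/ad6}"

# Start a SMART self-test of the given type (short, offline, ...) on
# every disk in DISKS. Stops at the first disk where smartctl fails.
run_selftests() {
    test_type="$1"
    for disk in $DISKS; do
        $SMARTCTL -t "$test_type" "$disk" || return 1
    done
}

# Print the self-test log and the error log for every disk in DISKS.
# Only meaningful once the tests started above have finished.
show_logs() {
    for disk in $DISKS; do
        echo "=== $disk ==="
        $SMARTCTL -l selftest "$disk"
        $SMARTCTL -l error "$disk"
    done
}
```

Typical use would be to call `run_selftests short`, wait for the run time
that smartctl reports when it starts the test, and then call `show_logs`;
the same two steps can be repeated with `run_selftests offline`.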