Date: Tue, 1 Jul 2008 05:48:06 -0700
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Jonas Lund <whizzter@gmail.com>
Cc: Danny Carroll <fbsd@dannysplace.net>, freebsd-hardware@freebsd.org
Subject: Re: new server motherboard with SATA II
Message-ID: <20080701124806.GA68799@eos.sc1.parodius.com>
In-Reply-To: <436c7eda0807010246u4c22b32bic67bf06db1728583@mail.gmail.com>
References: <486450DB.4000907@dannysplace.net> <20080627040545.GA21856@eos.sc1.parodius.com> <436c7eda0807010246u4c22b32bic67bf06db1728583@mail.gmail.com>

On Tue, Jul 01, 2008 at 11:46:35AM +0200, Jonas Lund wrote:
> > Fourth, because you'll likely have multiple disks in a ZFS zpool, you
> > should probably be aware of the problem that haunts some users from time
> > to time (re: DMA errors).
> >
> > http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>
> Reading that page i recognized the DMA timeouts from my last disk
> crash (Running a small server for various dev work).
>
> Anyhow after this last crash that did turn out a tad expensive(pro
> disk recov) i decided to put up a small bit of security by using
> raid1. To be able to prepare myself for the inevitable problems i've
> setup Raid and SMART monitoring.
>
> Now your wiki says that the disks lie about SMART data (ok ata bashing
> is trendy but regardless), Any info/db about what goes for various
> vendors in this regard?

No, the Wiki does not say that disks lie. It says that it's entirely up
to the vendor to implement SMART however they desire; they do not have
to increment statistics if they don't want to, and some only update
statistics when offline SMART tests (short/long) are performed (though
those are labelled as requiring such).

There isn't an easy way of explaining the below, so I'll be verbose.

A SATA disk that comes straight out of the factory has a list of blocks
on it which are marked "free for reallocation" -- meaning, when a bad
block is encountered, assuming the disk can work around the problem,
that block will be used and removed from the list. The list is not
user-maintainable, and unless the disk vendor implements (and documents)
a custom ATA command that allows a driver to get that list, there is no
way to get any information about it.

This happens transparently -- the OS is not informed, and in most cases,
SMART statistics are also not updated to reflect such reallocations.
After that entire list has been exhausted, *that* is when SMART stats
begin to get updated. But this is entirely up to the vendor to decide.
Some may choose to increment certain SMART attributes even when the
"free list" has an entry removed from it.

To make matters worse (and more complex), the above can also differ
between models of disks from the same vendor; it all depends on whoever
at the company is writing the drive firmware.

It would be fairly difficult to track every vendor, disk model, and
firmware version to determine who adheres to what method. In fact,
firmware version isn't exactly an accurate way to determine this either.
I can refer you to a thread where Western Digital was found to be
reporting temperature statistics incorrectly in some models of drives
(either their firmware was broken, the thermistor vendor changed
silently (I doubt it), or something at the fab wasn't soldering
something correctly). Customers found that if you reported the problem
to Western Digital, they'd recommend an RMA, and you'd get back a new
drive of the same size/model which would behave properly.

I had two of these drives. I sent them off for RMA. What I got back
were two brand new drives, same model, same revision, same size, same
country of origin, and same firmware version string -- but the
temperature problem was completely gone. I'm of the opinion the
firmware had a bug, but whoever fixed it did not bother to increment
the version number.

Secondly, I am in no way, shape, or form "ATA bashing". SCSI has a
better overall protocol (design and transport), but it's (unjustifiably)
more expensive, and remains such even after all these years. I actually
*like* SATA, and SAS as well.

The point of my Wiki page is to document known issues with FreeBSD's ATA
layer, and provide some detail for administrators who aren't sure if
it's FreeBSD or their disk which is on the fritz. In my experience with
the DMA errors, the disk is usually fine. In cases where the disk
isn't, SMART has been a good way to determine if something happened,
but it's not a guaranteed solution.
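
As a rough illustration of using SMART "to determine if something
happened", here is a minimal Python sketch that scans attribute output
in the layout printed by smartmontools' `smartctl -A`. The SAMPLE text
is made up for the example (it is not from a real drive), the function
names parse_attributes and suspicious are hypothetical, and the watched
attribute names are common ones that a given vendor may or may not
implement. Note the UPDATED column: "Offline" marks attributes that are
only refreshed by offline tests, as described above.

```python
# Illustrative sample only: the layout mimics `smartctl -A` output, but the
# values are invented for this example and do not come from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
194 Temperature_Celsius     0x0022   038   045   000    Old_age   Always       -       38 (Min/Max 20/45)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
"""

# Assumption: these are the attributes worth watching for reallocation
# activity. Names and semantics are vendor-specific.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_attributes(text):
    """Return {attribute_name: {"id", "updated", "raw"}} for each table row."""
    rows = {}
    for line in text.splitlines():
        # Ten columns; the last (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip the header and any non-attribute lines
        rows[fields[1]] = {
            "id": int(fields[0]),
            "updated": fields[7],
            "raw": fields[9].strip(),
        }
    return rows


def suspicious(rows):
    """List (name, raw_count, updated) for watched attributes with a nonzero raw value."""
    hits = []
    for name in sorted(WATCHED):
        row = rows.get(name)
        if row is None:
            continue  # the vendor may simply not implement this attribute
        # Assumes the raw value starts with a plain integer; not every
        # vendor formats it that way.
        count = int(row["raw"].split()[0])
        if count > 0:
            hits.append((name, count, row["updated"]))
    return hits


if __name__ == "__main__":
    for name, count, updated in suspicious(parse_attributes(SAMPLE)):
        print(f"{name}: raw={count} (updated: {updated})")
```

On a real system the text would come from something like
`smartctl -A /dev/ad4` (the device name here is only an example). A
clean result does not prove the disk is healthy, since the vendor
decides when, or whether, these counters change.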

Footnote: I hope someone technical can expand on what the IDNF bit does
in read/write 48-bit ATA requests, however -- the ATA-7 specification
seems to imply it only gets set when an invalid LBA is submitted to the
disk, which would imply a FreeBSD problem, and may explain everything.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |