Date: Fri, 18 Jun 2010 10:42:08 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Matthew Lear <matt@bubblegen.co.uk>
Cc: freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID: <20100618174208.GA47470@icarus.home.lan>
In-Reply-To: <1276876031.7519.39.camel@almscliff.bubblegen.co.uk>
References: <1276844904.7519.19.camel@almscliff.bubblegen.co.uk>
 <20100618082127.GA34578@icarus.home.lan>
 <1276876031.7519.39.camel@almscliff.bubblegen.co.uk>
On Fri, Jun 18, 2010 at 04:47:11PM +0100, Matthew Lear wrote:
> Hello Jeremy,
> Thanks very much for the feedback.
>
> [snip]
> > Could you please provide the full output from "smartctl -a /dev/ad0"
> > here? Your drive may be completely fine and you may not have to swap it
> > at all; hard to say.
>
> Sure. See below:
> {snip}

Your SMART statistics look completely OK. Nothing there indicates any
write failures or other problems. I'll explain near the end of this
mail how to test a range of LBAs "just in case".

I'll take a moment to point out that the error previously seen was a
timeout during a write transaction (WRITE_DMA48). Recap:

> > > ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
> > > ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=395032335
> > > ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode

The status codes shown (status=51 and error=10) are hexadecimal. I'm
pointing this out because they aren't preceded by '0x' or '$', and it
clarifies my next point:

NID_NOT_FOUND (bit 4 set in the ATA error field) is referred to as IDNF
per ATA6-ACS specification and onward, so I'll refer to it as that.
(I've always wondered why FreeBSD calls this NID_NOT_FOUND; IDNF stands
for ID Not Found, so what's with the extra "N"? I've always felt this
is a typo...)

Using the ATA8-ACS specification working draft (2007/05/21), since it's
more recent, we see the following:

Section 6.2 - Error field
Section 6.2.4 - ID Not Found (IDNF) bit

  Error bit 4. The IDNF bit shall be set to one if a user-accessible
  address was not found. The IDNF bit shall be set to one if an address
  outside of the range of user-accessible addresses is requested when
  command aborted is not returned (see 4.11.3 and 6.2.1).
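
To make the hexadecimal point concrete, here is a minimal Python sketch
that decodes status=51 and error=10 into bit names. The bit names follow
the traditional ATA register layout and are assumptions for
illustration (newer specification drafts rename or obsolete some of
them); the table and function names are hypothetical too.

```python
# Bit position -> name, following the traditional ATA register layout.
# These names are an assumption for illustration; check them against the
# specification draft you are reading.
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC",
               3: "DRQ", 2: "CORR", 1: "IDX", 0: "ERR"}
ERROR_BITS = {7: "ICRC", 6: "UNC", 5: "MC", 4: "IDNF",
              3: "MCR", 2: "ABRT", 1: "NM", 0: "BIT0"}


def decode(value_hex, names):
    """Return the names of the bits set in a hex string such as '51'."""
    value = int(value_hex, 16)  # the kernel prints these without '0x'
    return [names[bit] for bit in sorted(names, reverse=True)
            if value & (1 << bit)]


status = decode("51", STATUS_BITS)  # from "status=51"
error = decode("10", ERROR_BITS)    # from "error=10"
print("status:", status)
print("error: ", error)
```

Read as hexadecimal, 51 has bits 6, 4 and 0 set, which lines up with
the READY, DSC and ERROR flags the kernel printed, and 10 has only
bit 4 set, which is the IDNF bit quoted above.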
Section 4.11 - Host Protected Area (HPA) feature set
Section 4.11.3 - 28-bit and 48-bit HPA commands

  Any read or write command to an address above the maximum address
  specified by the SET MAX ADDRESS or SET MAX ADDRESS EXT command shall
  cause command completion with the IDNF bit set to one and ERR set to
  one, or command aborted.

There's no definition of what "address" means in 6.2.4, but the most
logical (pun intended) guess is an LBA.

This error is returned by the disk (i.e. it is not a controller-induced
error). I've mentioned this problem in the past:

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

I've always read IDNF to mean "OS requested access (read or write) to
an LBA which is out of bounds", where "out of bounds" means "not between
0 and <last LBA>". How exactly is that possible?

Alexander, do you have any familiarity with this error code per ATA
spec?

Matthew, can you provide output from "atacontrol cap ad0"? Thanks.

Now regarding the LBA tests -- "smartctl -t select,start-end" will do
the trick. start should be a starting LBA, end should be an ending LBA.
The OS claims that LBA 395032335 is what was requested to be accessed
when the failure happened, so I would recommend picking start/end ranges
around that area. Remember that a single LBA addresses just one sector,
which is a tiny fraction of a disk of today's size, so it's wise to pick
a very large range of LBAs. I would recommend this in your case:

  smartctl -t select,390000000-410000000 /dev/ad0

I would highly recommend running this while the disk is otherwise idle;
concurrent I/O won't hurt anything, it'll just delay the scan.
"smartctl -a" will show the state of things in the "SMART Selective
self-test log" at the bottom, or somewhere else within the output
(depends on the drive).

This should, in my opinion, rule out whether or not there's a bad block
or something along those lines within said range.
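
A minimal Python sketch of picking such a range: it checks that the
reported LBA lies between 0 and the last LBA (the "out of bounds"
reading of IDNF), then clamps a window around it and prints the
matching smartctl command. LAST_LBA and MARGIN are hypothetical values
chosen for illustration, since the drive's real capacity isn't known
here, and selective_range() is an illustrative helper, not part of any
tool.

```python
FAILED_LBA = 395032335   # LBA reported by the kernel in the recap
LAST_LBA = 488397167     # hypothetical last LBA; use the drive's real value
MARGIN = 10_000_000      # hypothetical window on either side of the LBA


def selective_range(lba, margin, last_lba):
    """Return (start, end) around lba, clamped to 0..last_lba."""
    if not 0 <= lba <= last_lba:
        # This is the "out of bounds" case that IDNF is read to mean.
        raise ValueError(f"LBA {lba} is outside 0..{last_lba}")
    start = max(0, lba - margin)
    end = min(last_lba, lba + margin)
    return start, end


start, end = selective_range(FAILED_LBA, MARGIN, LAST_LBA)
command = f"smartctl -t select,{start}-{end} /dev/ad0"
print(command)
```

With these assumed numbers the reported LBA is in bounds, and the
window comes out as 385032335-405032335, in the same neighbourhood as
the range suggested above.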
Given what I believe IDNF represents, I would say your scan will
probably come back clean.

Also remember that the scan performed here is a *disk-level scan*; the
disk firmware itself is doing it (the OS isn't involved). This helps
rule out any sort of "weird" issues that the OS may be reporting ("hey
man, LBA 8943943983492893428932489324 is bad!" "Yeah sure it is").

> The two devices in the array are on channels 0 and 1. There is indeed a
> second drive on channel 0 (160G). As I said above, I use that as an
> additional back up device but it's not part of the array.

Okay, so executing "atacontrol detach ata0" will cause you to lose both
ad0 and ad1. If you can live with that, then cool.

> > What motherboard is this? Can you change the setting to either
> > "Native", "Enhanced", or (even better) "AHCI"? I've seen some systems
> > where the Serial ATA option in the BIOS has an "Auto" option, which does
> > totally bizarre things at times.
>
> I think this has been covered in subsequent postings. I could try it but
> as you say below, I'd like to resolve the disk issue first.
> ...
> > The atacontrol man page covers your situation:
> > ...
> I don't think this is the case for me since ad0 and ad2 are on separate
> ata channels.
> ...
> Indeed but my hw doesn't have hot-swap capability (at the moment!).

That's the problem -- we're not sure if this really is a disk issue.
It's been reported before, and others have reported solving it by
increasing ATA timeout values, etc. But the fact of the matter is that
the error code is being returned by the device.

Speaking generally about disk replacements on your system -- when I say
generally, I do mean generally and *not* in regards to the specific
situation reported:

Since there's no AHCI in use, we should just assume that a power-down of
the system is the safest way to go about a disk replacement. Follow
that procedure in the future and you should be fine.

If you ever get a hot-swap backplane, you absolutely should use AHCI;
hot-swap, especially on an Intel controller (FreeBSD is tested pretty
thoroughly on Intel ICHxx and ESBx controllers), will work fine in that
case.

If you do go the AHCI route, and eventually upgrade to RELENG_8 down the
line, I highly recommend you load kernel module ahci.ko (instead of the
default/historic ataahci.ko). This will get you NCQ support amongst
other things.

--
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |