Date: Sat, 15 May 2010 09:26:24 -0700
From: Jeremy Chadwick
To: Pieter de Boer
Cc: freebsd-stable@freebsd.org
Subject: Re: Read / write timeouts on SATA disks connected to ICH9
Message-ID: <20100515162624.GA39585@icarus.home.lan>
In-Reply-To: <4BEE476B.6020407@os3.nl>

On Sat, May 15, 2010 at 09:04:11AM +0200, Pieter de Boer wrote:
> Thanks for your elaborate reply, it was very useful to see smartctl
> output explained a bit :) I still think
> there's something else in
> play beside disk failure. I've checked one of the drives I replaced
> earlier, but that one doesn't have any of the errors in its SMART
> output you described, although it did drop out of the mirror
> multiple times during its lifetime.

That could be caused by a multitude of other known things. For example,
some Western Digital "Green" drives (including the Enterprise class
ones) are known to perform head parking/offloading excessively, which
could result in the drive spending more time doing that than actually
serving overall I/O requests. There are some other reports of Samsung
Spinpoint drives experiencing other issues (I've since forgotten and
would have to dig up the threads).

If you could provide full SMART stats for that drive, it might help.

> >The WD Caviar Black drives have a useful feature called TLER -- it's
> >disabled by default, for reasons which I don't want to get into here --
> >which can force the drive to internally give up after X seconds (it's
> >user-selectable) when dealing with such remapping/errors. The idea is
> >to keep the drive from being deemed dead from the OS/controller's point
> >of view. I believe Seagate, Hitachi, or Samsung (I forget which) have
> >this feature as well, but it's not called TLER.
>
> I've read about this feature, but didn't have the time to try to get
> it turned on (iirc you'd need a specific Western Digital DOS-based
> util or something).

Yes, it's a DOS-based utility (like most firmware upgraders these days).
I can provide it if you'd like. I've been meaning to spend some time
trying to reverse-engineer the binary to figure out what ATA commands it
sends to the disk to toggle/adjust the feature (so that one could do it
in real-time rather than have to boot into DOS).

> >If you want to find out the exact LBA that has the problem (there may be
> >more than one), I can step you through performing a selective LBA scan
> >using SMART, since this model of disk does support such.
> >It's easy to
> >do, easy to understand the results, and can be done while the drive is
> >in operation (though I would recommend trying to keep disk I/O to a
> >minimum during this test). Let me know.
>
> At a certain point in time I had read errors from specific LBA's on
> ad4. Using dd I was able to pinpoint those to single sectors.

This isn't very effective: dd will read large chunks of data (read:
multiple LBAs) from the underlying disk at once, rather than the disk
itself performing a per-LBA test. My opinion is that the "dd method"
should only be used on drives which don't support selective LBA scanning
via SMART.

> Overwriting those sectors with what was on ad6 made them readable
> again. What is odd is that the 'remapped sector' count of ad4 is 0.

What may have happened is that the drive took a while to read certain
LBAs (long enough for the OS/controller to time out), but that internal
drive ECC was used to correct the reads and the sectors therefore *did
not* need to be remapped. I do see that Attribute 1 on ad4 is non-zero,
which could indicate said situation, but WD doesn't provide Attribute
195 (ECC recovery rate), which could help here.

SMART implementations are usually quite good (particularly in recent WD
drives), but I have seen situations where certain counters are,
erroneously, not being incremented or changed. I've seen a couple of
brand new disks come out of the factory with non-zero values (indicating
someone at the fab forgot to clear them before shipping).

I'd love to get my hands on a WD utility that zeros out the counters and
re-flashes the drive firmware to rule out any oddities.

It's been proven already that WD will reuse the same F/W version number
despite some code being changed. There was a FreeBSD user who got a F/W
fix from WD for the head offloading/parking ordeal (see above, re: WD
GP), and the firmware version was the same for the old and the new
firmware.
Tracking stuff like this down is basically impossible unless MD5/SHAs of
the firmware files can be provided (good luck).

All HD vendors have their own quirks/ordeals right now. You basically
just have to go with one that works well for you, then if things start
going downhill, switch to another. None of them are perfect.

> Still I'd like to know how do perform such a scan.

smartctl -t select,0-max

This will start a selective LBA scan from LBA 0 to the end of the disk.
If any error is encountered, the scan stops and the error -- including
the LBA where an error was seen -- is output in the SMART self-test and
SMART selective self-test logs. You can then write down the LBA, and
re-run the above command, replacing "0" with the LBA where the error was
seen plus 1.

Here's an example of what a failed selective scan looks like (taken from
a Hitachi disk I just dealt with at work a few weeks ago, starting at
LBA 100000):

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed: read failure       90%      4931         6153934

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA     MAX_LBA  CURRENT_TEST_STATUS
    1   100000  1953525167  Completed_read_failure [90% left] (6153934-6219469)

> >># vmstat -i
> >>interrupt                          total       rate
> >>irq23: atapci0                 371021299      10423
>
> The rate is higher than 10000 also at idle. During a gmirror sync
> from ad6 to ad4, it's about 10670.

In your other post, we determined that your interrupt rate dropped to a
completely normal value (1500 during a gmirror scan or rebuild) after a
system reboot. I'm not surprised a reboot addressed it (for now...).

What this indicates to me is that if a disk falls off the bus on an ICH9
controller in Enhanced (non-AHCI) mode, FreeBSD starts seeing an absurd
number of interrupts generated from the ICH9.
My guess is FreeBSD isn't doing something correctly with the controller
when this happens; maybe certain commands aren't being sent back to the
controller, or handling of certain events is being done improperly when
it comes to ICH9 (or possibly earlier ICH revisions too).

This should be *very* easy to reproduce.

> >"iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you
> >what kind of disk I/O is going on. If actual I/O is very little, then
> >something weird is going on with regards to the number of interrupts
> >being seen on IRQ 23. mav@ might have some ideas, otherwise I'd
> >recommend rebooting the machine and seeing if the number drops. If so,
> >it may be that the OS has some sort of bug where a disk timing out or
> >falling off the bus causes interrupt problems. (It's too bad you don't
> >have AHCI on this system. It handles stuff like this much more
> >elegantly...)
>
> If mav@ or anyone else doesn't have another insight in the interrupt
> rate, I guess a reboot will at least show if it's persistent or
> related to the errors. I'll try to do a reboot when convenient
> (probably sunday morning or something).

If you see any of your disks on the ICH9 controller fall off the bus or
report ATA errors (doesn't matter what kind), please make note of the
timestamp (should be in the kernel log), and ASAP run "smartctl -a" on
the disk. You should compare attributes before and after the event.

You might also want to consider using smartd, which can log SMART
attribute changes on its own. Note that you might have to tune the
arguments in smartd.conf to ignore some attributes which fluctuate
naturally (such as drive temperature and seek error rate).

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
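
For the before/after comparison of SMART attributes suggested above,
here is a minimal sketch in Python, not anything taken from this thread.
It assumes the two "smartctl -a" outputs were saved as text and that the
attribute rows use the usual ten-column layout (ID, name, flag, value,
worst, thresh, type, updated, when_failed, raw). The function names, the
default ignore list (attribute 7, Seek_Error_Rate, and 194,
Temperature_Celsius) and the sample tables are all made up for
illustration.

```python
# Sketch only: compares the vendor attribute tables from two saved
# "smartctl -a" outputs. The sample tables below are invented for
# illustration; they are not from the disks discussed in this thread.

# Attributes skipped by default because they tend to fluctuate on their
# own: 7 (Seek_Error_Rate) and 194 (Temperature_Celsius).
DEFAULT_IGNORE = frozenset({7, 194})


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected row layout (assumption):
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x")):
            # The raw value may contain spaces, so keep the whole tail.
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


def changed_attributes(before, after, ignore=DEFAULT_IGNORE):
    """Return (id, name, old_raw, new_raw) for attributes whose raw value changed."""
    old, new = parse_attributes(before), parse_attributes(after)
    changes = []
    for attr_id in sorted(old.keys() & new.keys()):
        if attr_id in ignore:
            continue
        name, old_raw = old[attr_id]
        new_raw = new[attr_id][1]
        if old_raw != new_raw:
            changes.append((attr_id, name, old_raw, new_raw))
    return changes


HEADER = ("ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH "
          "TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n")

# Invented sample taken "before" the event.
BEFORE = HEADER + """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       100
194 Temperature_Celsius     0x0022   115   108   000    Old_age   Always       -       35
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

# Invented sample taken "after" the event.
AFTER = HEADER + """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       104
194 Temperature_Celsius     0x0022   112   108   000    Old_age   Always       -       38
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
"""

for attr_id, name, old_raw, new_raw in changed_attributes(BEFORE, AFTER):
    print(f"{attr_id:3d} {name}: {old_raw} -> {new_raw}")
```

In practice BEFORE and AFTER would be read from files saved around the
timestamp of the ATA error. Comparing only raw values is a design choice
of this sketch; the normalized VALUE/WORST columns could be compared in
the same way.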