Date: Tue, 14 Feb 2012 16:02:45 +0100
From: Victor Balada Diaz <victor@bsdes.net>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: stable@FreeBSD.org
Subject: Re: problems with AHCI on FreeBSD 8.2
Message-ID: <20120214150245.GR2010@equilibrium.bsdes.net>
In-Reply-To: <20120214141601.GA98986@icarus.home.lan>
References: <20120214091909.GP2010@equilibrium.bsdes.net> <20120214100513.GA94501@icarus.home.lan> <20120214135435.GQ2010@equilibrium.bsdes.net> <20120214141601.GA98986@icarus.home.lan>

On Tue, Feb 14, 2012 at 06:16:01AM -0800, Jeremy Chadwick wrote:
[..]
>
> Thanks. Both your drives look overall fine, sort-of. I'll outline my
> concern points, and ask for some more info:
>
> * ada0 has 28 CRC errors, while ada1 has 2. These drives have been in
> use for 4688 hours and 4583 hours (respectively), which is roughly 6
> months for each drive. CRC errors usually result in transparent
> retransmits, but this can sometimes cause I/O delays (especially if the
> CRC errors are repeated).
>
> If the timeout messages recur in the future, please run the commands I
> gave you above once more and provide the output. I can then compare the
> old to the new and see if there is anything of interest.

I can force the error whenever I want. It's 100% reproducible in my
environment, so I'll run the tests and send you the smartctl -a output
again.

>
> * Both drives had 2 long tests run on them a few days ago ("Extended
> offline" tests). Did you induce these manually? If so, were these
> tests running at the time you witnessed AHCI timeout errors on ada0?
> Short, long, and selective surface scan tests are supposed to be
> non-intrusive, but given the nature of the tests sometimes they can
> stall the I/O subsystem.

I ran the tests, but they were not running when the timeout problems
occurred. The only thing running on the disks was a newfs -J on a
gjournal partition. Otherwise, the disks are mostly idle.

>
> If you do tests of this nature, you should write down the exact
> dates/times when you ran them (at least from now on).
>
> If you didn't induce these, something must have, or possibly the drive
> itself did it (and if that's the case, convenient that it induces an
> entry in the self-test log!).
>
> I do have some familiarity with drives doing internal tests -- the best
> example are old IBM Deskstar drives executing ADM on their own,
> resulting in the drives spinning down and performing internal tests,
> which would subsequently be interrupted by ATA I/O, drive spins back up,
> etc. -- but took too long resulting in ATA timeouts on FreeBSD and
> Linux. I mailed IBM about this back in 2000 and got confirmation of the
> feature (which was also on their SCSI drives but defaulted to off); the
> feature was mysteriously removed in future drive models and still
> remains gone today:
>
> http://jdc.parodius.com/freebsd/ibm_email_aware_of_adm.txt
>
> I'm not saying your drives do this. I'm simply saying that if there is
> some form of automated test that runs on these drives which is
> transparent to the underlying ATA layer, then there is really nothing
> you can do about it, and timeouts are possible. The IBM ADM issue was
> only discovered after reviewing technical specifications/documentation
> and compared to their SCSI drives.

That's of course possible, but as the problem is 100% reproducible with
the AHCI driver and does not occur with the ata driver, I guess this time
it is not the drives' fault. We've also tested replacement disks and
cables over the previous days. I guess the problem is some bad
interaction with the AHCI driver.

>
> * Samsung has a notoriously bad reputation for firmware reliability on
> their SpinPoint drives, but I haven't read of anything bad about the F2
> series, just the F1, F3, and F4 models. I have very little (almost
> none) experience with these drives. I'm not boycotting their products,
> but I wouldn't be surprised if the timeout errors you saw were caused by
> something internal the drive was doing. There is absolutely zero
> visibility into this kind of problem on any layer (even if you had an
> ATA protocol analyser hooked up); you're completely at the mercy of the
> firmware. Just something to keep in mind when working with ANY kind of
> disk (MHDD, SSD, etc.).

I've seen reports on the FreeBSD lists and the smartmontools wiki about
firmware problems with F4 drives manufactured before December 2010, but
according to Samsung's web page, these drives are not affected. I hope
we're not hitting a new bug.

More info:
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

>
> All that said, could you please provide output from the following
> commands as well? These may return "not supported" errors, which is
> acceptable, but we have to check.
>
> * smartctl -l devstat /dev/ada0
> * smartctl -l sataphy /dev/ada0
> * smartctl -l devstat /dev/ada1
> * smartctl -l sataphy /dev/ada1
>

Thanks a lot for your help, Jeremy. Attached is the output of the commands:

fe09# smartctl -l devstat /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

(pass0:ahcich0:0:0:0): READ_LOG_EXT. ACB: 2f 00 04 00 00 40 00 00 00 00 01 00
(pass0:ahcich0:0:0:0): CAM status: ATA Status Error
(pass0:ahcich0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(pass0:ahcich0:0:0:0): RES: 51 04 04 00 00 40 00 00 00 01 00
ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: Unknown error: 0

fe09# smartctl -l sataphy /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           16  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           16  Transition from drive PhyRdy to drive PhyNRdy
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC

fe09# smartctl -l devstat /dev/ada1
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

(pass1:ahcich1:0:0:0): READ_LOG_EXT. ACB: 2f 00 04 00 00 40 00 00 00 00 01 00
(pass1:ahcich1:0:0:0): CAM status: ATA Status Error
(pass1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(pass1:ahcich1:0:0:0): RES: 51 04 04 00 00 40 00 00 00 01 00
ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: Unknown error: 0

fe09# smartctl -l sataphy /dev/ada1
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           16  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           16  Transition from drive PhyRdy to drive PhyNRdy
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC

-- 
The most reliable proof that intelligent life exists on other planets
is that they have not tried to contact us.
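
The thread proposes re-running the commands after the next timeout and comparing old output to new. As a rough sketch of that comparison for the "SATA Phy Event Counters" tables quoted in this message, the following Python parses two saved outputs and lists the counters whose values changed. It assumes the four-column layout shown here (ID, Size, Value, Description); the names parse_sataphy and changed_counters are hypothetical helpers, not part of smartmontools.

```python
import re

# One counter row: hex ID, size in bytes, decimal value, free-text description.
# Assumption: rows look like the sataphy table in this message.
ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_sataphy(text):
    """Return {counter_id: (value, description)} from saved sataphy output."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # banner, header and blank lines simply do not match
            counter_id, _size, value, description = match.groups()
            counters[counter_id] = (int(value), description)
    return counters


def changed_counters(old_text, new_text):
    """List (id, old_value, new_value, description) for counters that moved."""
    old = parse_sataphy(old_text)
    new = parse_sataphy(new_text)
    changes = []
    for counter_id, (new_value, description) in sorted(new.items()):
        # A counter absent from the old output is treated as 0 (an assumption).
        old_value = old.get(counter_id, (0, description))[0]
        if new_value != old_value:
            changes.append((counter_id, old_value, new_value, description))
    return changes
```

With the output saved to files before and after reproducing the timeout (file names of your choosing), passing both file contents to changed_counters would show, for example, whether the COMRESET or PhyRdy transition counters increased across the event.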