From: Dennis Glatting
To: Jeremy Chadwick
Cc: freebsd-fs@freebsd.org
Date: Fri, 20 Jan 2012 19:43:08 -0800
Message-ID: <1327117388.29408.24.camel@btw.pki2.com>
Subject: Re: sanity check: is 9211-8i, on 8.3, with IT firmware still "the one"

Data points update:

I thought this problem may be related to a specific RAID controller
(LSI 9211-8i - "R") first used on the disks. So I used it on a new,
different set of disks.
Those disks work fine afterwards:

ada3 at ata0 bus 0 scbus6 target 0 lun 0
ada3: ATA-8 SATA 3.x device
ada3: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
ada3: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada3: Previously was known as ad0
ada4: ATA-8 SATA 3.x device
ada4: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
ada4: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada4: Previously was known as ad1

bd3# dd if=/dev/zero of=/dev/ada3 count=8
8+0 records in
8+0 records out
4096 bytes transferred in 0.006000 secs (682662 bytes/sec)
bd3# dd if=/dev/zero of=/dev/ada4 count=8
8+0 records in
8+0 records out
4096 bytes transferred in 0.001953 secs (2097408 bytes/sec)

I used SeaTools on one of the disks from the first set
(ST1000DL002-9TT153). On a long test the tool declared there were
errors that it could not fix. I didn't see much point in trying the
second disk.

So, two separately purchased disks from the same vendor (TigerDirect),
both bad? What are the odds of that?

Hmm...

On Fri, 2012-01-20 at 10:18 -0800, Jeremy Chadwick wrote:
> On Fri, Jan 20, 2012 at 08:31:34AM -0800, Dennis Glatting wrote:
> > On Fri, 2012-01-20 at 07:31 -0800, Jeremy Chadwick wrote:
> > >
> > > On Fri, Jan 20, 2012 at 06:22:11AM -0800, Dennis Glatting wrote:
> > > > I am having a problem with Seagate ST1000DL002 disks but I haven't yet
> > > > determined whether it is the disks themselves (they -- two of them, new
> > > > -- fail under a MB controller too).
> > >
> > > Assuming the disks are seen directly on the bus (e.g. show up as daX,
> > > adaX, or whatever), please install ports/sysutils/smartmontools (make
> > > sure you're using version 5.42 or newer) and please provide output from
> > > the following command: "smartctl -a /dev/XXX" where XXX is the device
> > > name of the ST1000DL002 disk(s). Please be sure to state which device
> > > name is associated with which smartctl output.
> > > You can delete or remove the disk serial numbers from the output
> > > (for privacy) if you wish. I'll be happy to review the data and
> > > tell you whether or not the disks themselves are showing problems
> > > or if the issue is elsewhere.
> >
> > That is the motivation I needed to reboot that system, which was 50%
> > through a task. That said, as remains the case today, for the last 20
> > years I haven't been able to find that "Any Key" on reboot. :)
> >
> > Regardless...
>
> First off, let's start with the full picture. Readers need to know
> exactly what is going on within your controller setup, what disks are
> connected to what, etc. Taken from your full dmesg below, and turned
> into something easy-to-read (mostly)
>
> Controller mps0
> --> LSI SAS2008
> --> IRQ 19 on pci1
> --> Firmware 12.00.00.00
> --> Disks attached:
>     --> da0 --> WDC WD25EZRS, SATA300
>     --> da1 --> WDC WD25EZRS, SATA300
>     --> da2 --> WDC WD25EZRS, SATA300
>     --> da3 --> WDC WD25EZRS, SATA300
>     --> da4 --> WDC WD25EZRS, SATA300
>     --> da5 --> WDC WD25EZRS, SATA300
>     --> da6 --> WDC WD25EZRS, SATA300
>     --> da7 --> WDC WD25EZRS, SATA300
>
> Controller mps1
> --> LSI SAS2008
> --> IRQ 19 on pci5
> --> Firmware 12.00.00.00
> --> Disks attached:
>     --> None
>
> Controller mps2
> --> LSI SAS2008
> --> IRQ 16 on pci6
> --> Firmware 12.00.00.00
> --> Disks attached:
>     --> da8 --> WDC WD25EZRS, SATA300
>     --> da9 --> WDC WD25EZRS, SATA300
>     --> da10 --> WDC WD25EZRS, SATA300
>     --> da11 --> WDC WD25EZRS, SATA300
>     --> da12 --> ST1000DL002, SATA300
>
> Controller ahci0
> --> ATI IXP700 AHCI (4-port)
> --> IRQ 19 on pci0
> --> Disks attached:
>     --> ahcich0 --> ada0 --> Corsair Force 3 SSD, SATA600
>     --> ahcich1 --> ada1 --> OCZ-AGILITY2 SSD, SATA300
>     --> ahcich2 --> ada2 --> ST31000333AS, SATA300
>
> Controller ata0
> --> ATI IXP700/800 ATA133 (2-port/4-device, PATA)
> --> IRQ on pci0
>     --> I would assume this would be on IRQ 14 or 15, sigh...
> --> Disks attached:
>     --> None
>
> Now that we have a full picture, let's continue.
>
> > An attempt to write to it:
> >
> > bd3# dd if=/dev/zero of=/dev/da12
> > dd: /dev/da12: Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes transferred in 0.378153 secs (0 bytes/sec)
>
> The dd command you executed to write zeros to the disk, 512 bytes at
> a time, starting at LBA 0, failed when writing the first 512 bytes. So,
> from my perspective, writing to LBA 0 is failing.
>
> You should also keep in mind that this dd command to zero the disk (if
> it were to work) would take a very long time to complete. If you used a
> larger block size (bs=64k or maybe larger), it would be a lot faster.
> Just a tip. Starting with bs=512 (default) is fine, or in this case
> using 4096 would probably be better (see below), but whatever.
>
> > The disk is presently connected to this device (LSI 9211-8i) but I have
> > also had it connected to the devices on the MB and I think to a
> > SuperMicro board. I have also tried a different LSI board.
>
> Thanks for sharing this -- this is important information, but let's not
> start moving the drive around any more, okay? There's no point. The
> information you've given is enough, and I'll explain it in detail.
>
> > {snipping for brevity}
> >
> > bd3# smartctl -a /dev/da12
> > smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-STABLE amd64] (local build)
> > Copyright (C) 2002-11 by Bruce Allen,
> > http://smartmontools.sourceforge.net
> >
> > === START OF INFORMATION SECTION ===
> > Model Family:     Seagate Barracuda Green (Adv. Format)
> > Device Model:     ST1000DL002-9TT153
> > Serial Number:    W1V06SLR
> > LU WWN Device Id: 5 000c50 037e11be9
> > Firmware Version: CC32
> > User Capacity:    1,000,204,886,016 bytes [1.00 TB]
> > Sector Size:      512 bytes logical/physical
> > Device is:        In smartctl database [for details use: -P show]
> > ATA Version is:   8
> > ATA Standard is:  ATA-8-ACS revision 4
> > Local Time is:    Fri Jan 20 08:22:34 2012 PST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > {snipping for brevity}
> >
> > SMART Attributes Data Structure revision number: 10
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       241488
> >   3 Spin_Up_Time            0x0003   087   070   000    Pre-fail  Always       -       0
> >   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       28
> >   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       136324
> >   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       576
> >  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       29
> > 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
> > 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
> > 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> > 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
> > 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
> > 190 Airflow_Temperature_Cel 0x0022   073   062   045    Old_age   Always       -       27 (Min/Max 21/27)
> > 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
> > 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       23
> > 193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       29
> > 194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 21 0 0 0)
> > 195 Hardware_ECC_Recovered  0x001a   027   008   000    Old_age   Always       -       241488
> > 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       265544943010369
> > 241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3746932548
> > 242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3212957483
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > {snipping more}
>
> Your SMART attributes here appear perfectly fine. There is no
> indication of bad LBAs (sectors) on the drive, or even "suspect" LBAs on
> the drive. If LBA 0, for example, was actually bad (meaning the sector
> itself), that would show up in the SMART error log (most likely), and if
> not there, at bare minimum as some form of incremented RAW_VALUE field
> in one of many attributes (either 5, 197, or 198; possibly 187, I forget).
>
> SMART attributes 1, 7, and 195 on Seagate drives are always "crazy";
> that is to say, they are not incremental counters, they are
> vendor-encoded. smartmontools does not know how to decode some of these
> attributes (on SOME Seagate drives it does, on others it doesn't). I
> state this because people read SMART attributes wrong ~70% of the time;
> they see non-zero numbers and go "oh my god, it's broken!" No it isn't.
> SMART attribute values/decoding are not part of the ATA specification
> (even working draft), so it's all proprietary more or less.
>
> I also want to assume attribute 240 is vendor-encoded as well, probably
> as multiple data sets stored within the full 6-byte attribute field;
> again, smartmontools doesn't know how to decode this. I wouldn't worry
> about this, again even though the number is huge. :-)
>
> SMART attribute 184 keeps track of errors occurring between the drive
> controller (on the PCB) and the drive cache; there are no cache errors.
> That's good, and I'm glad to see vendors implementing this.
>
> SMART attribute 188 indicates the drive itself has not counted any
> command timeouts (these would be ATA commands sent from the OS through
> the SATA/SAS controller to the drive controller, which timed out at the
> phase when the drive attempted to read/write data from a sector).
>
> SMART attribute 199 indicates there are no cabling problems or "physical
> issues between the disk and the SATA/SAS controller" (bad connectors,
> dust in the connectors, shoddy hot-swap plane, bad port, etc.).
>
> SMART attribute 183 is something I haven't seen before (I'm more
> familiar with Western Digital disks), but it also looks fine.
>
> So again: your drive looks perfectly healthy per SMART stats. But
> there's something amusing about this situation that a lot of people
> overlook...
>
> > {snipping dmesg for brevity, but here's the URL for readers so they
> > can see it themselves:
> > http://lists.freebsd.org/pipermail/freebsd-fs/2012-January/013481.html
> > }
> >
> > {simplify the SCSI errors shown}
> >
> > (da12:mps2:0:5:0): READ(6). CDB: 8 0 0 1 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): READ(10). CDB: 28 0 74 70 6d af 0 0 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): WRITE(6). CDB: a 0 0 0 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
>
> Based on this, we know the following:
>
> - The da12 disk is doing something weird when it comes to reads AND
>   writes.
> - The da12 disk is not timing out; it receives an immediate error on
>   reads and writes (coming back from the controller; whether or not the
>   ATA command block makes it to the disk is unknown, but I have to
>   assume it does).
> - The da12 disk, at one time, was working/usable as indicated by some
>   SMART attributes.
> - The da12 disk is the only ST1000DL002 disk in the system.
> - The da12 disk is on the same controller as 4 other disks.
> - The da8 through da11 disks (WD25EZRS) on the mps2 controller are
>   performing fine with no issues (I have to assume this).
> - The ST1000DL002 disk is an Advanced Format disk (4096-byte sectors).
> - All the WD25EZRS disks are Advanced Format disks (4096-byte sectors).
> - The ST1000DL002 disk behaves badly when used on the on-board AHCI
>   controller as well as a completely different motherboard (presumably).
>
> Here's the fun part:
>
> ATA commands being submitted from the OS to the disk (specifically the
> controller on the disk itself) are working fine. SMART attributes are
> obtained via an ATA command that, internally on mechanical drives,
> fetches data from the HPA (Host Protected Area) region of the drive (see
> Wikipedia if you don't know about this), and returns that data. AFAIK
> this data is not cached in any way, it's almost always read straight
> from the HPA.
>
> So this means we know I/O communication between the OS and controller,
> and the controller and the disk, works fine. And we also know, at least
> with regards to the HPA region, that the heads can read data from the HPA
> region successfully. Great.
>
> Could this be a controller problem (e.g. a firmware bug that affects
> compatibility with ST1000DL002 drives)? I'm about 95% certain the
> answer is no. The reason is that the ST1000DL002 drive behaved the same
> when put on other controllers.
>
> What all this means is that the drive, in effect, refuses to read data
> from non-HPA regions of the disk -- that means LBA 0 to . Why
> or how could this happen? Unknown, because there's a *ton* of
> possibilities -- way more than I care to speculate. :-)
>
> Have I seen this problem before? Yes -- many times, but only once with
> a SATA drive:
>
> - I see this on rare occasion with Fujitsu SCSI disks at my workplace,
> where the drives flat out refuse to do I/O any longer. However, these
> return a vendor-specific ASC + ASCQ that indicate the drive is in a
> "locked" or "frozen" state, requiring Fujitsu to investigate. I've seen
> it happen a good 10, maybe 20 times over the past few years on drives
> manufactured from 2001 to 2007. Thankfully Fujitsu provides full docs
> on their SCSI drives so I was able to look up the ASC/ASCQ and figure
> out it was an internal drive failure. We disposed of the disks
> properly/securely.
>
> - In the SATA case, the end-user's drive behaved the same as yours. I
> do not remember what brand (it really doesn't matter though). In their
> case, however, the HPA region was corrupt; the drive spit out weird
> errors during SMART attribute fetch, and those attributes which it did
> fetch were *completely* garbled. My guess was a bad HPA region of the
> drive, combined with either a firmware bug or something mechanical or
> head problems. The end-user RMA'd the drive and the replacement worked
> fine.
>
> My advice at this point (#1 is optional):
>
> 1. If you're curious and just interested in learning: put the
> ST1000DL002 disk on a system where it's the only disk, and hooked
> directly to the motherboard (and not in AHCI mode), and boot SeaTools
> from a CD or USB stick.
>
> I'm willing to bet you get back an error code on the quick/short test
> (which does more than just a SMART short test). If that does pass, try
> doing a long test (which reads all the LBAs on the drive). I'll be
> very, VERY surprised if that passes.
>
> 2. File an RMA with Seagate. The simple version is that all LBA I/O
> (standard read/write) is being rejected by the drive for unknown
> reasons.
>
> Good luck, and hope this sheds some light on the "fun" (or not so fun)
> world of hard disk troubleshooting. And don't ask me to troubleshoot an
> SSD. ;-)
>
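
For reference, the CDBs in the aborted da12 commands quoted above can be
decoded by hand to see which sectors were involved. What follows is only a
rough Python sketch and not part of the original exchange: the helper name
decode_cdb is made up for illustration, and the byte offsets assume the
standard 6-byte and 10-byte READ/WRITE command layouts.

```python
# Sketch: decode the READ/WRITE CDBs from the da12 errors quoted above.
# decode_cdb is a hypothetical helper; offsets assume the standard
# 6-byte and 10-byte READ/WRITE command layouts.

OPCODES = {0x08: "READ(6)", 0x0A: "WRITE(6)", 0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb(text):
    """Return (command name, starting LBA, block count) for a hex CDB string."""
    cdb = [int(byte, 16) for byte in text.split()]
    name = OPCODES.get(cdb[0])
    if name is None:
        raise ValueError("opcode 0x%02x is not handled by this sketch" % cdb[0])
    expected = 6 if name.endswith("(6)") else 10
    if len(cdb) != expected:
        raise ValueError("%s needs %d bytes, got %d" % (name, expected, len(cdb)))
    if expected == 6:
        # 21-bit LBA: low 5 bits of byte 1, then bytes 2 and 3.
        lba = ((cdb[1] & 0x1F) << 16) | (cdb[2] << 8) | cdb[3]
        # A transfer length of 0 means 256 blocks in the 6-byte form.
        blocks = cdb[4] or 256
    else:
        # 32-bit big-endian LBA in bytes 2-5, 16-bit length in bytes 7-8.
        lba = int.from_bytes(bytes(cdb[2:6]), "big")
        blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return name, lba, blocks


# "User Capacity" from the smartctl output, divided by 512-byte sectors.
LAST_LBA = 1000204886016 // 512 - 1

for line in ("8 0 0 1 1 0", "a 0 0 0 1 0", "28 0 74 70 6d af 0 0 1 0"):
    name, lba, blocks = decode_cdb(line)
    where = " (last sector)" if lba == LAST_LBA else ""
    print("%-9s LBA %d, %d block(s)%s" % (name, lba, blocks, where))
```

If those layouts are the right ones, the WRITE(6) is a one-block write to
LBA 0 (matching the failed dd), the READ(6) is a one-block read of LBA 1,
and the READ(10) targets LBA 1953525167, the final sector of a
1953525168-sector disk. The aborted commands therefore touch both ends of
the LBA range, which is consistent with the quoted conclusion that ordinary
LBA I/O is being rejected by the drive.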