From: Jeremy Chadwick
To: Dennis Glatting
Cc: freebsd-fs@freebsd.org
Date: Fri, 20 Jan 2012 10:18:29 -0800
Subject: Re: sanity check: is 9211-8i, on 8.3, with IT firmware still "the one"
Message-ID: <20120120181828.GA1049@icarus.home.lan>
In-Reply-To: <1327077094.29408.11.camel@btw.pki2.com>

On Fri, Jan 20, 2012 at 08:31:34AM -0800, Dennis Glatting wrote:
> On Fri, 2012-01-20 at 07:31 -0800, Jeremy Chadwick wrote:
>
> > On Fri, Jan 20, 2012 at 06:22:11AM -0800, Dennis Glatting wrote:
> > > I am having a problem with Seagate ST1000DL002 disks but I haven't yet
> > > determined weather it is the disks themselves (they -- two of them, new
> > > -- fail under a MB controller too.
> >
> > Assuming the disks are seen directly on the bus (e.g. show up as daX,
> > adaX, or whatever), please install ports/sysutils/smartmontools (make
> > sure you're using version 5.42 or newer) and please provide output from
> > the following command: "smartctl -a /dev/XXX" where XXX is the device
> > name of the ST1000DL002 disk(s). Please be sure to state which device
> > name is associated with which smartctl output. You can delete or
> > remove the disk serial numbers from the output (for privacy) if you
> > wish. I'll be happy to review the data and tell you whether or not the
> > disks themselves are showing problems or if the issue is elsewhere.
>
> That is the motivation I needed to reboot that system, which was 50%
> through a task. That said, as remains the case today, for the last 20
> years I haven't been able to find that "Any Key" on reboot. :)
>
> Regardless...

First off, let's start with the full picture. Readers need to know
exactly what is going on within your controller setup, what disks are
connected to what, etc.

Taken from your full dmesg below, and turned into something (mostly)
easy to read:

Controller mps0
  --> LSI SAS2008
  --> IRQ 19 on pci1
  --> Firmware 12.00.00.00
  --> Disks attached:
      --> da0  --> WDC WD25EZRS, SATA300
      --> da1  --> WDC WD25EZRS, SATA300
      --> da2  --> WDC WD25EZRS, SATA300
      --> da3  --> WDC WD25EZRS, SATA300
      --> da4  --> WDC WD25EZRS, SATA300
      --> da5  --> WDC WD25EZRS, SATA300
      --> da6  --> WDC WD25EZRS, SATA300
      --> da7  --> WDC WD25EZRS, SATA300

Controller mps1
  --> LSI SAS2008
  --> IRQ 19 on pci5
  --> Firmware 12.00.00.00
  --> Disks attached:
      --> None

Controller mps2
  --> LSI SAS2008
  --> IRQ 16 on pci6
  --> Firmware 12.00.00.00
  --> Disks attached:
      --> da8  --> WDC WD25EZRS, SATA300
      --> da9  --> WDC WD25EZRS, SATA300
      --> da10 --> WDC WD25EZRS, SATA300
      --> da11 --> WDC WD25EZRS, SATA300
      --> da12 --> ST1000DL002, SATA300

Controller ahci0
  --> ATI IXP700 AHCI (4-port)
  --> IRQ 19 on pci0
  --> Disks attached:
      --> ahcich0 --> ada0 --> Corsair Force 3 SSD, SATA600
      --> ahcich1 --> ada1 --> OCZ-AGILITY2 SSD, SATA300
      --> ahcich2 --> ada2 --> ST31000333AS, SATA300

Controller ata0
  --> ATI IXP700/800 ATA133 (2-port/4-device, PATA)
  --> IRQ on pci0
      --> I would assume this would be on IRQ 14 or 15, sigh...
  --> Disks attached:
      --> None

Now that we have a full picture, let's continue.

> An attempt to write to it:
>
> bd3# dd if=/dev/zero of=/dev/da12
> dd: /dev/da12: Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes transferred in 0.378153 secs (0 bytes/sec)

The dd command you executed writes zeros to the disk 512 bytes at a
time, starting at LBA 0, and it failed on the very first 512-byte
write. So, from my perspective, writing to LBA 0 is failing.

You should also keep in mind that this dd command to zero the disk (if
it were to work) would take a very long time to complete. If you used a
larger block size (bs=64k or maybe larger), it would be a lot faster.
Just a tip.
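
To put a rough number on the block-size tip, here is a small sketch
that counts how many writes dd would have to issue to cover this disk
at a few block sizes. The capacity is the "User Capacity" figure from
the smartctl output quoted in this message; the helper name
writes_needed is made up for illustration, and the count says nothing
exact about wall-clock time, only about per-call overhead.

```python
# Capacity in bytes, taken from the smartctl "User Capacity" line
# for the ST1000DL002 quoted in this thread.
CAPACITY_BYTES = 1_000_204_886_016


def writes_needed(capacity_bytes: int, block_size: int) -> int:
    """Number of writes needed to cover the disk (the last may be short)."""
    # Ceiling division without floating point.
    return -(-capacity_bytes // block_size)


if __name__ == "__main__":
    for label, bs in (("512 (dd default)", 512),
                      ("4096", 4096),
                      ("64k", 64 * 1024)):
        count = writes_needed(CAPACITY_BYTES, bs)
        print(f"bs={label:<18} {count:>15,} writes")
```

Going from bs=512 to bs=64k cuts the number of writes by a factor of
128, which is where most of the speedup comes from.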
Starting with bs=512 (default) is fine, or in this case using 4096
would probably be better (see below), but whatever.

> The disk is presently connected to this device (LSI 9211-8i) but I have
> also had it connected to the devices on the MB and I think to a
> SuperMicro board. I have also tried a different LSI board.

Thanks for sharing this -- this is important information, but let's not
start moving the drive around any more, okay? There's no point. The
information you've given is enough, and I'll explain it in detail.

> {snipping for brevity}
>
> bd3# smartctl -a /dev/da12
> smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-STABLE amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen,
> http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda Green (Adv. Format)
> Device Model:     ST1000DL002-9TT153
> Serial Number:    W1V06SLR
> LU WWN Device Id: 5 000c50 037e11be9
> Firmware Version: CC32
> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 4
> Local Time is:    Fri Jan 20 08:22:34 2012 PST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> {snipping for brevity}
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f 108   099   006    Pre-fail Always  -           241488
>   3 Spin_Up_Time            0x0003 087   070   000    Pre-fail Always  -           0
>   4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           28
>   5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
>   7 Seek_Error_Rate         0x000f 100   253   030    Pre-fail Always  -           136324
>   9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           576
>  10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
>  12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           29
> 183 Runtime_Bad_Block       0x0032 100   100   000    Old_age  Always  -           0
> 184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
> 187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
> 188 Command_Timeout         0x0032 100   100   000    Old_age  Always  -           0
> 189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
> 190 Airflow_Temperature_Cel 0x0022 073   062   045    Old_age  Always  -           27 (Min/Max 21/27)
> 191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           0
> 192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           23
> 193 Load_Cycle_Count        0x0032 100   100   000    Old_age  Always  -           29
> 194 Temperature_Celsius     0x0022 027   040   000    Old_age  Always  -           27 (0 21 0 0 0)
> 195 Hardware_ECC_Recovered  0x001a 027   008   000    Old_age  Always  -           241488
> 197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
> 198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
> 199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
> 240 Head_Flying_Hours       0x0000 100   253   000    Old_age  Offline -           265544943010369
> 241 Total_LBAs_Written      0x0000 100   253   000    Old_age  Offline -           3746932548
> 242 Total_LBAs_Read         0x0000 100   253   000    Old_age  Offline -           3212957483
>
> SMART Error Log Version: 1
> No Errors Logged
>
> {snipping more}

Your SMART attributes here appear
perfectly fine. There is no indication of bad LBAs (sectors) on the
drive, or even "suspect" LBAs on the drive. If LBA 0, for example, was
actually bad (meaning the sector itself), that would show up in the
SMART error log (most likely), and if not there, at bare minimum as some
form of incremented RAW_VALUE field in one of many attributes (either 5,
197, or 198; possibly 187, I forget).

SMART attributes 1, 7, and 195 on Seagate drives are always "crazy";
that is to say, they are not incremental counters, they are
vendor-encoded. smartmontools does not know how to decode some of these
attributes (on SOME Seagate drives it does, on others it doesn't).

I state this because people read SMART attributes wrong ~70% of the
time; they see non-zero numbers and go "oh my god, it's broken!" No it
isn't. SMART attribute values/decoding are not part of the ATA
specification (even working draft), so it's all proprietary more or
less.

I also want to assume attribute 240 is vendor-encoded as well, probably
as multiple data sets stored within the full 6-byte attribute field;
again, smartmontools doesn't know how to decode this. I wouldn't worry
about it, even though the number is huge. :-)

SMART attribute 184 keeps track of errors occurring between the drive
controller (on the PCB) and the drive cache; there are no cache errors.
That's good, and I'm glad to see vendors implementing this.

SMART attribute 188 indicates the drive itself has not counted any
command timeouts (these would be ATA commands sent from the OS through
the SATA/SAS controller to the drive controller, which timed out at the
phase when the drive attempted to read/write data from a sector).

SMART attribute 199 indicates there are no cabling problems or "physical
issues between the disk and the SATA/SAS controller" (bad connectors,
dust in the connectors, shoddy hot-swap backplane, bad port, etc.).
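
For anyone who wants to apply the "only look at the attributes that
matter" rule mechanically, here is a rough sketch. It parses rows in the
layout of the smartctl attribute table quoted above and flags only
attributes 5, 187, 197 and 198 when their RAW_VALUE is non-zero. The
function names and the SAMPLE text are made up for illustration; this is
not part of smartmontools, and the ID set is just the one named in this
message, not an authoritative list.

```python
# Attribute IDs this message names as the ones whose RAW_VALUE would
# normally increment if sectors were really going bad.
BAD_SECTOR_IDS = {5, 187, 197, 198}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value string) from smartctl-style rows."""
    attrs = {}
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") in front of each row.
        fields = line.strip().lstrip("> ").split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def suspicious(attrs):
    """Return {id: name} for bad-sector attributes with a non-zero raw value."""
    flagged = {}
    for attr_id in sorted(attrs.keys() & BAD_SECTOR_IDS):
        name, raw = attrs[attr_id]
        # Raw values may carry extra text, e.g. "27 (Min/Max 21/27)".
        if int(raw.split()[0]) != 0:
            flagged[attr_id] = name
    return flagged


# A few rows copied from the table quoted in this message.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f 108   099   006    Pre-fail Always  -           241488
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
"""

if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)))
```

Note that attribute 1 is deliberately ignored despite its large raw
value, for the vendor-encoding reason given above.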

SMART attribute 183 is something I haven't seen before (I'm more
familiar with Western Digital disks), but it also looks fine.

So again: your drive looks perfectly healthy per SMART stats. But
there's something amusing about this situation that a lot of people
overlook...

> {snipping dmesg for brevity, but here's the URL for readers so they
> can see it themselves:
> http://lists.freebsd.org/pipermail/freebsd-fs/2012-January/013481.html
> }
>
> {simplify the SCSI errors shown}
>
> (da12:mps2:0:5:0): READ(6). CDB: 8 0 0 1 1 0
> (da12:mps2:0:5:0): CAM status: SCSI Status Error
> (da12:mps2:0:5:0): SCSI status: Check Condition
> (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> (da12:mps2:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
> (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> (da12:mps2:0:5:0): READ(10). CDB: 28 0 74 70 6d af 0 0 1 0
> (da12:mps2:0:5:0): CAM status: SCSI Status Error
> (da12:mps2:0:5:0): SCSI status: Check Condition
> (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> (da12:mps2:0:5:0): WRITE(6). CDB: a 0 0 0 1 0
> (da12:mps2:0:5:0): CAM status: SCSI Status Error
> (da12:mps2:0:5:0): SCSI status: Check Condition
> (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)

Based on this, we know the following:

- The da12 disk is doing something weird when it comes to reads AND
  writes.

- The da12 disk is not timing out; it receives an immediate error on
  reads and writes (coming back from the controller; whether or not the
  ATA command block makes it to the disk is unknown, but I have to
  assume it does).

- The da12 disk, at one time, was working/usable as indicated by some
  SMART attributes.

- The da12 disk is the only ST1000DL002 disk in the system.

- The da12 disk is on the same controller as 4 other disks.

- The da8 through da11 disks (WD25EZRS) on the mps2 controller are
  performing fine with no issues (I have to assume this).

- The ST1000DL002 disk is an Advanced Format disk (4096-byte sectors).

- All the WD25EZRS disks are Advanced Format disks (4096-byte sectors).

- The ST1000DL002 disk behaves badly when used on the on-board AHCI
  controller as well as a completely different motherboard (presumably).

Here's the fun part:

ATA commands being submitted from the OS to the disk (specifically the
controller on the disk itself) are working fine. SMART attributes are
obtained via an ATA command that, internally on mechanical drives,
fetches data from the HPA (Host Protected Area) region of the drive (see
Wikipedia if you don't know about this), and returns that data. AFAIK
this data is not cached in any way; it's almost always read straight
from the HPA.

So this means we know I/O communication between the OS and controller,
and the controller and the disk, works fine. And we also know, at least
with regards to the HPA region, that the heads can read data from the
HPA region successfully. Great.

Could this be a controller problem (e.g. a firmware bug that affects
compatibility with ST1000DL002 drives)? I'm about 95% certain the
answer is no. The reason is that the ST1000DL002 drive behaved the same
when put on other controllers.

What all this means is that the drive, in effect, refuses to read data
from non-HPA regions of the disk -- that means LBA 0 through the last
user-addressable LBA.

Why or how could this happen? Unknown, because there's a *ton* of
possibilities -- way more than I care to speculate. :-)

Have I seen this problem before? Yes -- many times, but only once with
a SATA drive:

- I see this on rare occasion with Fujitsu SCSI disks at my workplace,
  where the drives flat out refuse to do I/O any longer. However, these
  return a vendor-specific ASC + ASCQ that indicates the drive is in a
  "locked" or "frozen" state, requiring Fujitsu to investigate.
  I've seen it happen a good 10, maybe 20 times over the past few years
  on drives manufactured from 2001 to 2007. Thankfully Fujitsu provides
  full docs on their SCSI drives so I was able to look up the ASC/ASCQ
  and figure out it was an internal drive failure. We disposed of the
  disks properly/securely.

- In the SATA case, the end-user's drive behaved the same as yours. I
  do not remember what brand (it really doesn't matter though). In
  their case, however, the HPA region was corrupt; the drive spit out
  weird errors during SMART attribute fetch, and those attributes which
  it did fetch were *completely* garbled. My guess was a bad HPA region
  of the drive, combined with either a firmware bug or something
  mechanical or head problems. The end-user RMA'd the drive and the
  replacement worked fine.

My advice at this point (#1 is optional):

1. If you're curious and just interested in learning: put the
   ST1000DL002 disk on a system where it's the only disk, and hooked
   directly to the motherboard (and not in AHCI mode), and boot SeaTools
   from a CD or USB stick. I'm willing to bet you get back an error
   code on the quick/short test (which does more than just a SMART short
   test). If that does pass, try doing a long test (which reads all the
   LBAs on the drive). I'll be very, VERY surprised if that passes.

2. File an RMA with Seagate. The simple version is that all LBA I/O
   (standard read/write) is being rejected by the drive for unknown
   reasons.

Good luck, and hope this sheds some light on the "fun" (or not so fun)
world of hard disk troubleshooting. And don't ask me to troubleshoot an
SSD. ;-)

-- 
| Jeremy Chadwick                                 jdc@parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |