From: Jeremy Chadwick
To: Willem Jan Withagen
Cc: "stable@freebsd.org"
Date: Wed, 1 Feb 2012 08:35:56 -0800
Subject: Re: Troube with SSD
Message-ID: <20120201163556.GA97343@icarus.home.lan>
In-Reply-To: <4F2960A7.8040705@digiware.nl>
References: <4F2940C1.10901@digiware.nl> <20120201143942.GA96012@icarus.home.lan> <4F2960A7.8040705@digiware.nl>
List-Id: Production branch of FreeBSD source code

On Wed, Feb 01, 2012 at 04:56:23PM +0100, Willem Jan Withagen wrote:
> On 2012-02-01 15:39, Jeremy Chadwick wrote:
> > On Wed, Feb 01, 2012 at 02:40:17PM +0100, Willem Jan Withagen wrote:
> >> The device is a Corsair 60Gb Force GT. And thusfar I have not found any
> >> suggestions that that serie of devices is prone to doing this.
> >
> > Can you please provide the following output when that SSD is attached
> > to the system? You will need to install ports/sysutils/smartmontools
> > for this (please make sure it's version 5.42 or newer).
> >
> > * smartctl -a /dev/whatever
> > * smartctl -l devstat /dev/whatever
> > * smartctl -l sataphy /dev/whatever
> > * smartctl -l ssd /dev/whatever
>
> Eh, the last 3 look like they are not supported on the 3ware controller:
> ATA_READ_LOG_EXT (addr=0x00:0x00, page=0, n=1) failed: 48-bit ATA
> commands not supported
> Read GP Log Directory failed.

This indicates that either a) the driver on FreeBSD, b) the controller
itself, or c) the controller firmware does not permit these specific
kinds of SMART sub-commands to be passed to the underlying device.

> So I'll have to put it back in my real fileserver.

Yes, please do. In fact, I wish you had not moved the disk to another
machine at all. I wish people would not do this around the time they
ask for help; wait until someone clueful has exhausted the existing
analysis before doing that. Moving hardware adds great complexity to
the situation, because then I have to start asking questions like "did
you power off the machine before moving the drive?" "Did you use the
same SATA cables?" "Is the controller on the other machine identical?"
You get the idea. It is extremely taxing for me to track all of these
things, because 99% of people do not write down what they do when they
start moving hardware around.

I'm not necessarily lecturing you, I'm more or less ranting -- I go
through this situation two or three times a week with people I help
online, and it wastes a lot of time.
I have another individual in private email who asked me for help with
2 disks (one SSD, one HD), and kept moving the drives around between 3
different machines, giving me random output from each one (behaviour
differed per box). I cannot deal with this kind of situation.

> The output of the first one command, but it contains some real weird
> values.....??

All the values below look fine to me. I will try my best to explain.

> === START OF INFORMATION SECTION ===
> Device Model:     Corsair Force GT
> Serial Number:    11296503000005870891
> LU WWN Device Id: 0 000000 000000000
> Firmware Version: 1.2
> User Capacity:    60,022,480,896 bytes [60.0 GB]
> Sector Size:      512 bytes logical/physical
> Device is:        Not in smartctl database [for details use: -P showall]

First thing to note is the last line here. smartmontools does not
appear to have knowledge of all the quirks/SMART attribute data for
this model of Corsair SSD. So, some data may be inaccurate, and it does
the best it can.

Reformatting output to not force newlines/wrapping:

> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   082   082   050    Pre-fail  Always   -           897651373777
>   5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always   -           0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           121242631799621
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           31
> 171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always   -           0
> 172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always   -           0
> 174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline  -           19
> 177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline  -           0
> 181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always   -           0
> 182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always   -           0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
> 194 Temperature_Celsius     0x0022   026   035   000    Old_age   Always   -           26 (Min/Max 21/35)
> 195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline  -           897651373777
> 196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always   -           0
> 201 Soft_Read_Error_Rate    0x001c   120   120   000    Old_age   Offline  -           897651373777
> 204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline  -           897651373777
> 230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always   -           429496729700
> 231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always   -           0
> 233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline  -           1260
> 234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always   -           1925
> 241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always   -           1925
> 242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always   -           1032

These values all look acceptable/excellent as best I can tell.

The only attribute above that interests me is attribute 174.
smartmontools doesn't know what this is, but I am curious to know what
the value "19" (which to me appears to be a counter or gauge) actually
represents.

Also, just to note: I think it's cool that Corsair put a thermistor or
DTS inside of their drive for temperature readings. Wise of them!

What you probably meant by "real weird values" are the extremely high
numbers in the RAW_VALUE column. This is a sign of an individual who
lacks familiarity with SMART and does not know how to properly
interpret attributes. :-)

I will make it crystal clear (since this is a mailing list and I'm sure
someone will read this in the future): you cannot look at RAW_VALUE and
assume it is a raw integer/counter or gauge. SMART attributes and their
associated 6-byte data values are not defined per ATA standard. Thus,
each vendor can implement them or store the data in the RAW_VALUE
portion in any format they wish. Common vendors who do this are Seagate
and Hitachi, and apparently Corsair. The behaviour varies from vendor
to vendor, drive model to drive model, and firmware to firmware.
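
To make the point about RAW_VALUE concrete, here is a minimal Python
sketch that splits a few of the raw values quoted above into their 6
bytes. The function name split_raw and the choice of a 16-bit/32-bit
split are assumptions made purely for illustration; nothing here says
how Corsair actually lays out these fields.

```python
def split_raw(raw: int) -> dict:
    """Split a 48-bit SMART raw value into a few candidate views.

    The high16/low32 split is only one guess at a vendor layout;
    the real encoding is vendor-specific and may differ entirely.
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("SMART raw values are 6 bytes (48 bits)")
    return {
        "hex": raw.to_bytes(6, "big").hex(),  # all six bytes, big-endian
        "high16": raw >> 32,                  # upper two bytes
        "low32": raw & 0xFFFFFFFF,            # lower four bytes
    }


# Raw values copied from the attribute table above (IDs 1, 9 and 230).
for attr_id, raw in [(1, 897651373777), (9, 121242631799621), (230, 429496729700)]:
    print(attr_id, split_raw(raw))
```

For attribute 230 both halves come out as 100, and for attribute 9 the
low 32 bits come out as 837, which is at least plausible as an hour
count. Treat such readings as guesses until the vendor documents the
format.
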
Vendor-encoded values often appear very large or "look scary" to the
uneducated eye. smartmontools can decode some of these, but the drive
has to be in the smartmontools database (drivedb.h), **and** the code
has to be written in smartmontools to properly decode the data. Since
the attributes are proprietary, figuring out the format is virtually
impossible without help from the vendor. Some (most) vendors choose not
to disclose this information. In the case of some Seagate drives, the
smartmontools folks either got "tips" from someone within Seagate, or
somehow managed to figure out how to decode some (not all) on their
own.

You should probably start digging around on the Corsair forums, or
within any online documentation you can find from Corsair, to see if
they document what their SMART attributes are in their drives. For
example, Intel documents all of their SMART attributes in an official
PDF.

> SMART Error Log not supported

Well, that's disappointing. That means that any kind of LBA
(read/write) error inside of the drive will not be logged within the
drive itself. Thus, the only I/O errors or anomalies you'll be able to
verify are purely OS-level. Oh well, there isn't anything anyone can do
about this.

So let's recap the original OS errors you saw in FreeBSD:

> Jan 7 10:04:24 zfs kernel: ahcich3: Timeout on slot 27 port 0
> Jan 7 10:04:24 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 38000000 rs 38000000 tfd c0 serr 00000000 cmd 0004dd17
> Jan 7 10:04:56 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan 7 10:05:26 zfs kernel: ahcich3: Timeout on slot 29 port 0
> Jan 7 10:05:26 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
> Jan 7 10:05:57 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan 7 10:06:27 zfs kernel: ahcich3: Timeout on slot 29 port 0
> Jan 7 10:06:27 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
> Jan 7 10:06:27 zfs kernel: (ada2:ahcich3:0:0:0): lost device
> Jan 7 10:06:58 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan 7 10:07:28 zfs kernel: ahcich3: Timeout on slot 29 port 0
> Jan 7 10:07:28 zfs kernel: ahcich3: is 00000000 cs e0000000 ss e0000000 rs e0000000 tfd 80 serr 00000000 cmd 0004dd17
> Jan 7 10:08:16 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan 7 10:08:16 zfs kernel: ahcich3: Poll timeout on slot 31 port 0
> Jan 7 10:08:16 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
> Jan 7 10:08:46 zfs kernel: ahcich3: Timeout on slot 31 port 0
> Jan 7 10:08:46 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
> Jan 7 10:08:48 zfs kernel: (ada2:ahcich3:0:0:0): removing device entry
> Jan 7 10:09:33 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan 7 10:09:33 zfs kernel: ahcich3: Poll timeout on slot 31 port 0
> Jan 7 10:09:33 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17

What is shown here appears to be the SSD simply falling off the SATA
bus.

Do not get confused about the "slots" and how the numbers there change;
that has nothing to do with SATA ports or anything like that, it's an
internal AHCI protocol thing. (I believe FreeBSD can spread commands
across multiple slots for added benefit.)

Everything above indicates that after 30 seconds (well, 31 seconds
exactly, but I imagine it's 30 seconds plus 1 extra second due to how
the timeout loop might be written) the drive stopped responding to
commands at the AHCI protocol level.

This could be caused by a multitude of things, and it is very difficult
for me to diagnose any of them remotely:

- Power supply issues (voltage ripple, not enough amps on that port,
  shoddy or loose SATA power connector)
- SATA cable issues (cable too long, possibly some broken copper within
  the cable itself (very unlikely though), etc.)
- SATA port (physical) problems; dust in connectors, etc.
- SSD-level issues. There are so many possibilities here (more than on
  a mechanical HDD) that it's almost impossible to list them all:
  -- Internal garbage collection mechanism (this is different from
     TRIM) on the drive may be overly aggressive and stall all I/O to
     the drive during heavy GC. This would be classified as a firmware
     bug
  -- Power circuitry on PCB may be flaky
  -- Drive may have locked up hard due to other firmware bugs or some
     form of very low-level electrical/electronic error
  -- Internal SSD SATA + NAND flash I/O controller failure

For those considering the remote possibility of interoperability issues
between the Corsair SSD and the AHCI controller -- it's possible, but
highly unlikely. The controller itself is an Intel ICH9, which FreeBSD
has excellent support for and is very reliable. So, the controller here
is probably not at fault. I imagine if there were incompatibilities of
this sort (between ICH9 and Corsair), we'd have heard about it.
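
As an aside on the "slot" numbers and the cs/ss/rs/tfd fields in the
kernel messages quoted earlier: the Python sketch below assumes that
cs, ss and rs are 32-bit masks with one bit per AHCI command slot, and
that the low byte of tfd is the ATA status register (bit 7 = BSY,
bit 6 = DRDY). The per-slot reading of the masks is an assumption about
the driver's output, not something stated in this thread, and the
helper names slots and ata_status are made up for illustration.

```python
def slots(mask_hex: str) -> list:
    """Return the slot numbers whose bits are set in a 32-bit hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]


def ata_status(tfd_hex: str) -> list:
    """Name the BSY/DRDY bits set in the low (status) byte of tfd."""
    status = int(tfd_hex, 16) & 0xFF
    names = {0x80: "BSY", 0x40: "DRDY"}  # standard ATA status bits 7 and 6
    return [name for bit, name in names.items() if status & bit]


# Mask and tfd values copied from the log lines above.
for field in ("20000000", "38000000", "e0000000", "80000000"):
    print(field, "->", slots(field))
for tfd in ("c0", "80"):
    print("tfd", tfd, "->", ata_status(tfd))
```

Under those assumptions, tfd 80 means the device was still reporting
busy at each timeout, which is consistent with a drive that has stopped
responding.
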

I have seen many drives in my time (many means hundreds, no
exaggeration) "lock up" or fall off the bus, both on SCSI and SATA.
It's very difficult to troubleshoot these kinds of issues, as I said,
and doing so usually requires someone with extensive knowledge. General
"Tier 1" technical support at most companies does not have this level
of expertise, so don't expect that from Corsair.

I look forward to seeing the output from the three commands below, as
they may provide more insight into what actually transpired. Whether or
not Corsair chose to implement these in the General Purpose Log area of
SMART is unknown, however. Furthermore, they may implement them but put
them in a non-common place (e.g. different GPLog offsets); even so,
PLEASE DO NOT go tinkering around with -l gplog,0xXX values.

> > * smartctl -l devstat /dev/whatever
> > * smartctl -l sataphy /dev/whatever
> > * smartctl -l ssd /dev/whatever

Thanks.

-- 
| Jeremy Chadwick                                 jdc@parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |