Date: Tue, 28 Oct 2008 21:33:14 -0700
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Carl <k0802647@telus.net>
Cc: Wojciech Puchar <wojtek@wojtek.tensor.gdynia.pl>, freebsd-questions@freebsd.org
Subject: Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"
Message-ID: <20081029043314.GA66773@icarus.home.lan>
In-Reply-To: <4907DB6B.8090000@telus.net>
References: <49067148.6080307@telus.net> <20081028024143.GA37131@icarus.home.lan> <20081028120407.G3326@wojtek.tensor.gdynia.pl> <20081028122013.GA49298@icarus.home.lan> <4907DB6B.8090000@telus.net>

On Tue, Oct 28, 2008 at 08:41:31PM -0700, Carl wrote:
> Jeremy Chadwick said:
>>> ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR>
>>> error=40<UNCORRECTABLE> LBA=134802751
>>
>> Are you sure you don't have a bad hard disk? This looks to be like a
>> classic block/sector failure.
>
> I hadn't realized that a bad block would manifest itself with a message
> about DMA. Seems like such semantics would be a little obscure to most
> users, apparently including me.

Do not let the term "DMA" confuse you -- the operation was a read, and
DMA is simply how the data gets transferred between the disk, the
controller, and local memory. You might see things like "READ_DMA48" and
"WRITE_DMA48", which just indicate that 48-bit LBA addressing mode was
in use when the operation was attempted.

For sake of comparison, you should see what Linux and Solaris do. For
example, when a disk falls off the bus (silently) on a Linux machine
using ext3fs, all I've ever seen is continual spewing of "ext3fs journal
errors" on the console -- absolutely no indication that the disk itself
has actually fallen off the bus.

With SCSI disks under Solaris, the level of detail you get is perfect --
it's very easy to determine what happened. But in the case of ATA disks,
you get more or less something that looks similar to FreeBSD.

If you have complaints about the formatting of the output, I would
recommend filing a PR for it, or bringing it up with Soren Schmidt
(sos@freebsd.org), author of the ata(4) layer. I will agree with you
that some more coherent error messages would be useful.

>> So you're saying that the *exact* same READ_DMA error, at the *exact*
>> same LBA, is reported on ad4? If so, that's very bizarre.
>
> No, perhaps I wasn't clear enough. Both instances were on ad6, so far.

Then that makes ad6, or something specific to ad6, the culprit.

>> Can you please provide the output from the following commands?
>
> See end of message. Let me know if you then want more (in- or out-of-band).
>
> Having now installed smartmontools, you can see below that I ran it for
> both ad4 and ad6. Sure enough, ad6 has logged 2 READ DMA errors - does
> that make this a definitive bad disk then?

I'll have to look at the output. See below.

> Should I not be worried about ad4 too? Those Raw_Read_Error_Rate and
> Seek_Error_Rate numbers should be zero or very close to it, shouldn't
> they? I don't know how to interpret what I'm seeing in that output, so
> I'd appreciate any insight. Should I be returning both disks for
> warranty claims (they're both very recently purchased)?

As you've admitted, the problem is that most people don't know how to
interpret SMART data, and start "freaking out" over things which are
normal.

People focus on the RAW values, which for many attributes are the wrong
thing to look at. For example, on Seagate disks, an insanely high
Raw_Read_Error_Rate and Seek_Error_Rate means absolutely nothing; it's
normal. But with another vendor, it might actually be accurate.

Welcome to one of the problems with SMART: the specification does not
state what format the raw data must be in. Seagate chooses to encode
some raw data for some SMART attributes in a custom format. The format
is not publicly documented. This is why you have to go off of the
adjusted values shown in VALUE/WORST/THRESH.

"How am I supposed to know all of this?!" You aren't -- it comes with
experience.

> Is there anything I should know about this model of hard disk with
> regards to being known for problems? Also, is there a good test I can
> perform to hopefully flush out any problems before I put this thing into
> service?

I'm confused: what gives you the impression there's a problem with *this
model* of hard disk? I've seen no evidence presented that indicates
such. What makes you ask that question?

None of us here work at Seagate, so even if there was a known problem
with this specific model of disk, we wouldn't know.
For all we know, there could be little 3mm tall terrorists dancing on
the platters, ready to leap out at any moment and stab us! :-)

Please keep something in mind: just because you have brand new hard
disks *does not* guarantee they're free of errors. I have seen hundreds
of "brand new" hard disks fail right out of the box, including SCSI
disks (which people, for some reason, think are "less likely to have
this problem" simply because they cost more money). I deal with this
situation on a daily basis at work, believe it or not.

> # vmstat -i

Interrupts look fine; I was looking for anything that might indicate an
absurdly high rate.

atacontrol cap output looks fine too, nothing weird or out of the
ordinary (I wasn't expecting anything to show up here, but I did want to
get an idea if the disks were truly SATA300 or not).

Let's take a look at the SMART data.

> # smartctl -a /dev/ad4
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       158643744
>   3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       108
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail  Always       -       2921473
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       499
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       108
> 184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       65540
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   071   069   045    Old_age   Always       -       29 (Lifetime Min/Max 23/31)
> 194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 20 0 0)
> 195 Hardware_ECC_Recovered  0x001a   039   019   000    Old_age   Always       -       158643744
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

All of the attributes here look good.

To get an update on Attribute 198, you'd need to run a short offline
test ("smartctl -t short /dev/ad4"). You can safely do this while the
disk is in use; don't let the word "offline" make you think the disk
disappears. You can watch the status using smartctl -a, and once it's
finished, you can compare the old value to the new. I'm willing to bet
it remains zero.

The temperature also looks good (29C).

Additionally, the SMART error log for this disk looks fine; no signs of
errors. I would say ad4 is in perfect shape.

> # smartctl -a /dev/ad6
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   116   100   006    Pre-fail  Always       -       106947042
>   3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       108
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2
>   7 Seek_Error_Rate         0x000f   061   060   030    Pre-fail  Always       -       1376532
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       499
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       108
> 184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
> 188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   071   069   045    Old_age   Always       -       29 (Lifetime Min/Max 23/31)
> 194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 19 0 0)
> 195 Hardware_ECC_Recovered  0x001a   038   018   000    Old_age   Always       -       106947042
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

And here we see the core of the problem. :-)

Attribute 5 shows the disk has reallocated two sectors (meaning, it
detected two sectors were bad, and reallocated them). This is hard
evidence of bad blocks on the disk.

Attribute 10 indicates that there was one incident of the disk failing
to spin up properly, so the drive had to re-initiate spin-up. Why/how
this happened is unknown, but at least it's not a huge number. One
incident is probably nothing to worry about.

I'm not completely sure what Attribute 187 represents, but it very
likely is directly related to Attribute 5.

Attributes 197 and 198 indicate a bigger problem: the two bad sectors
described earlier **have not** been corrected or remapped. This is bad.

I'll explain a bit how SATA disks deal with bad sectors.

First and foremost, straight out of the factory there's a pre-defined
list of physically bad sectors on the disk. These sectors are never
accessed by the drive, and the manufacturer (Seagate) is the one who
creates that list. It's 100% normal; SCSI disks have the same thing
(physical defect list vs. grown defect list).

SATA disks also have a certain amount of pre-allocated "spare sectors"
that the disk can use for transparent remapping. When I say transparent,
I mean the OS never gets told of what's going on behind the scenes. Say
the drive attempts to write some data, and the firmware on the drive
notices that one of the sectors has a problem. The drive will, unknown
to the OS, say "okay, let's not use that one, mark it bad, and instead
use one from the spare pool". But there are only so many spares...

As far as I know, SMART **does not** log transparent sector remaps.

When the OS starts seeing errors due to bad sectors, it means the
pre-allocated "spare sector" pool has been exhausted. SMART also
reflects this condition.

What you see above is a classic example of a hard disk with a growing
number of bad sectors.
There are *definitely* other bad sectors on the disk which the drive has
remapped on its own, but things are getting worse.

As for the SMART error log -- what you see there is a direct result of
the two bad sectors. Remember, block != sector, which is why you see two
error entries for the same LBA (there are probably two sectors next to
one another which make up part of the block).

Advice is simple: replace this hard disk. I highly recommend you do an
"advanced replacement" RMA, assuming Seagate offers it, where the
manufacturer sends you a new/refurbished drive first. They'll need a
credit card number (if you don't ship them the bad disk within 30 days,
they charge you $$).

Hope this helps.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
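
The way the SMART tables are read in this message (trust the normalized VALUE against THRESH, and look at the raw number only where it is a plain count) can be sketched in a few lines of Python. This is a rough illustration, not part of smartmontools: the function name `check_attributes`, the watch list `RAW_COUNT_IDS`, and the choice of attributes 5, 197 and 198 as "plain sector counts" are assumptions made for the example, and the sample rows are copied from the ad6 output quoted in the message.

```python
# Sketch: flag worrying rows in a `smartctl -A`-style attribute table.
# The watch list below is an assumption for illustration, not an official
# rule set; vendors encode raw values differently.

SAMPLE_AD6 = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   100   006    Pre-fail  Always       -       106947042
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000f   061   060   030    Pre-fail  Always       -       1376532
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 19 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Attributes whose raw value is assumed to be a simple sector count.
RAW_COUNT_IDS = {5, 197, 198}


def check_attributes(text):
    """Return a list of (attribute_id, name, reason) for suspicious rows."""
    findings = []
    for line in text.splitlines():
        # Ten columns; the last one (RAW_VALUE) may contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip the header and anything that is not a data row
        attr_id, name = int(fields[0]), fields[1]
        value, thresh = int(fields[3]), int(fields[5])
        raw = fields[9].split()[0]
        # Normalized check: a value at or below a non-zero threshold
        # means the attribute has failed.
        if thresh > 0 and value <= thresh:
            findings.append(
                (attr_id, name, "VALUE %d <= THRESH %d" % (value, thresh)))
        # Raw check, only for attributes on the watch list.
        if attr_id in RAW_COUNT_IDS and raw.isdigit() and int(raw) > 0:
            findings.append((attr_id, name, "raw count %s" % raw))
    return findings


if __name__ == "__main__":
    for attr_id, name, reason in check_attributes(SAMPLE_AD6):
        print("%3d %-24s %s" % (attr_id, name, reason))
```

Note that the huge raw numbers for Raw_Read_Error_Rate and Seek_Error_Rate are deliberately ignored here, since their normalized values sit well above their thresholds; only the sector-count attributes are flagged for this disk.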