Date: Sun, 9 Jan 2011 08:30:27 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Tom Vijlbrief <tom.vijlbrief@xs4all.nl>
Cc: freebsd-stable@freebsd.org
Subject: Re: Panic 8.2 PRERELEASE WRITE_DMA48
Message-ID: <20110109163027.GA42562@icarus.home.lan>
In-Reply-To: <AANLkTin3FHcsdMtA9OYaA2wrUx%2BfpyEsTThdRmS8sXA5@mail.gmail.com>
References: <AANLkTi=iaq1Lx521oUF2BSB4-2wi9Ys2fTLzz4kLaLVo@mail.gmail.com> <20110109122243.GA37530@icarus.home.lan> <AANLkTin3FHcsdMtA9OYaA2wrUx%2BfpyEsTThdRmS8sXA5@mail.gmail.com>
On Sun, Jan 09, 2011 at 04:41:43PM +0100, Tom Vijlbrief wrote:
> I've run many fscks on /usr in single user because I had soft update
> inconsistencies,
> no DMA errors during those repairs.

There's no 1:1 ratio between running fsck on a filesystem and seeing a
DMA error. I should explain what I mean by that: just because you
receive a read or write error from a disk during operation doesn't mean
fsck will induce it. fsck simply checks filesystem tables and so on for
integrity; it doesn't do the equivalent of a bad block scan, nor does it
check (read) every data block referenced by an inode. So if you have a
filesystem which has a bad block somewhere within a data block, fsck
almost certainly won't catch this. ZFS, on the other hand (specifically
a "zpool scrub"), would/should induce such an error.

The reason I advocated booting into single-user and running a fsck
manually is because there's confirmation that background fsck doesn't
catch/handle all filesystem consistency errors that a foreground fsck
does. This is why I continue to advocate background_fsck="no" in
rc.conf(5). That's for another discussion though.

Let's review the disk:

> === START OF INFORMATION SECTION ===
> Model Family:     SAMSUNG SpinPoint F1 DT series
> Device Model:     SAMSUNG HD103UJ
> Serial Number:    S13PJ9BQC02902
> Firmware Version: 1AA01113
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Sun Jan 9 16:40:24 2011 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> ...
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   078   078   011    Pre-fail  Always       -       7580
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       399
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10097
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2375
>  10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       392
>  13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
> 184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   057   052   000    Old_age   Always       -       43 (Min/Max 42/45)
> 194 Temperature_Celsius     0x0022   056   050   000    Old_age   Always       -       44 (Min/Max 42/46)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       20728126
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       1
> 200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

Your drive looks fine. Attribute 195 isn't anything to worry about
(vendor-specific encoding makes this number appear large).
Attribute 199 indicates one CRC error, which again is nothing to worry
about, though it could explain a single error during the lifetime of the
drive (it's impossible to determine when it happened).

> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%      2361         -
> # 2  Short offline       Completed without error       00%      2205         -
> # 3  Short offline       Completed without error       00%      2138         -
> # 4  Extended offline    Completed without error       00%      2109         -
> # 5  Short offline       Completed without error       00%      2105         -
> # 6  Short offline       Completed without error       00%      2092         -
> # 7  Short offline       Completed without error       00%      2083         -
> # 8  Short offline       Completed without error       00%      2057         -
> # 9  Extended offline    Completed without error       00%      2037         -
> #10  Short offline       Completed without error       00%      2033         -
> #11  Short offline       Completed without error       00%      2009         -
> #12  Short offline       Completed without error       00%      1974         -
> #13  Short offline       Completed without error       00%      1941         -
> #14  Extended offline    Completed without error       00%      1920         -
> #15  Short offline       Completed without error       00%      1916         -
> #16  Short offline       Completed without error       00%      1868         -
> #17  Short offline       Completed without error       00%      1810         -
> #18  Short offline       Completed without error       00%      1655         -
> #19  Short offline       Completed without error       00%      1638         -
> #20  Extended offline    Completed without error       00%      1596         -
> #21  Short offline       Completed without error       00%      1591         -

Not to get off topic, but what is causing this? It looks like you have
a cron job or something very aggressive doing a "smartctl -t short
/dev/ad4" or equivalent. If you have such, please disable this
immediately. You shouldn't be doing SMART tests with such regularity;
it accomplishes absolutely nothing, especially the "short" tests. Let
the drive operate normally, or run smartd and watch the logs instead.

If you want to scan the disk for bad blocks, you need to do a selective
LBA test.
Your drive does support selective scanning, as shown here:

> Offline data collection
> capabilities:
> ...
> Selective Self-test supported.

You can do this with "smartctl -t select,0-max /dev/ad4", and safely
while the drive is in operation. You can check the status of the scan
(assuming the Samsung supports it) by using "smartctl -c /dev/ad4" and
looking at the percentage of completion.

However, I would expect that if your drive had bad blocks, or even
blocks which the drive considered suspect, Attributes 196 and 197 would
be non-zero. I'm more familiar with Western Digital and Seagate disks
though.

> dmesg was in the attachment of the original mail but I'll paste it here:

I apologise, I missed that -- sometimes the mailing list software
removes attachments, so I've grown accustomed to not looking for them.
My bad.

> atapci0: <SiI 3512 SATA150 controller> port 0xb400-0xb407,0xb000-0xb003,0xa800-0xa807,0xa400-0xa403,0xa000-0xa00f mem 0xf0800000-0xf08001ff irq 23 at device 11.0 on pci2
> atapci0: [ITHREAD]
> ata2: <ATA channel 0> on atapci0
> ata2: [ITHREAD]
> ata3: <ATA channel 1> on atapci0
> ata3: [ITHREAD]
> ad4: 953869MB <SAMSUNG HD103UJ 1AA01113> at ata2-master UDMA100 SATA 1.5Gb/s

Using that information and circling back to the original error:

> unknown: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=274799820
> ata2: timeout waiting to issue command
> ata2: error issuing WRITE_DMA48 command
> g_vfs_done():ad4s2f[WRITE(offset=28915105792, length=131072)]error = 6
> /usr: got error 6 while accessing filesystem
> panic: softdep_deallocate_dependencies: unrecovered I/O error
> cpuid = 0
> KDB: stack backtrace:

errno 6 is "device not configured". ad4 is on a Silicon Image
controller (thankfully a reliable model). Sadly AHCI (ahci.ko) isn't in
use here; I would advocate switching to it (your device names will
change however) and seeing if these errors continue (they'll appear as
SCSI CAM errors though).
ahci_load="yes" in /boot/loader.conf should be enough. smartmontools
does know to talk ATA to /dev/adaX (that's not a typo) disks.

Am I advocating use of ahci.ko as a workaround for the problem? Sort
of. I know that Alexander Motin has a lot of good experience with the
Silicon Image controllers and would also advocate use of AHCI when one
has such. Possibly what you're seeing is a bug or quirk of some kind in
the ata(4) driver. These kinds of quirks ("I got an error but the disk
itself looks fine") have concerned me on FreeBSD for many, many years
now. I would recommend using ahci.ko first, then doing the selective
scan only if more errors continue/show up after the fact.

So in summary, at this point your drive looks fine, but I'd feel better
after a selective scan had a chance to run.

Purely speculative: there's always the possibility the Samsung disks do
something similar to what IBM ATA drives circa 1999-2000 did: a feature
called "ADM" (Automatic Drive Maintenance), where the drive would
literally drop to standby mode to perform whatever. If it received an
ATA command from the controller while doing this, it would spin back up
and respond to the command. The whole down/up process took so long that
FreeBSD reported the issue as a timeout, as well as a DMA error if it
was trying to do a read/write operation. You could literally hear the
drive powering down then going "thunk" and powering back up when it
received an ATA command. I mailed IBM about this and they confirmed it.
The feature also existed on SCSI drives (and still does, I think), but
is disabled by default. Here's relevant reading material:

http://jdc.parodius.com/freebsd/ibm_email_aware_of_adm.txt
http://www.mail-archive.com/freebsd-current@freebsd.org/msg07222.html

The ATA drives that came out in 2001 and beyond had this feature
*completely removed*, so it's pretty obvious it was causing problems,
probably as more people started using the drives in servers vs.
standard Windows desktops (well-known for hiding such I/O conditions).
I imagine if Samsung drives did this we'd be seeing a lot more reports
about it here on the lists.

I'd pay close attention to the timestamps on the timeouts.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
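
For anyone wanting to automate the check mentioned above (non-zero raw
values for Attributes 196 and 197 would suggest bad or suspect blocks),
here's a minimal sketch in Python. It assumes the attribute table has
the same text layout as the smartctl output quoted in this mail; the
function name suspect_attributes and the regex are purely illustrative
and are not part of smartmontools.

```python
import re

# Attribute IDs singled out in the message: 196 (Reallocated_Event_Count)
# and 197 (Current_Pending_Sector). Extend the tuple if you watch others.
WATCHED_IDS = (196, 197)

# One row of the attribute table, assuming the column layout quoted above:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Two rows copied from the quoted output, used as sample input.
SAMPLE = """\
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def suspect_attributes(smartctl_output, watched=WATCHED_IDS):
    """Return {id: (name, raw)} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smartctl_output.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header, blank line, or anything that isn't a table row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(3))
        if attr_id in watched and raw != 0:
            found[attr_id] = (name, raw)
    return found


if __name__ == "__main__":
    # An empty dict means neither watched attribute reports a problem.
    print(suspect_attributes(SAMPLE))
```

In practice you would feed it the text printed by "smartctl -A /dev/ad4"
rather than the sample rows. Matching on the numeric ID instead of the
name is deliberate, since attribute names can differ between vendors
and smartmontools versions.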