Date: Fri, 2 Sep 2005 00:43:26 +0200 (CEST)
From: Philip Paeps <philip@FreeBSD.org>
To: FreeBSD-gnats-submit@FreeBSD.org
Cc: apeiron+usenet@coitusmentis.info
Subject: kern/85603: FS corruption and 'uncorrectable' DMA errors on ATA disks after unclean shutdown
Message-ID: <200509012243.j81MhQDY035598@fasolt.home.paeps.cx>
Resent-Message-ID: <200509012250.j81MoI8E096836@freefall.freebsd.org>

>Number:         85603
>Category:       kern
>Synopsis:       FS corruption and 'uncorrectable' DMA errors on ATA disks after unclean shutdown
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Sep 01 22:50:18 GMT 2005
>Closed-Date:
>Last-Modified:
>Originator:     Philip Paeps
>Release:        FreeBSD 7.0-CURRENT i386
>Organization:
>Environment:
System: FreeBSD fasolt.home.paeps.cx 7.0-CURRENT FreeBSD 7.0-CURRENT #39: Sun Aug 21 15:52:38 CEST 2005 philip@fasolt.home.paeps.cx:/usr/obj/usr/src/sys/FASOLT i386

>Description:
Recently, after a power failure, I experienced some inexplicable problems
with an ATA disk, which could quite possibly be due to hardware.  However,
after having experienced the same problems on a second disk, and discovering,
in a discussion on comp.unix.bsd.freebsd.misc, that others have seen the same
sort of issue, I've begun to suspect a kernel issue.

The first time I saw the problem, the machine initially came up fine, and I
could dirty-mount the filesystem and let bgfsck take care of things.  Soon
after the fsck began, the kernel started spewing out errors along the lines of
'uncorrectable' and 'dma_read'.  Unfortunately, I've not managed to reproduce
the problem with a loggable console.

After a reboot, the filesystem on the disk refused to mount again.  A manually
forced fsck complained about unreadable sectors.  Again, the kernel spewed out
the 'uncorrectable' and 'dma_read' errors.

According to SMART, the disk is quite healthy, though some errors were logged
in the log:

| Error 387 occurred at disk power-on lifetime: 5315 hours (221 days + 11 hours)
|   When the command that caused the error occurred, the device was in an unknown state.
|
|   After command completion occurred, registers were:
|   ER ST SC SN CL CH DH
|   -- -- -- -- -- -- --
|   40 51 10 80 00 00 e0  Error: UNC 16 sectors at LBA = 0x00000080 = 128
|
|   Commands leading to the command that caused the error were:
|   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
|   -- -- -- -- -- -- -- --  ----------------  --------------------
|   c8 00 10 80 00 00 e0 08      00:09:49.792  READ DMA
|   25 00 01 ff 87 bd 40 08      00:08:28.160  READ DMA EXT
|   c8 00 02 00 00 00 e0 08      00:08:28.160  READ DMA
|   c8 00 01 01 00 00 e0 08      00:08:28.160  READ DMA
|   c8 00 01 00 00 00 e0 08      00:08:28.160  READ DMA

Four other errors were logged, differing only in error number (decrementing by
one each time: 387, 386, 385) and LBA address (similarly decrementing).

The funny thing is, after newfsing the disk and restoring the data, all seems
to be working well and happy on the disk.  The first disk I had this problem
with has now been under medium-heavy use again for over a month; the second
disk (see below) has been in use again for two weeks.

In the case of the second disk, the machine panicked shortly after starting the
bgfsck.  Unfortunately, I wasn't able to capture the panic.  Following the
panic, the machine refused to boot, with an LBA error 16 in the boot loader.
Trying to mount the filesystems on another machine, read-only, produced the
same 'uncorrectable' and 'dma_read' errors as seen on the first disk with the
problem.  Forcing fsck also caused the same errors as before.

Possibly an unrelated issue: ls on some directories on the dirty-mounted
read-only filesystem sometimes worked, but cp'ing the files somewhere else
panicked the kernel.

Again with the second disk, newfs and restoring data made all work happily
again.  Not a trace of any dma_read errors or uncorrectable reads.

I realize there's not much hard debugging information here, but maybe this
makes sense to a filesystem or ata guru.

I experienced the problems on 5.x -STABLE kernels from late May, and -CURRENT
kernels from the middle of June and July.  I've not seen problems since, but
then, I've not had any power failures either.

I'm happy to help debug this further, if indeed it's a software bug, and not
something with flaky hardware.

Cc: Christopher Nehren, who reported similar issues on Usenet and suggested a
PR be filed.  He might be able to add more useful information.

For what it's worth, the disks were Maxtor 6Y200P0 and Maxtor 6E040L0 on a
VIA 8235 UDMA133 controller and a VIA 8231 UDMA100 controller in my case.

>How-To-Repeat:
Lose power or panic the machine with a filesystem on an ATA disk and wait for
phase of moon and other elements of faith to be properly aligned.  I have been
able to reproduce the problem (and the 'working well after newfs') three times
by accident, never yet by force.

>Fix:
Hopefully! :-)

>Release-Note:
>Audit-Trail:
>Unformatted:
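
For anyone comparing their own SMART error logs against the entry quoted in the description, here is a minimal sketch of how the register line "40 51 10 80 00 00 e0" can be decoded by hand. It assumes the standard 28-bit LBA task-file layout (sector number, cylinder low, cylinder high, and the low nibble of the device/head register), which fits the READ DMA (0xc8) command in the log; the 48-bit READ DMA EXT entry is not handled. The function names and the returned dictionary are purely illustrative and are not part of smartctl or of the FreeBSD ata driver.

```python
# Hypothetical helper for reading a SMART error-log register line by hand.
# Assumes the ATA LBA28 layout; names below are illustrative only.

# Error register (ER) bits, a subset of those defined by the ATA spec.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Status register (ST) bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF",
               0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}


def decode_lba28(sn, cl, ch, dh):
    """Combine SN, CL, CH and the low 4 bits of DH into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def flags(value, table):
    """Return the mnemonics of all bits in `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_registers(line):
    """Decode a line of seven hex bytes in the order ER ST SC SN CL CH DH."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split())
    return {
        "error_flags": flags(er, ERROR_BITS),
        "status_flags": flags(st, STATUS_BITS),
        # In LBA28 commands a sector count of 0 means 256 sectors.
        "sectors": sc if sc != 0 else 256,
        "lba": decode_lba28(sn, cl, ch, dh),
    }


if __name__ == "__main__":
    info = decode_registers("40 51 10 80 00 00 e0")
    print(info)
```

Run on the line from the log, this gives an UNC (uncorrectable data) error with the ERR status bit set, a sector count of 16, and LBA 128, which matches the "Error: UNC 16 sectors at LBA = 0x00000080 = 128" summary in the log.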