From owner-freebsd-bugs@FreeBSD.ORG Thu Sep 1 22:51:01 2005
Return-Path:
X-Original-To: freebsd-bugs@hub.freebsd.org
Delivered-To: freebsd-bugs@hub.freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id 5DDDB16A425
    for ; Thu, 1 Sep 2005 22:51:01 +0000 (GMT)
    (envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org [216.136.204.21])
    by mx1.FreeBSD.org (Postfix) with ESMTP id D4EFF43D8E
    for ; Thu, 1 Sep 2005 22:50:19 +0000 (GMT)
    (envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1])
    by freefall.freebsd.org (8.13.3/8.13.3) with ESMTP id j81MoIGH096837
    for ; Thu, 1 Sep 2005 22:50:18 GMT
    (envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
    by freefall.freebsd.org (8.13.3/8.13.1/Submit) id j81MoI8E096836;
    Thu, 1 Sep 2005 22:50:18 GMT
    (envelope-from gnats)
Resent-Date: Thu, 1 Sep 2005 22:50:18 GMT
Resent-Message-Id: <200509012250.j81MoI8E096836@freefall.freebsd.org>
Resent-From: FreeBSD-gnats-submit@FreeBSD.org (GNATS Filer)
Resent-To: freebsd-bugs@FreeBSD.org
Resent-Reply-To: FreeBSD-gnats-submit@FreeBSD.org, Philip Paeps
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id 5809616A420
    for ; Thu, 1 Sep 2005 22:43:31 +0000 (GMT)
    (envelope-from philip@paeps.cx)
Received: from gateway.nixsys.be (gateway.nixsys.be [195.144.77.33])
    by mx1.FreeBSD.org (Postfix) with ESMTP id 37E0F43D5D
    for ; Thu, 1 Sep 2005 22:43:30 +0000 (GMT)
    (envelope-from philip@paeps.cx)
Received: from wotan.home.paeps.cx (wotan.home.paeps.cx [IPv6:2001:6f8:32f:10:a00:20ff:fe9b:138c])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (Client CN "wotan.home.paeps.cx", Issuer "NixSys CA" (verified OK))
    by gateway.nixsys.be (Postfix) with ESMTP id 6A47AC131;
    Fri, 2 Sep 2005 00:43:28 +0200 (CEST)
Received: from fasolt.home.paeps.cx (unknown [IPv6:2001:6f8:32f:10:20a:e6ff:fe7d:c08])
    by wotan.home.paeps.cx (Postfix) with ESMTP id 58C8961D3;
    Fri, 2 Sep 2005 00:43:27 +0200 (CEST)
Received: from fasolt.home.paeps.cx (philip@localhost [127.0.0.1])
    by fasolt.home.paeps.cx (8.13.4/8.13.4) with ESMTP id j81MhQJ4035599;
    Fri, 2 Sep 2005 00:43:26 +0200 (CEST)
    (envelope-from philip@fasolt.home.paeps.cx)
Received: (from philip@localhost)
    by fasolt.home.paeps.cx (8.13.4/8.13.4/Submit) id j81MhQDY035598;
    Fri, 2 Sep 2005 00:43:26 +0200 (CEST)
    (envelope-from philip)
Message-Id: <200509012243.j81MhQDY035598@fasolt.home.paeps.cx>
Date: Fri, 2 Sep 2005 00:43:26 +0200 (CEST)
From: Philip Paeps
To: FreeBSD-gnats-submit@FreeBSD.org
X-Send-Pr-Version: 3.113
Cc: apeiron+usenet@coitusmentis.info
Subject: kern/85603: FS corruption and 'uncorrectable' DMA errors on ATA disks after unclean shutdown
X-BeenThere: freebsd-bugs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Bug reports
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 01 Sep 2005 22:51:01 -0000

>Number:         85603
>Category:       kern
>Synopsis:       FS corruption and 'uncorrectable' DMA errors on ATA disks after unclean shutdown
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Sep 01 22:50:18 GMT 2005
>Closed-Date:
>Last-Modified:
>Originator:     Philip Paeps
>Release:        FreeBSD 7.0-CURRENT i386
>Organization:
>Environment:
System: FreeBSD fasolt.home.paeps.cx 7.0-CURRENT FreeBSD 7.0-CURRENT #39: Sun Aug 21 15:52:38 CEST 2005 philip@fasolt.home.paeps.cx:/usr/obj/usr/src/sys/FASOLT i386

>Description:
Recently, after a power failure, I experienced some inexplicable problems with
an ATA disk, which could quite possibly be due to hardware.
However, after having experienced the same problems on a second disk, and
discovering, in a discussion on comp.unix.bsd.freebsd.misc, that others have
seen the same sort of issue, I've begun to suspect a kernel issue.

The first time I saw the problem, the machine initially came up fine, and I
could dirty-mount the filesystem and let bgfsck take care of things. Soon
after the fsck began, the kernel started spewing out errors along the lines of
'uncorrectable' and 'dma_read'. Unfortunately, I've not managed to reproduce
the problem with a loggable console.

After a reboot, the filesystem on the disk refused to mount again. Manually
forcing an fsck produced complaints about unreadable sectors. Again, the
kernel spewed out the 'uncorrectable' and 'dma_read' errors.

According to SMART, the disk is quite healthy, though some errors were
recorded in the error log:

| Error 387 occurred at disk power-on lifetime: 5315 hours (221 days + 11 hours)
|   When the command that caused the error occurred, the device was in an unknown state.
|
|   After command completion occurred, registers were:
|   ER ST SC SN CL CH DH
|   -- -- -- -- -- -- --
|   40 51 10 80 00 00 e0  Error: UNC 16 sectors at LBA = 0x00000080 = 128
|
|   Commands leading to the command that caused the error were:
|   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
|   -- -- -- -- -- -- -- --  ----------------  --------------------
|   c8 00 10 80 00 00 e0 08      00:09:49.792  READ DMA
|   25 00 01 ff 87 bd 40 08      00:08:28.160  READ DMA EXT
|   c8 00 02 00 00 00 e0 08      00:08:28.160  READ DMA
|   c8 00 01 01 00 00 e0 08      00:08:28.160  READ DMA
|   c8 00 01 00 00 00 e0 08      00:08:28.160  READ DMA

Four other errors were logged, differing only in error number (decrementing by
one each time: 387, 386, 385) and LBA address (similarly decrementing).

The funny thing is, after newfsing the disk and restoring the data, all seems
to be working well and happy on the disk.
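
For anyone reading the SMART output above, the register line can be decoded by
hand. What follows is only a minimal Python sketch, not something taken from
the original report: the function name decode_error_registers and the two flag
tables are illustrative choices, the tables list only a subset of the ATA
error and status bits, and 28-bit LBA addressing is assumed (which matches the
READ DMA, 0xc8, command shown in the log).

```python
# Illustrative subset of ATA error (ER) and status (ST) register bits.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x01: "ERR"}


def decode_error_registers(line):
    """Decode 'ER ST SC SN CL CH DH' hex bytes, assuming 28-bit LBA."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split())
    if not dh & 0x40:
        # Bit 6 of the device/head register selects LBA addressing.
        raise ValueError("DH bit 6 clear: CHS addressing, not LBA")
    # LBA28: low nibble of DH holds bits 24-27, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "error": [name for bit, name in ERROR_BITS.items() if er & bit],
        "status": [name for bit, name in STATUS_BITS.items() if st & bit],
        "sector_count": sc,
        "lba": lba,
    }


if __name__ == "__main__":
    # Register line copied from the SMART error log quoted above.
    print(decode_error_registers("40 51 10 80 00 00 e0"))
```

For the logged line this yields the UNC error flag, a sector count of 16 and
LBA 128, which agrees with the summary smartctl printed next to the registers.
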
The first disk I had this problem with has now been under medium-heavy use
again for over a month; the second disk (see below) has been in use again for
two weeks.

In the case of the second disk, the machine panicked shortly after starting
the bgfsck; unfortunately, I wasn't able to capture the panic. Following the
panic, the machine refused to boot with an LBA error 16 in the boot loader.
Trying to mount the filesystems on another machine, read-only, produced the
same 'uncorrectable' and 'dma_read' errors as seen on the first disk with the
problem. Forcing fsck also caused the same errors as before.

Possibly an unrelated issue: ls on some directories on the dirty-mounted ro
filesystem sometimes worked, but cp'ing the files to somewhere else panicked
the kernel.

Again with the second disk, newfs and restoring data made all work happily
again. Not a trace of any dma_read errors or uncorrectable reads.

I realize there's not much hard debugging information here, but maybe this
makes sense to a filesystem or ata guru. I experienced the problems on 5.x
-STABLE kernels from late May, and -CURRENT kernels from the middle of June
and July. I've not seen problems since, but then, I've not had any power
failures either.

I'm happy to help debug this further, if indeed it's a software bug, and not
something with flaky hardware.

Cc: Christopher Nehren, who reported similar issues on Usenet and suggested a
PR be filed. He might be able to add more useful information.

For what it's worth, the disks were Maxtor 6Y200P0 and Maxtor 6E040L0 on a VIA
8235 UDMA133 controller and a VIA 8231 UDMA100 controller in my case.

>How-To-Repeat:
Lose power or panic the machine with a filesystem on an ATA disk and wait for
phase of moon and other elements of faith to be properly aligned. I have been
able to reproduce the problem (and the 'working well after newfs') three times
by accident, never yet by force.

>Fix:
Hopefully! :-)

>Release-Note:
>Audit-Trail:
>Unformatted: