From owner-freebsd-fs@FreeBSD.ORG  Sun Sep  9 17:27:50 2007
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Message-ID: <46E42D14.5060605@FreeBSD.org>
Date: Sun, 09 Sep 2007 19:27:48 +0200
From: Kris Kennaway <kris@FreeBSD.org>
User-Agent: Thunderbird 2.0.0.6 (Macintosh/20070728)
MIME-Version: 1.0
To: Johannes Totz <jo_t@gmx.net>
References: <46E4225F.1020806@gmx.net>
In-Reply-To: <46E4225F.1020806@gmx.net>
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org
Subject: Re: UFS not handling errors correctly

Johannes Totz wrote:
> Hi!
> 
> Seems like UFS does not handle disk/write errors properly: it causes silent
> corruption, which later causes a panic during snapshot creation.
> 
>> #uname -a
>> FreeBSD alfred 6.2-STABLE FreeBSD 6.2-STABLE #0: Thu Jul 12 20:40:55 CEST 2007     root@alfred:/usr/obj/usr/src/sys/ALFRED  i386
> 
> One day a write error on one of my disks happened:
> 
>> Aug 22 05:24:39 alfred kernel: ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=469004995
>> Aug 22 05:24:40 alfred kernel: ad0: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=469004995
>> Aug 22 05:24:40 alfred kernel: g_vfs_done():ufs/home[READ(offset=240130525184, length=2048)]error = 5
>> Aug 22 05:25:08 alfred kernel: ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=490974155
>> Aug 22 05:25:08 alfred kernel: ad0: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=490974155
>> Aug 22 05:25:08 alfred kernel: g_vfs_done():ufs/home[READ(offset=251378735104, length=2048)]error = 5
> 
> This has never happened before and did not happen again (yet). A long
> test with smartctl reports "all fine". So let's attribute that to a
> cosmic ray (or neutrino, pick your favorite) hitting the controller.
> 
> The system continued to run fine afterwards.
> But the next morning, during automatic snapshot creation, it panicked with:
> 
>> Aug 23 06:38:14 alfred kernel: ffs_snapshot_mount: old format snapshot inode 8
>> Aug 23 06:38:14 alfred savecore: reboot after panic: snapacct_ufs2: bad block
> 
> So of course it restarted. And tried to do a background fsck. And failed
> again... and again... and again...
> 
>> Aug 23 07:08:15 alfred kernel: ffs_snapshot_mount: old format snapshot inode 4
>> Aug 23 07:08:15 alfred savecore: reboot after panic: snapacct_ufs2: bad block
> 
> The reported inode varies, but "bad block" is always the same.
> This went on about 10 times until I had a chance to interrupt it (i.e.
> woke from slumber) and boot into single-user mode.
> Multiple runs of fsck fixed the problem. I deleted all old snapshot files
> and the system is fine. No further problems. Maybe some files got lost;
> I can't tell, there are a few million on it.
> 
> Also see:
> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/114676
> 
> Unfortunately I don't have time to dig into this. But I wanted to report
> it. Maybe someone already fixed it...

Background fsck (bg fsck) cannot fix arbitrary filesystem corruption, nor is it intended to.

Kris