From owner-freebsd-fs@FreeBSD.ORG  Sun Sep  9 22:11:46 2007
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Date: Mon, 10 Sep 2007 00:11:42 +0200
From: Peter Schuller <peter.schuller@infidyne.com>
To: Kris Kennaway <kris@FreeBSD.org>
Message-ID: <20070909221142.GA6435@hyperion.scode.org>
References: <46E4225F.1020806@gmx.net> <46E42D14.5060605@FreeBSD.org>
	<20070909200933.GA98161@hyperion.scode.org>
	<46E45E54.6040207@FreeBSD.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="xHFwDpU9dbj6ez1V"
Content-Disposition: inline
In-Reply-To: <46E45E54.6040207@FreeBSD.org>
User-Agent: Mutt/1.5.16 (2007-06-09)
Cc: freebsd-fs@freebsd.org, Johannes Totz <jo_t@gmx.net>
Subject: Re: UFS not handling errors correctly


--xHFwDpU9dbj6ez1V
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

> Soft updates isn't journalling, so you can't "roll back" an error.  It
> works by maintaining knowledge of the on-disk state of data and ensuring
> that it only writes to disk in a suitable order so that the on-disk state
> is supposed to remain consistent.

I am aware of this; I was speaking generally, with the least
"committal" solution being to just panic. The point I was trying to
make was that as long as errors are traditional and simple, as in a
read of a particular sector failing or a write to a sector failing,
aborting all operations should not lead to corruption, since that is
exactly what the filesystem has been designed to prevent. Aborting
essentially panics the machine from the perspective of the on-disk
filesystem, even if the system has not actually panicked, such as when
the filesystem is unmounted instead.

> Unfortunately there are many ways in which this can fail, mostly involving
> external factors violating the assumptions upon which soft updates relies.
> For example, the data written on disk may not correspond to the data
> dispatched by soft updates, due to things like write caching in the
> hardware, write reordering, data corruption, unpredictable disk behaviour
> during power loss, hardware failure, etc.

I am aware of this too (and paranoid about it).

> Similarly, background fsck assumes that the only filesystem errors it will
> encounter are those permitted by the soft updates model, which are
> "benign", i.e. non-fatal and correctable at runtime.  When the state of
> your disk departs from the realm of these assumptions, bg fsck may not be
> able to repair the damage.

My thinking was that in simple cases (e.g., say you put UFS on a geom
provider that simulates failure, or the disk has a transient write
failure on some particular sector, etc), unmounting the filesystem (or
remounting read-only) would lead to a filesystem with only expected
(and designed-for) inconsistencies - assuming of course that there are
no other issues going on, such as random corruption on the drive or in
the I/O path.

In any case, I was not really looking to get into a debate. I only
commented because my reading of the original post was that of a
potential bug in UFS, rather than a lack of understanding that fsck
cannot fix arbitrary errors. As with most such bug reports coming from
a real-life situation, one can never prove that there was not random
corruption along the I/O path or whatever else.

I know from personal experience, and my understanding from previous
ML traffic is that it is a known issue, that the I/O failure handling
in UFS is not rock solid in terms of system stability. Taking that a
bit further, to the point of causing corruption, did not seem like a
huge leap (e.g., perhaps continuing with a dependent write even though
the previous write failed - not unthinkable to someone unfamiliar
with the code).
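
As a toy illustration of that last point, here is a rough sketch in
Python. It is not UFS or soft updates code: FlakyDisk, flush_in_order
and is_consistent are hypothetical names, and the three-write ordering
is a simplification assumed only for the example. It compares stopping
at the first failed write with carrying on to a dependent write.

```python
# Toy model only; not UFS or soft updates code. All names are hypothetical.


class FlakyDisk:
    """In-memory 'disk' that fails writes to a chosen set of sectors."""

    def __init__(self, bad_sectors):
        self.bad_sectors = set(bad_sectors)
        self.sectors = {}

    def write(self, sector, data):
        if sector in self.bad_sectors:
            raise IOError(f"write to sector {sector} failed")
        self.sectors[sector] = data


def flush_in_order(disk, ordered_writes, stop_on_error=True):
    """Issue writes in dependency order; return the sectors written.

    stop_on_error=True: the first failure aborts everything after it.
    stop_on_error=False: failures are ignored and later (dependent)
    writes still go out, which is the hypothesised bug.
    """
    written = []
    for sector, data in ordered_writes:
        try:
            disk.write(sector, data)
        except IOError:
            if stop_on_error:
                break
            continue
        written.append(sector)
    return written


def is_consistent(written, ordered_writes):
    """Consistent here means: what reached the disk is a prefix of
    the intended write order (nothing was written past a gap)."""
    expected = [sector for sector, _ in ordered_writes][:len(written)]
    return written == expected


# Assumed ordering: the inode must be on disk before the directory
# entry that points at it.
writes = [(10, "bitmap"), (20, "inode"), (30, "dirent")]

safe = flush_in_order(FlakyDisk({20}), writes, stop_on_error=True)
buggy = flush_in_order(FlakyDisk({20}), writes, stop_on_error=False)

print(safe, is_consistent(safe, writes))
print(buggy, is_consistent(buggy, writes))
```

With sector 20 failing, the stop-on-error run leaves only the first
write on disk, which is a prefix of the intended order. The
ignore-errors run also writes the directory entry, which then refers
to an inode that never reached the disk.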

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org


--xHFwDpU9dbj6ez1V
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)

iD8DBQFG5G+eDNor2+l1i30RArmDAJ9dyRW7dTVopYFAczdAa0ydBEOZBQCfREWq
EzVSVUGfzCCFo3tMEUYlgW8=
=ZB6b
-----END PGP SIGNATURE-----

--xHFwDpU9dbj6ez1V--