From: Peter Schuller
To: Kris Kennaway
Cc: freebsd-fs@freebsd.org, Johannes Totz
Date: Mon, 10 Sep 2007 00:11:42 +0200
Subject: Re: UFS not handling errors correctly

> Soft updates isn't journalling, so you can't "roll back" an error. It
> works by maintaining knowledge of the on-disk state of data and ensuring
> that it only writes to disk in a suitable order so that the on-disk state
> is supposed to remain consistent.

I am aware of this; I was speaking generally. The least "committal"
solution would be to just panic.

The point I was trying to make was that, as long as errors are
traditional and simple (a particular sector cannot be read, or a write
to a sector fails), aborting all operations should not lead to
corruption, since that is exactly what the filesystem has been designed
to prevent. Aborting essentially panics the machine from the perspective
of the on-disk filesystem, even if the system has not actually panicked,
for example if the filesystem is unmounted instead.

> Unfortunately there are many ways in which this can fail, mostly involving
> external factors violating the assumptions upon which soft updates relies.
> For example, the data written on disk may not correspond to the data
> dispatched by soft updates, due to things like write caching in the
> hardware, write reordering, data corruption, unpredictable disk behaviour
> during power loss, hardware failure, etc.

I am aware of this too (and paranoid about it).

> Similarly, background fsck assumes that the only filesystem errors it will
> encounter are those permitted by the soft updates model, which are
> "benign", i.e. non-fatal and correctable at runtime. When the state of
> your disk departs from the realm of these assumptions, bg fsck may not be
> able to repair the damage.

My thinking was that in simple cases (e.g., say you put UFS on a geom
provider that simulates failure, or the disk has a transient write
failure on some particular sector, etc.), unmounting the filesystem (or
remounting read-only) would lead to a filesystem with only expected (and
designed-for) inconsistencies, assuming of course that there are no
other issues going on, such as random corruption on the drive or in the
I/O path.

In any case, I was not really looking to get into a debate. I only
commented because my reading of the original post was that it described
a potential bug in UFS, rather than a lack of understanding that fsck
cannot fix arbitrary errors.

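
To make the "simple failure" case above concrete, here is a toy sketch
of the behaviour I would expect: writes are issued in dependency order,
and the first failed write stops everything after it. This is not UFS
or soft updates code. The FailingDisk class, the flush_in_order
function, the sector numbers and the three-step dependency chain are
all made up for illustration.

```python
class FailingDisk:
    """Toy block device (hypothetical): writes to bad sectors raise OSError."""

    def __init__(self, bad_sectors=()):
        self.bad_sectors = set(bad_sectors)
        self.sectors = {}  # sector number -> data that reached the "disk"

    def write(self, sector, data):
        if sector in self.bad_sectors:
            raise OSError(f"write to sector {sector} failed")
        self.sectors[sector] = data


def flush_in_order(disk, ordered_writes):
    """Issue writes in dependency order and stop at the first failure.

    Returns (completed, failed): the sectors written successfully, and
    the sector whose write failed (None if every write succeeded).
    """
    completed = []
    for sector, data in ordered_writes:
        try:
            disk.write(sector, data)
        except OSError:
            # Abort here: nothing that depends on this write is issued.
            return completed, sector
        completed.append(sector)
    return completed, None


# Hypothetical dependency chain: each write depends on the one before it.
writes = [(10, "inode bitmap"), (20, "inode"), (30, "directory entry")]
disk = FailingDisk(bad_sectors={20})
completed, failed = flush_in_order(disk, writes)
```

With sector 20 failing, only sector 10 reaches the disk and the
directory entry in sector 30 is never written, so the on-disk state is
one of the incomplete-but-ordered states the design is supposed to
tolerate. The suspected bug would correspond to carrying on with the
loop after the exception.
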
As with most such bug reports coming from a real-life situation, one can
never prove that there was not random corruption along the I/O path or
whatever else. I know from personal experience, and understand from
previous ML traffic that it is a known issue, that the I/O failure
handling in UFS is not rock solid in terms of system stability; so
taking that a bit further and causing corruption did not seem like a
huge leap (e.g., perhaps continuing with a dependent write even though
the previous write failed, which does not seem unthinkable without
being familiar with the code).

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller '
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org