From owner-freebsd-fs  Fri Jan 31 16:34:42 2003
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D49AE37B406
	for <freebsd-fs@freebsd.org>; Fri, 31 Jan 2003 16:34:38 -0800 (PST)
Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B997F43F75
	for <freebsd-fs@freebsd.org>; Fri, 31 Jan 2003 16:34:36 -0800 (PST)
	(envelope-from tlambert2@mindspring.com)
Received: from pool0203.cvx21-bradley.dialup.earthlink.net ([209.179.192.203] helo=mindspring.com)
	by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128)
	(Exim 3.33 #1)
	id 18elc7-00063u-00; Fri, 31 Jan 2003 16:34:32 -0800
Message-ID: <3E3B1582.39463573@mindspring.com>
Date: Fri, 31 Jan 2003 16:32:02 -0800
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Steve Byan <stephen_byan@maxtor.com>
Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org
Subject: Re: DEV_B_SIZE
References: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4fec0bdacb27578085064db9f0561ec03a2d4e88014a4647c350badd9bab72f9c350badd9bab72f9c
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-fs.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-fs>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-fs>
X-Loop: FreeBSD.org

Steve Byan wrote:
> There's a notion afoot in IDEMA to enlarge the underlying physical
> block size of disks to 4096 bytes while keeping a 512-byte logical
> block size for the interface. Unaligned accesses would involve either a
> read-modify-write or some proprietary mechanism that provides
> persistence without the latency cost of a read-modify-write.
> 
> Performance issues aside, it occurs to me that hiding the underlying
> physical block size may break many careful-write and
> transaction-logging mechanisms, which may depend on no more than one
> block being corrupted during a failure. In IDEMA's proposal, a power
> failure during a write of a single 512-byte logical block could result
> in the corruption of the full 4K block, i.e. reads of any of the
> 512-byte logical blocks in that 4K physical block  would return an
> uncorrectable ECC error.
> 
> I'd appreciate hearing examples where hiding the underlying physical
> block size would break a file system, database, transaction processing
> monitor, or whatever.  Please let me know if I may forward your reply
> to the committee. Thanks.

UFS directory operations are performed on the basis of physical disk
blocks, which are assumed to be DEV_BSIZE (512 bytes) in size.  At a
minimum, this change would break the I/O path, because it changes the
atomic unit size to 4096 bytes.

The reason this would break is that the atomic write guarantee is
used to ensure that single-sector changes are recorded atomically.
This is important in rename operations from a short name to a longer
name, where the new name is allocated as a hard link in the new block;
this becomes problematic when the new block and the old block are the
same physical block, unknown to the software.

The transaction in question is atomic file replacement; it involves:

	name	- name of the file
	name.1	- name of the file whose contents are to atomically
		  replace the contents of "name"
	name.2	- name of intermediate file for use in transaction
		  rollback/forward

The transaction is:

	---------------------------	-----------------------------
	files				view
	---------------------------	-----------------------------
	name				name
	+name.1				name	name.1
	explicit_sync(name.1)		name	name.1
	name	->	name.2		name	name.1	name.2
						name.1	name.2
	name	<-	name.1		name	name.1	name.2
					name		name.2
	-name.2				name
	---------------------------	-----------------------------
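
As a rough illustration of the sequence in the table above, here is a
minimal Python sketch.  The function name atomic_replace and the use of
link/unlink pairs for each rename step are assumptions made for the
example (the table only shows that both names are briefly visible); it
is not UFS code, and it omits syncing the directory itself.

```python
import os
import tempfile


def atomic_replace(directory, name, new_contents):
    """Replace `name` using the name / name.1 / name.2 steps (hypothetical helper)."""
    target = os.path.join(directory, name)
    staged = target + ".1"   # holds the replacement contents
    backup = target + ".2"   # intermediate name for rollback/forward

    # +name.1, then explicit_sync(name.1)
    with open(staged, "wb") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())

    # name -> name.2: link first, then unlink, so both names exist briefly
    os.link(target, backup)
    os.unlink(target)

    # name <- name.1: again link first, then unlink
    os.link(staged, target)
    os.unlink(staged)

    # -name.2
    os.unlink(backup)
```

Each step changes one directory entry, which is why the scheme leans on
the single-sector atomic write guarantee discussed above.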

The failure recovery is:

	---------------------------	-----------------------------
	view				process
	---------------------------	-----------------------------
	name				[NULL]
	name	name.1			[ROLL BACK(partial file?)]
					-name.1
	name	name.1	name.2		[ROLL FORWARD]
					-name
					name	<-	name.1
					-name.2
	name		name.2		[ROLL FORWARD]
					-name.2
	---------------------------	-----------------------------
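
The recovery table can be sketched the same way.  This is a hedged
example, not an existing tool: the function name recover is
hypothetical, and the handling of the "name.1 name.2" view (which
appears in the transaction table but not in the recovery table) is an
assumption that it should roll forward.

```python
import os
import tempfile


def recover(directory, name):
    """Inspect which names exist and apply the matching recovery row."""
    target = os.path.join(directory, name)
    staged = target + ".1"
    backup = target + ".2"
    view = (os.path.exists(target), os.path.exists(staged), os.path.exists(backup))

    if view == (True, False, False):
        return "null"                  # nothing in flight
    if view == (True, True, False):
        os.unlink(staged)              # -name.1 (may be a partial file)
        return "roll back"
    if view == (True, True, True):
        os.unlink(target)              # -name
        os.rename(staged, target)      # name <- name.1
        os.unlink(backup)              # -name.2
        return "roll forward"
    if view == (True, False, True):
        os.unlink(backup)              # -name.2
        return "roll forward"
    if view == (False, True, True):
        # Assumption: not listed in the recovery table; treated as roll forward.
        os.rename(staged, target)
        os.unlink(backup)
        return "roll forward"
    raise RuntimeError("unexpected view: %r" % (view,))
```

The point of the sketch is that recovery depends entirely on reading
back a trustworthy view of the directory; if one failed write can damage
several entries at once, the view itself can no longer be trusted.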

Currently, UFS is subject to damage through corruption of data in a
pending transaction.  A corrupt sector destroys data.  But this is a
weakness of UFS, and is not a uniform weakness of all FS's that must
provide the same transactional guarantees to the applications, for
the purposes of recovery.

In a journalling or log structured FS, the failure of a write of a
sector of data -- or rather, an extent or log or journal line -- is
recoverable: you get the previous contents, because the journal line
has not been replaced with new contents with a newer date stamp.  The
result is that it backs the transaction out for you.  But this is still
potentially a partial back-out, which can leave us with any of the views
of the directory contents, which we need to use to discern our recovery
strategy ([NULL]/[ROLL BACK]/[ROLL FORWARD]).

The risk is much higher in this case, in that the logging extents may
in fact be adjacent, and span the 4K boundary, while only being self
protecting from spanning a 512b boundary.  The net effect of this is
that rather than guaranteeing to only damage a single extent, you may
damage two extents containing pre- and post-operation data.  Unless
the filesystem maintains extents two back, or goes out of its way to
ensure non-adjacency (can this be done, in the face of sector sparing?),
this type of failure is unrecoverable.
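
To make the arithmetic concrete, here is a small sketch of one reading
of this failure: two adjacent single-sector log records that happen to
share a 4K physical block.  The record names, the LBAs, and the function
damaged_extents are all made up for illustration.

```python
LOGICAL = 512  # logical sector size presented by the interface


def damaged_extents(extents, written_lba, physical_size):
    """Return the names of extents that overlap the physical block
    containing written_lba, i.e. what a failed write could take out.

    extents maps a name to (first_lba, sector_count).
    """
    per_block = physical_size // LOGICAL
    first = (written_lba // per_block) * per_block
    last = first + per_block - 1
    return sorted(
        name
        for name, (start, count) in extents.items()
        if start <= last and start + count - 1 >= first
    )


# Hypothetical layout: pre- and post-operation records in adjacent sectors.
extents = {"pre": (6, 1), "post": (7, 1)}

print(damaged_extents(extents, 7, 512))    # ['post']
print(damaged_extents(extents, 7, 4096))   # ['post', 'pre']
```

With a 512-byte physical block, a failed write to sector 7 costs only
the record being written; with a 4096-byte physical block it also costs
the neighbouring record the log was counting on for recovery.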

The main issue with this is that you cannot ensure physical alignment
of the underlying logical device that is acting as a backing store for
the FS.  This was and is a common performance problem for OS's using
demand-paged virtual memory: MSDOS FAT FS's on drives that claim an
odd-numbered physical sector count per track result in the first
partition being on an odd 512-byte boundary.  The result is that
physical pages in memory are spanned by every third 1K FS block, because
they are offset by 512 bytes from the start of the disk.
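
The alignment problem is easy to check numerically.  The sketch below
assumes a partition starting at LBA 63 (as with a claimed 63 sectors per
track) and a 4K FS block size; both numbers and the function name are
assumptions chosen for the example.

```python
LOGICAL = 512     # logical sector size
PHYSICAL = 4096   # proposed physical block size


def physical_blocks_touched(partition_start_lba, fs_block_index, fs_block_size=4096):
    """Return the physical block numbers covered by one FS block."""
    start = partition_start_lba * LOGICAL + fs_block_index * fs_block_size
    end = start + fs_block_size - 1
    return list(range(start // PHYSICAL, end // PHYSICAL + 1))


print(physical_blocks_touched(63, 0))   # [7, 8]
print(physical_blocks_touched(64, 0))   # [8]
```

With the partition at LBA 63, every 4K FS block straddles two physical
blocks, so even an FS rebuilt for 4K blocks would still issue writes
that the drive must treat as unaligned.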

So even if you are not considering the single sector issue as a design
flaw in UFS, and even if requiring recompilation is acceptable (it is,
IMO), you can't necessarily avoid the failure case.

Note: this is not an exhaustive list; it is just off the top of my
head.  I could probably come up with other scenarios as well.  For
example, at the very least, for FAT, you would probably be screwed with
a physical block size larger than 1K, even if you were careful to make
sure that the sectors per track was an even multiple of your physical
block size, since the FAT entry in FAT FS's *is* the inode.

-- Terry
