Message-ID: <3E3B1582.39463573@mindspring.com>
Date: Fri, 31 Jan 2003 16:32:02 -0800
From: Terry Lambert
To: Steve Byan
Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org
Subject: Re: DEV_B_SIZE
References: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>

Steve Byan wrote:
> There's a notion afoot in IDEMA to enlarge the underlying physical
> block size of disks to 4096 bytes while keeping a 512-byte logical
> block size for the interface. Unaligned accesses would involve either a
> read-modify-write or some proprietary mechanism that provides
> persistence without the latency cost of a read-modify-write.
>
> Performance issues aside, it occurs to me that hiding the underlying
> physical block size may break many careful-write and
> transaction-logging mechanisms, which may depend on no more than one
> block being corrupted during a failure. In IDEMA's proposal, a power
> failure during a write of a single 512-byte logical block could result
> in the corruption of the full 4K block, i.e. reads of any of the
> 512-byte logical blocks in that 4K physical block would return an
> uncorrectable ECC error.
>
> I'd appreciate hearing examples where hiding the underlying physical
> block size would break a file system, database, transaction processing
> monitor, or whatever. Please let me know if I may forward your reply
> to the committee. Thanks.

UFS directory operations are performed on physical disk blocks,
which are assumed to be DEV_BSIZE (512 bytes) in size. At a minimum,
this change would break the I/O path by raising the atomic unit size
to 4096 bytes.

The reason this would break is that the atomic write guarantee is used
to ensure that single-sector changes are recorded atomically. This
is important in rename operations from a short name to a longer name,
where the new name is allocated as a hard link in the new block; the
place this becomes problematic is where the new block and the old
block are the same physical block, unknown to the software.
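To make the "same block, unknown to the software" case concrete, here
is a minimal sketch of how 512-byte logical blocks would map onto 4096-byte
physical blocks. The function names are hypothetical, and the sketch assumes
logical block 0 starts exactly on a physical block boundary, which a real
partition may not do.

```python
LOGICAL_SIZE = 512     # bytes per logical block, as seen through the interface
PHYSICAL_SIZE = 4096   # bytes per physical block under the proposal
RATIO = PHYSICAL_SIZE // LOGICAL_SIZE   # 8 logical blocks per physical block


def physical_block(lba):
    """Index of the physical block holding logical block `lba`.

    Assumption: logical block 0 is aligned to a physical block boundary.
    """
    return lba // RATIO


def collateral_blocks(lba):
    """Logical blocks, other than `lba`, that share its physical block.

    These are the blocks that would also become unreadable if a write
    to `lba` were interrupted and the whole physical block were lost.
    """
    first = physical_block(lba) * RATIO
    return [n for n in range(first, first + RATIO) if n != lba]


def share_physical_block(lba_a, lba_b):
    """True if two logical blocks live in the same physical block."""
    return physical_block(lba_a) == physical_block(lba_b)
```

For example, collateral_blocks(10) lists logical blocks 8, 9 and 11 through
15: software that believes it is touching only block 10 is in fact putting
all of them at risk.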

The transaction in question is atomic file replacement; it involves:

    name    - name of the file
    name.1  - name of the file whose contents are to atomically
              replace the contents of "name"
    name.2  - name of intermediate file for use in transaction
              rollback/forward

The transaction is:

---------------------------   -----------------------------
files                         view
---------------------------   -----------------------------
name                          name
+name.1                       name name.1
explicit_sync(name.1)         name name.1
name -> name.2                name name.1 name.2
                              name.1 name.2
name <- name.1                name name.1 name.2
                              name name.2
-name.2                       name
---------------------------   -----------------------------

The failure recovery is:

---------------------------   -----------------------------
view                          process
---------------------------   -----------------------------
name                          [NULL]
name name.1                   [ROLL BACK(partial file?)]
                              -name.1
name name.1 name.2            [ROLL FORWARD]
                              -name
                              name <- name.1
                              -name.2
name name.2                   [ROLL FORWARD]
                              -name.2
---------------------------   -----------------------------

Currently, UFS is subject to damage through corruption of data in a
pending transaction. A corrupt sector destroys data. But this is a
weakness of UFS, and is not a uniform weakness of all FS's that must
provide the same transactional guarantees to the applications, for
the purposes of recovery.

In a journalling or log structured FS, the failure of a write of a
sector of data -- or rather, an extent or log or journal line -- is
recoverable: you get the previous contents, because the journal line
has not been replaced with new contents with a newer date stamp.

The result is that it backs the transaction out for you. But this
is still potentially a partial back-out, which can leave us with
any of the views of the directory contents, which we need to use to
discern our recovery strategy ([NULL]/[ROLL BACK]/[ROLL FORWARD]).
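
The failure recovery table can be sketched as a lookup from the set of
names found after a crash to a strategy and its steps. This is only an
illustration of the table, not code from any filesystem; the function name
`recovery_plan` is hypothetical, and a view that the table does not list
returns None rather than a guessed strategy.

```python
def recovery_plan(view):
    """Map the directory view found after a crash to (strategy, steps).

    `view` is any iterable of the names present: "name", "name.1", "name.2".
    Steps use the table's notation: "-x" removes x, "a <- b" renames b to a.
    Returns None for a view that the recovery table does not list.
    """
    table = {
        frozenset({"name"}):
            ("NULL", []),
        frozenset({"name", "name.1"}):
            ("ROLL BACK", ["-name.1"]),
        frozenset({"name", "name.1", "name.2"}):
            ("ROLL FORWARD", ["-name", "name <- name.1", "-name.2"]),
        frozenset({"name", "name.2"}):
            ("ROLL FORWARD", ["-name.2"]),
    }
    return table.get(frozenset(view))
```

The point of the lookup is that the strategy depends only on which names
are visible, so a failure that makes a directory block unreadable, or
returns a stale copy of it, changes which row is chosen.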

The risk is much higher in the journalling case, in that the logging
extents may in fact be adjacent and span the 4K boundary, while only
being self-protecting against spanning a 512b boundary. The net effect
is that rather than guaranteeing to damage only a single extent, you
may damage two extents containing pre- and post-operation data.
Unless the filesystem maintains extents two back, or goes out of its
way to ensure non-adjacency (can this be done, in the face of sector
sparing?), this type of failure is unrecoverable.

The main issue with this is that you cannot ensure physical alignment
of the underlying logical device that is acting as a backing store
for the FS. This was, and is, a common performance problem for OS's
that use demand-paged virtual memory: MSDOS FAT FS's on drives that
claim an odd-numbered physical sector count per track result in the
first partition being on an odd 512b boundary. The result is that
physical pages in memory are spanned by every third 1K FS block,
because they are offset by 512b from the start of the disk.

So even if you are not considering the single-sector issue as a design
flaw in UFS, and even if requiring recompilation is acceptable (it is,
IMO), you can't necessarily avoid the failure case.

Note: This is not an exhaustive list; it is just off the top of my
head. I could probably come up with other scenarios as well, e.g. at
the very least, for FAT, you would probably be screwed with a number
larger than 1K, even if you were careful to make sure that the sectors
per track was an even multiple of your physical block size, since the
FAT entry in FAT FS's *is* the inode.

-- Terry