Date: Fri, 31 Jan 2003 10:16:41 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Steve Byan <stephen_byan@maxtor.com>
Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org
Subject: Re: DEV_B_SIZE
Message-ID: <Pine.BSF.4.21.0301311002110.45015-100000@InterJet.elischer.org>
In-Reply-To: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>

On Fri, 31 Jan 2003, Steve Byan wrote:

> There's a notion afoot in IDEMA to enlarge the underlying physical
> block size of disks to 4096 bytes while keeping a 512-byte logical
> block size for the interface. Unaligned accesses would involve either a
> read-modify-write or some proprietary mechanism that provides
> persistence without the latency cost of a read-modify-write.
>
> Performance issues aside, it occurs to me that hiding the underlying
> physical block size may break many careful-write and
> transaction-logging mechanisms, which may depend on no more than one
> block being corrupted during a failure. In IDEMA's proposal, a power
> failure during a write of a single 512-byte logical block could result
> in the corruption of the full 4K block, i.e. reads of any of the
> 512-byte logical blocks in that 4K physical block would return an
> uncorrectable ECC error.
>
> I'd appreciate hearing examples where hiding the underlying physical
> block size would break a file system, database, transaction processing
> monitor, or whatever. Please let me know if I may forward your reply
> to the committee. Thanks.

I presume that if such a drive were made, there would be some way to identify it? It would be very easy to configure a filesystem to have a minimum writable unit size of 4k, and I assume that doing so would be slightly advantageous (no read-modify-write). It would, however, be good if we could easily identify when doing so was a good idea.

Another idea would be to have some way that you could specify a block number and have the drive tell you the first block in the same group. That would allow a filesystem to work out the alignment. It may not be able to access absolute block numbers if it's going through some layers of translation, and some way of asking "am I aligned?" might be useful.

One thing that does come to mind is that, as you say, on power fail we would now be liable to lose a group of 8 sectors (4k) instead of one 512-byte sector.
Recovery algorithms might have to deal with this (should we actually decide to write one.. :-), particularly if the block being written was the 1st, but the other 7 blocks contain data that the OS has no way of knowing is in jeopardy. In other words, I might know that block 1 is in danger and put it in a write log (in a logging filesystem), but I have no way of knowing that the other 7 are in danger, so they may not be in the write log (assuming that the write log only holds the last N transactions).

I'd say that this means that the drive should hold the active 4k block in nvram or something. You seem to have considered this, but I'm in agreement that it could prove "nasty" in exactly the cases that are most important. People use write logging etc. in cases where they care about the data and recovery time. These are exactly the people who are going to be the most pissed off to lose their data.

If we can easily tell the system to use 4k frags or 4k block numbers (i.e. we can elect to expose the real blocksize) then we are probably in better shape.
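
A minimal sketch of the arithmetic behind the "first in the same group" and "am I aligned?" queries, and of the power-fail exposure discussed above. It assumes 512-byte logical and 4096-byte physical blocks, as in the quoted proposal. The function names and the `offset` parameter (the logical block number at which a physical block begins, as seen through any layers of translation) are hypothetical; the thread does not describe an actual drive interface for this.

```python
LOGICAL_SIZE = 512     # bytes per logical block (from the quoted proposal)
PHYSICAL_SIZE = 4096   # bytes per physical block (from the quoted proposal)
SECTORS_PER_GROUP = PHYSICAL_SIZE // LOGICAL_SIZE  # 8 logical blocks per group


def first_in_group(lba, offset=0):
    """Return the first logical block of the physical block containing lba.

    offset is a hypothetical value: a logical block number at which some
    physical block starts, as seen by the caller after translation.
    """
    return lba - ((lba - offset) % SECTORS_PER_GROUP)


def is_aligned(lba, count, offset=0):
    """True if a write of count blocks at lba covers only whole physical blocks."""
    starts_on_boundary = first_in_group(lba, offset) == lba
    whole_groups = count % SECTORS_PER_GROUP == 0
    return starts_on_boundary and whole_groups


def sectors_at_risk(lba, count, offset=0):
    """Logical blocks not being written that share a physical block with the write.

    These are the blocks a write log would not know to protect.
    """
    first = first_in_group(lba, offset)
    end = first_in_group(lba + count - 1, offset) + SECTORS_PER_GROUP  # exclusive
    return [b for b in range(first, end)
            if b >= 0 and not (lba <= b < lba + count)]


if __name__ == "__main__":
    # Writing only the first block of a group leaves the other 7 exposed.
    print(sectors_at_risk(8, 1))
    # A partition that starts at block 63 shifts what counts as aligned.
    print(is_aligned(63, 8), is_aligned(63, 8, offset=63))
```

Under these assumptions, a filesystem that only issues writes for which `is_aligned` holds never touches a physical block it is not fully rewriting, so `sectors_at_risk` is empty for every write.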
