From: Steve Byan
To: freebsd-fs@FreeBSD.ORG
Date: Wed, 22 Jan 2003 10:42:59 -0500
Subject: Re: JFS vs. Soft Updates (again) (was: Re: large filesystem, journaling filesystem support)
In-Reply-To: <3E2DCC0C.FCAB2EFF@mindspring.com>
Message-Id: <2F16E2F0-2E20-11D7-962B-00306548867E@maxtor.com>

On Tuesday, January 21, 2003, at 05:39 PM, Terry Lambert wrote:

> Steve Byan wrote:
>> On Monday, January 20, 2003, at 02:43 PM, Julian Elischer wrote:
>>> it would be nice if the drive had enough NVram to hold that one
>>> trashed block so it could rewrite it on powerup.
>>
>> If enough customers show up waving dollar bills in their hands ...
>
> The disk manufacturers have historically not recognized new
> markets, until after their competition has already entered them.

It's a low margin, high volume, capital-intensive, technically
challenging industry. Blowing one product generation means losing money
big-time for a year. Adding a little cost that doesn't return value can
blow your entire margin on a product. These considerations lead to
risk-averse management, and I don't blame them a bit. (well, really,
not very much :-)

[snip]

> There's a great book on this:
>
>     The Innovator's Dilemma
>     Clayton M. Christensen
>     HarperBusiness
>     ISBN: 0-06-662069-4
>
> The hard disk industry is one of his three major examples.  8-).
>

Yeah, it's required reading here :-)

> FWIW: the major market you are not seeing here is ATA RAID arrays
> that can compete with SCSI RAID arrays from other disk vendors,
> where you can leverage the ATA economies of scale that make SCSI
> disks more expensive than ATA disks, in the first place.  Basically,
> the first ATA disk manufacturer to do this will spike much of their
> competition's SCSI market, as soon as the software types become aware
> of the change (see below).

I think the industry is aware of this. However, as you note below, ATA
disks are not yet quite up to snuff in this application.

Regarding economies of scale and price of SCSI vs ATA, note that while
the majority of the price differential is simply economies of scale,
most of the remaining price differential between ATA and SCSI disks is
due to the performance difference of the mechanics (5400 and 7200 RPM
vs 10K and 15K RPM, 1.5 ms short-stroke seek rather than 0.750 ms),
rather than the cost of the electronics for the host interface.
>
>
>>> For us the problem is that the drive reports the write as having
>>> happened when it hasn't, so the filesystem dependencies end up being
>>> smashed, because the filesystem is writing out data in dependency
>>> order, but if the data is written in a different order to the drive,
>>> the drive can end up being in error in the case of failure.
>>
>> That's the cost of write-behind caching. SCSI gives you enough control
>> to avoid this problem. ATA disks don't, but at least they're
>> inexpensive.
>
> Which is why people call ATA drives "crap", and disk manufacturers
> get upset about it: they are competing on size and spindle speed,
> and somehow seem to have forgotten one of the purposes of their
> products is to _reliably store data_.

Seems to me some OS vendors also have forgotten this; one non-Unix file
system of considerable popularity uses delayed-writes for all its
metadata in order to achieve reasonable speed. As an unfortunate
side-effect, chunks of your filesystem might disappear after a power
failure. Come to think of it, doesn't Linux ext2fs make the same
trade-off?

>
> The funny thing is that it would cost them nearly nothing, now
> that they have tagged command queues for ATA drives, to put this
> feature into ATA drives, as well... in fact, it may even be no
> more than a firmware hack.

While I'm not intimately familiar with the ATA firmware, I get
push-back when talking with the folks who do the ATA products because
they have a small code-base with very scarce CPU cycles and memory, so
they're concerned about the resource cost of this extra code-path.
(Note that this function affects only writes, which are not part of the
tagged command queues in ATA - ATA queuing is only useful for reads;
ATA gets write-queuing by delayed-write caching. ATA queuing only
allows disconnecting between sending the command and transferring the
data; it doesn't allow disconnecting between transferring the data and
transferring the status.
Hence ATA queuing is useless for writes.) They are also concerned about
complexity; ATA product cycles are very short, so there's a desire to
keep things simple, to minimize the risk of bugs.

>
>
>> Ick, that could be a big number, maybe a couple of seconds in the very
>> worst-case, I dunno for sure. I think you're probably talking a UPS
>> rather than a large filter cap in the power supply. I think it's
>> technically better to accept that you're not going to get all the data
>> on the disk when power fails, and supply a "power fail" signal to the
>> drive a few sector-times in advance of the power going out of the
>> spec-limits. That way the drive could guarantee that it won't
>> partially overwrite a sector.
>
> That's a really annoying point of view.  8-).

It's an unfortunate reality of physics :-)

>
> The problem with this approach is that it requires cable changes
> to the drive interface, unless you designate one of the "spare"
> grounds as being inverse AC present signal; even so, you would
> not be guaranteed that the motherboard/controller manufacturers
> have all tied this pin active low in their designs, if it's
> truly a "spare".  That means the disks would not work with some
> motherboards, which is death in a commodity market.

Dunno about the ATA ASICs, but the SCSI ASICs have some GPIO pins that
could profitably be mined to provide this functionality via a firmware
change, using one of the pins on the option connector. The OEM would
need an extra cable to connect to the drive, but this isn't a big deal
unless you're into SCA connectors. Write it into your next purchase
spec, wave dollar bills in front of the sales representative, and you
could get this function.

> I suspect that this is a good reason that, despite the design
> being available in your head, no manufacturer has implemented
> this, even if there was not computer hardware support for it.
Actually, none of the big OEMs are interested, because they'd rather
give you atomic writes by selling you a big expensive hardware RAID
box. That's why the functionality hasn't been implemented.

(joke: How to make big money in storage:

   1. put price pressure on drive suppliers so that they are forced to
      manufacture crap.
   2. design expensive storage system to rectify problems caused by
      step 1.
   3. Profit!
)

Show up with a reasonably-sized market and a feature request for
something that can be implemented in firmware, and you can negotiate to
get your feature.

>
> Basically, this means that we (filesystems engineers) have two
> wishlist items for disk manufacturers:
>
> 1)  Add logic to the ATA disks to provide the same control
>     over the ordering of operations (e.g. barriers and
>     completion notification) that SCSI disks have (per the
>     above, this may be nothing more than a firmware hack).

As noted above, this is a rather large firmware hack. More like a
re-write of significant portions of the code. Portions are not even
implementable in ATA-land (i.e. write-queuing is broken in the
interface definition). By the time you are done, you have special
electronics requirements (more SRAM, faster CPU) that are too expensive
to go into commodity drives. One could hypothesize a low-volume ATA
drive with special electronics, but such low-volume ATA drives probably
would cost only slightly less than higher-volume high-performance SCSI
drives. Why not just buy the SCSI drive in the first place?

>
> 2)  Provide the ability to obtain physical geometry
>     information from ATA disks, similar to the information
>     that is returned in SCSI mode page 2.

Write this into your purchase spec and wave dollar bills in front of
your sales rep. Such info was available from Quantum ATA disks; it's
probably available from Maxtor's, though I don't know for sure. It's
solely a firmware change to provide this functionality.

> The first can be a "must enable, disabled by default" item, and
> the second could be a vendor-private command, which keeps both
> of them from being visible to ignorant users of the disks.
>
> If you want to address throwing a chock in the wheels and/or
> dumping the write queue to on-board NVRAM, assuming an inverse
> AC fail notification, if it's turned on (off by default to
> account for floating cable pins, rather than active low, on
> some motherboards, to avoid sabotaging your existing market),
> that would be nice too.  ;^).

How much extra would you pay? Would you buy sole-sourced drives to get
these features?

These are all do-able. Negotiate with your vendors. Ask to talk to the
drive marketing folks, to get your message heard back at the plant.

Regards,
-Steve
(not speaking for his employer)

--------
Steve Byan
Design Engineer
Maxtor Corp.
MS 1-3/E23
333 South Street
Shrewsbury, MA 01545
(508) 770-3414