Date: Wed, 22 Jan 2003 15:03:12 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Steve Byan <stephen_byan@maxtor.com>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: JFS vs. Soft Updates (again) (was: Re: large filesystem, journaling filesystem support)
Message-ID: <3E2F2330.F7A46C6E@mindspring.com>
References: <2F16E2F0-2E20-11D7-962B-00306548867E@maxtor.com>

Steve Byan wrote:
> > FWIW: the major market you are not seeing here is ATA RAID arrays
> > that can compete with SCSI RAID arrays from other disk vendors,
> > where you can leverage the ATA economies of scale that make SCSI
> > disks more expensive than ATA disks, in the first place.  Basically,
> > the first ATA disk manufacturer to do this will spike much of their
> > competition's SCSI market, as soon as the software types become aware
> > of the change (see below).
>
> I think the industry is aware of this. However, as you note below, ATA
> disks are not yet quite up to snuff in this application.

At this point, it is apparently a matter of firmware.  Most ATA
firmware is flashable.  If a manufacturer were to release source
code for its firmware, it would see the same sort of thing happen
for that disk drive that happened for the Broadcom Tigon II product,
which is able to sustain over 500,000 packets a second and 250,000
connections a second, if you are willing to rewrite the firmware on
the card -- and if the manufacturer lets you do it.

I don't expect this to happen, but it would certainly, over time,
end up reducing your firmware costs, since you would then find that
your hardware designs were constrained "to run the *good* firmware".

> Regarding economies of scale and price of SCSI vs ATA, note that while
> the majority of the price differential is simply economies of scale,
> most of the remaining price differential between ATA and SCSI disks is
> due to the performance difference of the mechanics (5400 and 7200 vs
> 10K and 15K, 1.5 ms short-stroke seek rather than 0.750 ms), rather
> than the cost of the electronics for the host interface.

This is the spindle speed competition.  I think they've already lost
the size one, and both are at parity (in fact, SCSI tends to follow,
rather than lead, in size, these days).

I have to say that the thing that motivates people to buy SCSI is
not the speed of the disk.
The ATA burst transfer rates are significantly higher than SCSI
these days, and with interleaved commands from the controller
possible on both, the SCSI command overhead and latency is becoming
significant.

The most meaningful SCSI features are those you have already
identified, which all basically boil down to being able to keep the
speed without sacrificing reliability, by avoiding the introduction
of what would otherwise be stall barriers.  For ATA, you effectively
have to disable write caching to achieve the same thing.

SCSI is at the "acceptable speed" point, and pushing it to "much
more than acceptable speed" isn't really useful.  An applications
engineer can make a decision these days on the speed tradeoff, and
ignore the disk manufacturers entirely, if they decide to do so.

The applications that needed a lot of disk can live with
write-through caching with a minimum of barriers, as required.
That, in fact, is the basis for soft updates, and it's the basis
for the earlier technology of DOW (Delayed Ordered Writes) out of
USL/Novell, and used in Reiser FS.

The net effect of this is that it's possible to stuff 64G of RAM
into the box you care about access speed on, and use hard drives as
nothing more than a non-volatile mirror that's much slower.

The next meaningful change that's going to come in SCSI is the
ability to assert range locks on the device from multiple host
masters.  When this happens, it will break down one of the main
barriers to scaling applications by throwing hardware at them; this
is already the case with GFS (in a really primitive way, using a
network lock manager, with significantly higher latency than is
achievable with hardware support), and is one of the fundamental
drivers for network attached storage, and for NFS servers from
companies like NetApp and Auspex, for that matter, at this point.
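
A minimal sketch of the ordered-write idea behind soft updates and
DOW mentioned above, assuming Python's os.fsync() can stand in for
the barrier.  The names write_durably and ordered_update and the two
file names are hypothetical, chosen only for illustration.  On a
drive with write caching enabled, fsync() returning may not by itself
mean the data is on the platters, which is the problem this thread
is about.

```python
import os
import tempfile


def write_durably(path, payload):
    """Write payload to path and block until the OS reports it flushed."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # push Python's buffer down to the kernel
        os.fsync(f.fileno())   # barrier: wait before issuing anything else


def ordered_update(directory, data, meta):
    """Write the data block first, then the metadata that refers to it.

    The dependency is one-way: if power fails between the two writes,
    the result is an unreferenced data block, never metadata pointing
    at a block that was not written.
    """
    data_path = os.path.join(directory, "block.bin")   # hypothetical name
    meta_path = os.path.join(directory, "inode.bin")   # hypothetical name
    write_durably(data_path, data)   # step 1: the thing being pointed at
    write_durably(meta_path, meta)   # step 2: the pointer
    return data_path, meta_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(ordered_update(d, b"file contents", b"points at block.bin"))
```

Only one barrier is needed per dependency, which is the sense in
which ordered writes keep the number of stalls to a minimum.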

Notice that the drive to up-market has passed the "fast enough"
point, as noted by Christensen in the book I already referenced,
and which you noted was "required reading".

One of the ironic things about "required reading" assignments in
industries like the disk industry is that it's very much like the
person who believes they can be rich, merely by buying what the
rich buy: reading "The Innovator's Dilemma" will not make the
industry any less likely to auger-in in an up-market spiral than
anything in the past, because the economics of disruptive products
remain the same.

So the bottom line is that it's possible to add the features you
think are not the market drivers to the ATA drives, and if you are
right, then they simply will not be used, and the cost will be some
time paid to a firmware engineer.  If, on the other hand, you're
wrong, then you will capture significant market from your
competitors.

I imagine that your SCSI products division would fight this; I
don't know where your margins are now, but I expect that a lot of
them are in up-scale SCSI.  The thing to do would be to talk to
your bean-counters, and do a cost analysis, in the case that your
ATA drive marginalized everyone else's SCSI drives -- even your
own -- and see where that leaves you.

It's possible that you could do this on one disk line, and if it
starts selling well, let scarcity drive up the price, after which
you can keep the price high, and the firmware difference could give
you (and other manufacturers who follow you) the differential
margin that you're now getting from SCSI.

If this happens, you would be very happy, since you will have
reduced your costs while not damaging your profitability, by way of
pushing SCSI out (I suspect that what would happen instead is that
SCSI would be pushed up-market with the locking and multiple
mastering and other advanced features, which can't be safely
duplicated in ATA, for lack of connectors and multihost ATA
interconnects).

> >> That's the cost of write-behind caching. SCSI gives you enough control
> >> to avoid this problem. ATA disks don't, but at least they're
> >> inexpensive.
> >
> > Which is why people call ATA drives "crap", and disk manufacturers
> > get upset about it: they are competing on size and spindle speed,
> > and somehow seem to have forgotten one of the purposes of their
> > products is to _reliably store data_.
>
> Seems to me some OS vendors also have forgotten this; one non-Unix file
> system of considerable popularity uses delayed-writes for all its
> metadata in order to achieve reasonable speed. As an unfortunate
> side-effect, chunks of your filesystem might disappear after a power
> failure. Come to think of it, doesn't Linux ext2fs make the same
> trade-off?

Windows didn't "invent" virtual memory and they didn't "invent"
protected mode operating, until the market forced it on them, and/or
until they were forced to look to the server market for a new
market, after saturating the desktop.

Microsoft is very much an "Innovator's Dilemma" company, where you
will not see innovation or new technology until it affects their
market share to not have it.  They are an evolutionary product
company, at this point, and they will never deal with an issue until
they have to deal with it, because doing otherwise detracts from
their bottom line.

Note that Windows NT *did* address this with NTFS, and while it took
a couple of false starts to get there, NTFS is now the default in
Windows XP systems, from the factory.

The EXT2FS *does* have the same problem, but realize that this is
because there's a speed issue, and that issue comes from outside; I
would argue that it has been addressed by faster transfer rates and
tagged command queueing.

I'd also argue that Linux knows this, and is trying to address it
with all the myriad GFS/XFS/EXT3FS/ReiserFS/etc.
projects, which all seek to not have those attributes, even if the
authors don't seem to know *why* they are pursuing the goal, or if
they do, *why* people are finally getting behind the cart and
helping them push, when they struggled alone and forsaken for such a
long time.  It boils down to "market pressure from Windows XP".

> > The funny thing is that it would cost them nearly nothing, now
> > that they have tagged command queues for ATA drives, to put this
> > feature into ATA drives, as well... in fact, it may even be no
> > more than a firmware hack.
>
> While I'm not intimately familiar with the ATA firmware, I get
> push-back when talking with the folks who do the ATA products because
> they have a small code-base with very scarce CPU cycles and memory, so
> they're concerned about the resource cost of this extra code-path.

Worst case, you can make it a feature set (default off) that is
followed by a soft reset, to put the drive into the mode.  Then the
only people who eat the cost are the people who turn it on, in the
knowledge that they eat the cost.

The funny thing is, this is the same argument you could have used to
justify not putting in the knob to turn off write caching -- yet
that knob is there.  8-).

> (Note that this function affects only writes, which are not part of the
> tagged command queues in ATA - ATA queuing is only useful for reads;
> ATA gets write-queuing by delayed-write caching. ATA queuing only
> allows disconnecting between sending the command and transferring the
> data; it doesn't allow disconnecting between transferring the data and
> transferring the status. Hence ATA queuing is useless for writes.) They
> are also concerned about complexity; ATA product cycles are very short,
> so there's a desire to keep things simple, to minimize the risk of bugs.

Well, having it "off by default, separate firmware image after soft
reset when on" completely addresses these concerns, I think 8-).
I was aware of the tagged command queueing/writing issue; it's very
unfortunate that those issues aren't corrected, too.  8-).

> > That's a really annoying point of view.  8-).
>
> It's an unfortunate reality of physics :-)

You know, I keep bumping my head on physics; we should do something
about that, don't you think?  8-) 8-).

> > The problem with this approach is that it requires cable changes
> > to the drive interface, unless you designate one of the "spare"
> > grounds as being inverse AC present signal; even so, you would
> > not be guaranteed that the motherboard/controller manufacturers
> > have all tied this pin active low in their designs, if it's
> > truly a "spare".  That means the disks would not work with some
> > motherboards, which is death in a commodity market.
>
> Dunno about the ATA ASICs, but the SCSI ASICs have some GPIO pins that
> could profitably be mined to provide this functionality via a firmware
> change, using one of the pins on the option connector. The OEM would
> need an extra cable to connect to the drive, but this isn't a big deal
> unless you're into SCA connectors. Write it into your next purchase
> spec, wave dollar bills in front of the sales representative, and you
> could get this function.

For cheap devices, I'm not allowed to spec SCSI.  8-(.

My own opinion here is that the companies I did the work for didn't
really have an expectation of selling 100,000 units, despite their
claims in the company meetings, and so single unit cost at the
expense of repeat sales was an acceptable tradeoff for them.
8-( 8-(.

> Actually, none of the big OEMs are interested, because they'd rather
> give you atomic writes by selling you a big expensive hardware RAID
> box. That's why the functionality hasn't been implemented.
>
> (joke: How to make big money in storage:
> 1. put price pressure on drive suppliers so that they are forced to
> manufacture crap.
> 2. design expensive storage system to rectify problems caused by step 1.
> 3. Profit!
> )

Cynical, cynical...

> Show up with a reasonably-sized market and a feature request for
> something that can be implemented in firmware, and you can negotiate to
> get your feature.

Anything short of building a multimillion dollar company that I can
do?  It's not that I'm averse to that, you understand, it's just
that I'd have to delay my gratification about 3 or 4 years...

> > Basically, this means that we (filesystems engineers) have two
> > wishlist items for disk manufacturers:
> >
> > 1)  Add logic to the ATA disks to provide the same control
> >     over the ordering of operations (e.g. barriers and
> >     completion notification) that SCSI disks have (per the
> >     above, this may be nothing more than a firmware hack).
>
> As noted above, this is a rather large firmware hack. More like a
> re-write of significant portions of the code. Portions are not even
> implementable in ATA-land (i.e. write-queuing is broken in the
> interface definition). By the time you are done, you have special
> electronics requirements (more SRAM, faster CPU) that are too expensive
> to go into commodity drives. One could hypothesize a low-volume ATA
> drive with special electronics, but such low-volume ATA drives probably
> would cost only slightly less than higher-volume high-performance SCSI
> drives. Why not just buy the SCSI drive in the first place?

The main answer here?  I want a minimum level of functionality
assurance from all disk drives, not just SCSI, so I can design
software systems that don't have to care about the disks that
someone slots into the chassis.  It reduces overall software
complexity to do this.

Remember back when IDE drives could not do DMA, and had to use the
host CPU for data transfers?  That was Bad(tm).  I'd like to get
away from similar problems, now that that one has been solved.

> > 2)  Provide the ability to obtain physical geometry
> >     information from ATA disks, similar to the information
> >     that is returned in SCSI mode page 2.
>
> Write this into your purchase spec and wave dollar bills in front of
> your sales rep. Such info was available from Quantum ATA disks; it's
> probably available from Maxtor's, though I don't know for sure. It's
> solely a firmware change to provide this functionality.

Yep; I *knew* that an ATA manufacturer had supported it, but I
couldn't point at the one, so I stayed away from that earlier when
someone claimed ATA did not support it.  Thanks for the ammo.
8-) 8-).

> > The first can be a "must enable, disabled by default" item, and
> > the second could be a vendor-private command, which keeps both
> > of them from being visible to ignorant users of the disks.
> >
> > If you want to address throwing a chock in the wheels and/or
> > dumping the write queue to on-board NVRAM, assuming an inverse
> > AC fail notification, if it's turned on (off by default to
> > account for floating cable pins, rather than active low, on
> > some motherboards, to avoid sabotaging your existing market),
> > that would be nice too.  ;^).
>
> How much extra would you pay? Would you buy sole-sourced drives to get
> these features? These are all do-able. Negotiate with your vendors. Ask
> to talk to the drive marketing folks, to get your message heard back at
> the plant.

We *would* have paid this at Whistle, the chock-in-the-wheels, to
avoid having an overly complex power supply turn into a standard
supply, a triac, a cap, two regulators, an op-amp, and an
optoisolator.  8-).  Would have saved us maybe $35-$50 on COGS.

I'll see what I can do about finding/creating a similar situation in
the future, and using that to leverage the change, via purchases.

What are the chances that, once it's written, this stuff will go
into the standard production models?

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message
