Date:      Wed, 22 Jan 2003 10:42:59 -0500
From:      Steve Byan <stephen_byan@maxtor.com>
To:        freebsd-fs@FreeBSD.ORG
Subject:   Re: JFS vs. Soft Updates (again) (was: Re: large filesystem, journaling  filesystem support)
Message-ID:  <2F16E2F0-2E20-11D7-962B-00306548867E@maxtor.com>
In-Reply-To: <3E2DCC0C.FCAB2EFF@mindspring.com>

On Tuesday, January 21, 2003, at 05:39  PM, Terry Lambert wrote:

> Steve Byan wrote:
>> On Monday, January 20, 2003, at 02:43  PM, Julian Elischer wrote:
>>> it would be nice if the drive had enough NVram to hold that one
>>> trashed block so it could rewrite it on powerup.
>>
>> If enough customers show up waving dollar bills in their hands ...
>
> The disk manufacturers have historically not recognized new
> markets, until after their competition has already entered them.

It's a low margin, high volume, capital-intensive, technically 
challenging industry. Blowing one product generation means losing money 
big-time for a year. Adding a little cost that doesn't return value can 
blow your entire margin on a product. These considerations lead to 
risk-averse management, and I don't blame them a bit. (well, really, 
not very much :-)

[snip]

> There's a great book on this:
>
> 	The Innovator's Dilemma
> 	Clayton M. Christensen
> 	HarperBusiness
> 	ISBN: 0-06-662069-4
>
> The hard disk industry is one of his three major examples.  8-).
>
Yeah, it's required reading here :-)

> FWIW: the major market you are not seeing here is ATA RAID arrays
> that can compete with SCSI RAID arrays from other disk vendors,
> where you can leverage the ATA economies of scale that make SCSI
> disks more expensive than ATA disks, in the first place.  Basically,
> the first ATA disk manufacturer to do this will spike much of their
> competition's SCSI market, as soon as the software types become aware
> of the change (see below).

I think the industry is aware of this. However, as you note below, ATA 
disks are not yet quite up to snuff in this application.

Regarding economies of scale and price of SCSI vs ATA, note that while 
the majority of the price differential is simply economies of scale, 
most of the remaining price differential between ATA and SCSI disks is 
due to the performance difference of the mechanics (5400 and 7200 vs 
10K and 15K, 1.5 ms short-stroke seek rather than 0.750 ms), rather 
than the cost of the electronics for the host interface.
>
>
>>> For us the problem is that the drive reports the write as having
>>> happened when it hasn't, so the filesystem dependencies end up being
>>> smashed, because the filesystem is writing out data in dependency
>>> order, but if the data is written in a different order to the drive,
>>> the drive can end up being in error in the case of failure.
>>
>> That's the cost of write-behind caching. SCSI gives you enough control
>> to avoid this problem. ATA disks don't, but at least they're
>> inexpensive.
>
> Which is why people call ATA drives "crap", and disk manufacturers
> get upset about it: they are competing on size and spindle speed,
> and somehow seem to have forgotten one of the purposes of their
> products is to _reliably store data_.

Seems to me some OS vendors also have forgotten this; one non-Unix file 
system of considerable popularity uses delayed-writes for all its 
metadata in order to achieve reasonable speed. As an unfortunate 
side-effect, chunks of your filesystem might disappear after a power 
failure. Come to think of it, doesn't Linux ext2fs make the same 
trade-off?
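
The ordering problem discussed above can be sketched with a toy model. This is only an illustration, not a model of any real drive or filesystem: the two block names and the consistency rule (a directory entry must never reach the media before the inode it points to) are assumptions chosen for the example. It enumerates which blocks could be on the media at a power failure, first when the drive commits writes in issue order, then when a write-behind cache may commit acknowledged writes in any order.

```python
from itertools import combinations

# Writes issued by the filesystem in dependency order (hypothetical
# two-block example): the inode must reach the media before the
# directory entry that points to it.
ISSUED = ["inode", "dirent"]


def states_in_order(issued):
    """Media states possible at a crash if writes are committed in issue order."""
    return [set(issued[:n]) for n in range(len(issued) + 1)]


def states_reordered(issued):
    """Media states possible at a crash if a write-behind cache may
    commit already-acknowledged writes in any order."""
    return [set(c)
            for n in range(len(issued) + 1)
            for c in combinations(issued, n)]


def consistent(state):
    """Assumed rule: a directory entry never points at an unwritten inode."""
    return "dirent" not in state or "inode" in state


def violations(states):
    """Return the crash states that break the assumed consistency rule."""
    return [s for s in states if not consistent(s)]


if __name__ == "__main__":
    print("in order: ", violations(states_in_order(ISSUED)))
    print("reordered:", violations(states_reordered(ISSUED)))
```

With in-order commits every crash state is a prefix of the issued writes, so the rule always holds. Once the cache is free to reorder, the state containing only the directory entry becomes reachable, which is the case the filesystem's dependency ordering was meant to rule out.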
>
> The funny thing is that it would cost them nearly nothing, now
> that they have tagged command queues for ATA drives, to put this
> feature into ATA drives, as well... in fact, it may even be no
> more than a firmware hack.

While I'm not intimately familiar with the ATA firmware, I get 
push-back when talking with the folks who do the ATA products because 
they have a small code-base with very scarce CPU cycles and memory, so 
they're concerned about the resource cost of this extra code-path. 
(Note that this function affects only writes, which are not part of the 
tagged command queues in ATA - ATA queuing is only useful for reads; 
ATA gets write-queuing by delayed-write caching. ATA queuing only 
allows disconnecting between sending the command and transferring the 
data; it doesn't allow disconnecting between transferring the data and 
transferring the status. Hence ATA queuing is useless for writes.) They 
are also concerned about complexity; ATA product cycles are very short, 
so there's a desire to keep things simple, to minimize the risk of bugs.
>
>
>> Ick, that could be a big number, maybe a couple of seconds in the very
>> worst-case, I dunno for sure. I think you're probably talking a UPS
>> rather than a large filter cap in the power supply. I think it's
>> technically better to accept that you're not going to get all the data
>> on the disk when power fails, and supply a "power fail" signal to the
>> drive a few sector-times in advance of the power going out of the
>> spec-limits. That way the drive could guarantee that it won't
>> partially overwrite a sector.
>
> That's a really annoying point of view.  8-).

It's an unfortunate reality of physics :-)
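
For a rough sense of what "a few sector-times" means, here is a back-of-the-envelope calculation. The spindle speeds are the ones mentioned earlier in this message; the figure of 600 sectors per track and the choice of three sectors of warning are purely illustrative assumptions, not numbers from any drive specification.

```python
def sector_time_us(rpm, sectors_per_track):
    """Time for one sector to pass under the head, in microseconds."""
    revolution_us = 60.0 / rpm * 1_000_000  # one revolution, in microseconds
    return revolution_us / sectors_per_track


def warning_needed_us(rpm, sectors_per_track, sectors_of_warning):
    """Advance notice needed to finish the given number of sectors."""
    return sector_time_us(rpm, sectors_per_track) * sectors_of_warning


if __name__ == "__main__":
    SECTORS_PER_TRACK = 600   # assumed, for illustration only
    SECTORS_OF_WARNING = 3    # assumed meaning of "a few sector-times"
    for rpm in (5400, 7200, 10_000, 15_000):
        t = warning_needed_us(rpm, SECTORS_PER_TRACK, SECTORS_OF_WARNING)
        print(f"{rpm:>6} RPM: about {t:.1f} us of warning")
```

Under these assumptions the required warning comes out in the tens of microseconds, which is why a power-fail signal looks far more practical than holding the drive up long enough to flush an entire write cache.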
>
> The problem with this approach is that it requires cable changes
> to the drive interface, unless you designate one of the "spare"
> grounds as being inverse AC present signal; even so, you would
> not be guaranteed that the motherboard/controller manufacturers
> have all tied this pin active low in their designs, if it's
> truly a "spare".  That means the disks would not work with some
> motherboards, which is death in a commodity market.

Dunno about the ATA ASICs, but the SCSI ASICs have some GPIO pins that 
could profitably be mined to provide this functionality via a firmware 
change, using one of the pins on the option connector. The OEM would 
need an extra cable to connect to the drive, but this isn't a big deal 
unless you're into SCA connectors. Write it into your next purchase 
spec, wave dollar bills in front of the sales representative, and you 
could get this function.

> I suspect that this is a good reason that, despite the design
> being available in your head, no manufacturer has implemented
> this, even if there was not computer hardware support for it.

Actually, none of the big OEMs are interested, because they'd rather 
give you atomic writes by selling you a big expensive hardware RAID 
box. That's why the functionality hasn't been implemented.

(joke: How to make big money in storage:
1. put price pressure on drive suppliers so that they are forced to 
manufacture crap.
2. design expensive storage system to rectify problems caused by step 1.
3. Profit! )

Show up with a reasonably-sized market and a feature request for 
something that can be implemented in firmware, and you can negotiate to 
get your feature.

>
> Basically, this means that we (filesystems engineers) have two
> wishlist items for disk manufacturers:
>
> 1)	Add logic to the ATA disks to provide the same control
> 	over the ordering of operations (e.g. barriers and
> 	completion notification) that SCSI disks have (per the
> 	above, this may be nothing more than a firmware hack).

As noted above, this is a rather large firmware hack. More like a 
re-write of significant portions of the code. Portions are not even 
implementable in ATA-land (i.e. write-queuing is broken in the 
interface definition). By the time you are done, you have special 
electronics requirements (more SRAM, faster CPU) that are too expensive 
to go into commodity drives. One could hypothesize a low-volume ATA 
drive with special electronics, but such low-volume ATA drives probably 
would cost only slightly less than higher-volume high-performance SCSI 
drives. Why not just buy the SCSI drive in the first place?
>
> 2)	Provide the ability to obtain physical geometry
> 	information from ATA disks, similar to the information
> 	that is returned in SCSI mode page 2.

Write this into your purchase spec and wave dollar bills in front of 
your sales rep. Such info was available from Quantum ATA disks; it's 
probably available from Maxtor's, though I don't know for sure. It's 
solely a firmware change to provide this functionality.

> The first can be a "must enable, disabled by default" item, and
> the second could be a vendor-private command, which keeps both
> of them from being visible to ignorant users of the disks.
>
> If you want to address throwing a chock in the wheels and/or
> dumping the write queue to on-board NVRAM, assuming an inverse
> AC fail notification, if it's turned on (off by default to
> account for floating cable pins, rather than active low, on
> some motherboards, to avoid sabotaging your existing market),
> that would be nice too.  ;^).

How much extra would you pay? Would you buy sole-sourced drives to get 
these features? These are all do-able. Negotiate with your vendors. Ask 
to talk to the drive marketing folks, to get your message heard back at 
the plant.

Regards,
-Steve (not speaking for his employer)

--------
Steve Byan <stephen_byan@maxtor.com>
Design Engineer
Maxtor Corp.
MS 1-3/E23
333 South Street
Shrewsbury, MA 01545
(508) 770-3414

