FreeBSD Mail Archives

Date:      Mon, 20 Jan 2003 11:43:12 -0800 (PST)
From:      Julian Elischer <julian@elischer.org>
To:        Steve Byan <stephen_byan@maxtor.com>
Cc:        freebsd-fs@FreeBSD.ORG
Subject:   Re: JFS vs. Soft Updates (again) (was: Re: large filesystem, journaling filesystem support)
Message-ID:  <Pine.BSF.4.21.0301201120430.39747-100000@InterJet.elischer.org>
In-Reply-To: <37CA8FF0-2CA5-11D7-962B-00306548867E@maxtor.com>

I hate to enter this argument but....


On Mon, 20 Jan 2003, Steve Byan wrote:

> 
> On Friday, January 17, 2003, at 05:27  AM, Terry Lambert wrote:
> 
> > No, the worst case following a power failure is a screwed disk
> > track.
> 
> I'm skeptical of this claim, unless you mean it in a way that strikes 
> me as rather unusual.
> 
> > Modern disk drives read and write a track at a time; this is to
> > avoid rotational latency that would happen if you waited for a
> > hard "sector start" marker to come around, and it avoids the need
> > for "low level formatting".
> 
> I'm familiar with drives which will re-order their queue of writes for 
> a track (i.e. SCSI disks with write cache enabled, SCSI disks with 
> command-queued writes without a "ordered task" tag, or ATA disks with 
> caching enabled). But you seem to be implying by your mention of 
> "avoiding rotational latency ... waiting for a ... sector start 
> marker" and mention of "low level formatting" that there exists a 
> modern SCSI or ATA disk which writes by simply blasting a whole new 
> track whenever it writes, starting at the current rotational position.

When I was reading Quantum manuals a few years ago I got the distinct
impression that this might be happening.
However, they did say that if you lost power on a drive you could lose
the sector that you were writing at the time. They didn't say
anything about other sectors.
 
[...]

> I know that neither Maxtor's SCSI disks nor their ATA disks blast an 
> entire track in one fell swoop.

Well, that's good to know.

> 
> See below - changes made for higher capacity and higher RPM have made 
> it impossible to use the regenerative braking trick on modern drives.
> 
> >
> > The net result is that a DC failure can result in an entire track
> > getting trashed, if it happens at the right time.
> 
> I'll agree that it can result in partial completion of a queue of 
> writes, with the order of completion being essentially unknowable, and 
> with at most one sector being corrupted, and hence having an invalid 
> ECC (and therefore returning a hard error if read).

It would be nice if the drive had enough NVRAM to hold that one trashed
block so it could rewrite it on power-up.

> 
> If that is your definition of "trashing an entire track", I'll accept 
> it. But if you are implying that more than one sector could be 
> unreadable, or that any sector would return data that had not been 
> written to it without giving an error indication, I disagree. The 
> remaining sectors of the track may have new data or old data, depending 
> on the disk scheduling algorithm, but they would not be "corrupt" in 
> the sense of being unreadable, or of returning bogus data without also 
> returning an error indication.


For us the problem is that the drive reports the write as having
happened when it hasn't, so the filesystem dependencies end up being
smashed. The filesystem is writing out data in dependency order,
but if the drive commits the data to the media in a different order,
the on-disk state can end up inconsistent in the case of failure.
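
To make the ordering problem concrete, here is a toy model in Python. It is
only a sketch, not a model of any real drive or of the actual soft updates
code: the block names and the single dependency rule (an inode must reach the
media before the directory entry that points at it) are assumptions chosen
for illustration. It enumerates every set of blocks that could be on the
media at the moment of power loss, with and without drive reordering.

```python
from itertools import permutations

def is_consistent(on_media):
    # Assumed dependency rule: a directory entry must never be on the
    # media unless the inode it refers to is there too.
    return "dir_entry" not in on_media or "inode" in on_media

def crash_states(issued, drive_reorders):
    """Yield every set of blocks that could be on the media at power loss."""
    orders = permutations(issued) if drive_reorders else [tuple(issued)]
    for order in orders:
        # Power may fail after any number of blocks have been committed.
        for n in range(len(order) + 1):
            yield frozenset(order[:n])

def inconsistent_states(issued, drive_reorders):
    """Return the distinct crash states that violate the dependency rule."""
    return {s for s in crash_states(issued, drive_reorders)
            if not is_consistent(s)}

# Hypothetical issue order chosen by the filesystem.
issued = ["inode", "dir_entry", "data"]

in_order = inconsistent_states(issued, drive_reorders=False)
reordered = inconsistent_states(issued, drive_reorders=True)

print("in-order commit, bad crash states:", len(in_order))
print("reordered commit, bad crash states:", len(reordered))
```

With in-order commit no crash point violates the rule; once the drive is
free to reorder, some crash points leave a directory entry on the media
without its inode.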

> 
> If you wish to have writes complete to the media in the order in which 
> you issued them, then you must either
> a) disable write caching and not use SCSI command queuing for ordered 
> writes
> or
> b) enable write caching but do not use SCSI command queuing, and either
> b1) set the FUA bit in the SCSI CDB and not use command queuing for 
> ordered writes, or
> b2) follow the ATA write command with a "flush cache" command
> or
> c) enable write caching and SCSI command queuing, but
> c1) set the FUA bit in the SCSI CDB and ensure the command has the 
> "ordered task" attribute in its task tag, so that the command will not 
> be reordered.



That is good information.
Maybe the SCSI and ATA guys can experiment on whether any of these modes
gives us acceptable performance.
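
For anyone wanting to experiment with the FUA options, here is a small
Python sketch of what "set the FUA bit in the SCSI CDB" looks like for a
WRITE(10) command. The helper name write10_cdb is made up for illustration,
and nothing here sends a command to a device. Note that the "ordered task"
attribute is not part of the CDB; it travels with the command at the
transport level, so it does not appear in this sketch. The ATA FLUSH CACHE
opcode is included only as a constant for reference.

```python
import struct

WRITE_10 = 0x2A          # SCSI WRITE(10) operation code
FUA_BIT = 0x08           # CDB byte 1, bit 3: Force Unit Access
ATA_FLUSH_CACHE = 0xE7   # ATA FLUSH CACHE command code (reference only)

def write10_cdb(lba, blocks, fua=True):
    """Build the 10-byte WRITE(10) CDB (hypothetical helper).

    lba    -- 32-bit logical block address
    blocks -- 16-bit transfer length in blocks
    fua    -- if True, ask the drive to write to the media before
              reporting completion
    """
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("LBA does not fit in WRITE(10)")
    if not 0 <= blocks <= 0xFFFF:
        raise ValueError("transfer length does not fit in WRITE(10)")
    flags = FUA_BIT if fua else 0
    # Layout: opcode, flags, LBA (big-endian), group number,
    # transfer length (big-endian), control byte.
    return struct.pack(">BBIBHB", WRITE_10, flags, lba, 0, blocks, 0)

cdb = write10_cdb(0x12345678, 256)
print(cdb.hex())
```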

> 
> 
> I agree that it is a shame that drive manufacturers do not offer an 
> "atomic write" feature for a sector. Convince the system manufacturers 
> to supply a "power-fail" warning signal a few milliseconds in advance 
> of the loss of DC power, and I think the drive manufacturers would be 
> happy to provide an atomic write feature.

We did this on the Whistle Interjet-II. We couldn't trust the drive
manufacturer, however, so we had a 70 ms hold-up built in,
which gave us enough time to do things from the kernel.
(The hardest place was Japan, where some places only have AC at about
90 VAC, so that was where the 70 ms was measured. At
240 VAC (Australia) we had almost 200 ms, from memory :-)

The hold-up had to give the drive time to complete a seek, write to a
track, discover that there was a bad sector in that range, reseek and
write the bad part, and reseek back and complete the original write,
possibly overflowing to the next track.

It would be interesting if you could tell us what the minimum hold-up
would be for a drive to complete any particular given write where the
write could be up to 128KB in size, all worst cases.
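
As a rough way to frame that question, the sequence above can be added up
in a few lines of Python. This is a back-of-the-envelope sketch only: the
function name, the parameters, and the assumption that every seek is
followed by a full revolution of rotational latency are mine, and the
numbers in the example call are placeholders, not figures for any real
drive.

```python
def worst_case_holdup_ms(seek_ms, rpm, track_kb,
                         write_kb=128, head_switch_ms=1.0):
    """Estimate the hold-up needed for one worst-case write, in ms.

    seek_ms        -- assumed worst-case seek time
    rpm            -- spindle speed
    track_kb       -- assumed capacity of one track
    write_kb       -- size of the write (128KB as in the question)
    head_switch_ms -- assumed cost of moving on to the next track
    """
    rev_ms = 60_000.0 / rpm                      # one full revolution
    transfer_ms = (write_kb / track_kb) * rev_ms # time spent writing

    total = 0.0
    total += seek_ms + rev_ms          # seek to the target track
    total += transfer_ms               # write the data
    total += seek_ms + rev_ms          # reseek to rewrite the bad sector
    total += seek_ms + rev_ms          # reseek back to the original track
    total += head_switch_ms + rev_ms   # overflow onto the next track
    return total

# Placeholder numbers purely to exercise the arithmetic.
example = worst_case_holdup_ms(seek_ms=10.0, rpm=6000, track_kb=128)
print(example)
```

The real answer depends on figures only the drive manufacturer has, which
is why the inputs are left as parameters.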


> We can no longer use the 
> rotational energy in the platters to keep up the power, because the 
> platter count and media diameter have both steadily decreased - as a 
> result, there is no longer enough rotational inertia to provide the 
> hold-up times needed. Note that it is this reduced platter count and 
> smaller disks which has enabled 10K and 15K RPM disks within the power 
> envelope allotted to a 3.5 inch disk drive.
> 
> Regards,
> -Steve (not speaking officially for his employer)
> --------
> Steve Byan <stephen_byan@maxtor.com>
> Design Engineer
> Maxtor Corp.
> MS 1-3/E23
> 333 South Street
> Shrewsbury, MA 01545
> (508) 770-3414
> 
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-fs" in the body of the message
> 



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?Pine.BSF.4.21.0301201120430.39747-100000>
