Date: Sat, 1 Apr 95 15:19:27 MST
From: terry@cs.weber.edu (Terry Lambert)
To: PVinci@ix.netcom.com (Paul Vinciguerra)
Cc: hackers@FreeBSD.org
Subject: Re: large filesystems/multiple disks [RAID]
Message-ID: <9504012219.AA11992@cs.weber.edu>
In-Reply-To: <199504011440.GAA17939@ix3.ix.netcom.com> from "Paul Vinciguerra" at Apr 1, 95 06:40:04 am
> > It should also be noted that this type of arrangement is extremely
> > fragile -- it's order n^2 for n disks more fragile than file systems
> > not spanning disks at all.  You shouldn't attempt this type of
> > thing without being ready to do backups.  Basically, a failure of
> > one disk could theoretically take out all of your real file systems
> > sitting on logical partitions that spanned that one disk.  Pretty
> > gruesome, really.
>
> Why?  Isn't this what the world is moving toward, a la RAID?  It's my
> understanding that RAID spanning/striping adds about 20% to the file
> system, but IMPROVES reliability/stability.
>
> Most RAID systems I've investigated will let you pull a drive from the
> array, swap it with a new drive, and the system will rebuild the data ...
> or so the claims go ..

No one said anything at all about striping or optimistic replication
here.  All we were talking about was logical partitions that span
multiple physical partitions/drives.

It's fragile because you could, for instance, have four file systems
with blocks in the same 16M area of a disk.  A failure of a 16M area
of a disk where the FS is mapped physically instead of logically could
damage at most two file systems, in the statistically unlikely event
of the 16M area spanning a partition boundary.  A failure of the same
16M area in a logical disk allocation environment could potentially
damage 4 file systems (assuming a 4M allocation unit identical to
those used by AIX).  Thus the damage of a typical failure (failures
typically occur in a physically contiguous area of the disk) is
multiplied.

This isn't truly n^2; that's just a close approximation of the rate of
growth of risk per disk added, not the probability itself.  The actual
probability depends on the average logical partition size divided by
the allocation unit size, relative to the number of allocation units
in a particular set of storage.  Also, this is more applicable to
additional drives than initial drives, since the initial partitioning
is likely to cause the allocation units in a particular logical
partition to start life as contiguous regions (with the exception of
the designated overflow).  When you start adding disks AFTER that, you
basically randomly allocate blocks.

If we start to examine file systems that "know" about the origin of
logical allocation units (breaking the logical partitioning as an
abstraction, somewhat), then the file system can employ strategies
(which none of the BSD file systems implement) to ensure data
integrity.

RAID is an array redundancy mechanism, where the striping is a means
of replication for data integrity (not necessarily optimistic
redundant replication to speed access).  The point of RAID is to have
an array of n disks that store n' disks worth of data so that if a
single failure occurs, no data is lost.  The RAID approach is vastly
more complex than simple volume spanning, and IBM JFS is not a soft
implementation of RAID.  It is true, however, that what you end up
with is a logical n' number of disks as far as the machine is
concerned.
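To illustrate what I mean by n disks storing n' disks worth of data,
here's a toy sketch of the parity arithmetic.  This is purely
illustrative C of my own, not code out of any real RAID controller or
the JFS/LVM sources, and the disk count and stripe size are made up:

/*
 * Toy sketch of single-parity redundancy: NDATA data disks plus one
 * parity disk, so the contents of any single failed disk can be
 * rebuilt by XORing the survivors.  Illustration only.
 */
#include <stdio.h>
#include <string.h>

#define NDATA  3      /* data disks (the n' worth of real data)  */
#define STRIPE 8      /* bytes per disk in one stripe, for show   */

int main(void)
{
    unsigned char disk[NDATA + 1][STRIPE];  /* last one is parity  */
    unsigned char rebuilt[STRIPE];
    int i, b, failed = 1;                   /* pretend disk 1 dies */

    /* Fill the data disks with something recognizable. */
    memcpy(disk[0], "filesys0", STRIPE);
    memcpy(disk[1], "filesys1", STRIPE);
    memcpy(disk[2], "filesys2", STRIPE);

    /* Parity disk: XOR of all the data disks, byte by byte. */
    for (b = 0; b < STRIPE; b++) {
        disk[NDATA][b] = 0;
        for (i = 0; i < NDATA; i++)
            disk[NDATA][b] ^= disk[i][b];
    }

    /* Rebuild the failed disk by XORing every surviving disk. */
    for (b = 0; b < STRIPE; b++) {
        rebuilt[b] = 0;
        for (i = 0; i <= NDATA; i++)
            if (i != failed)
                rebuilt[b] ^= disk[i][b];
    }

    printf("lost disk %d, rebuilt contents: %.8s\n",
           failed, (char *)rebuilt);
    return 0;
}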
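And to put a rough number on the fragility argument above, here's a
quick Monte Carlo sketch -- again my own throwaway C, not anything out
of the BSD or AIX sources; the 4 file systems, 4M allocation units and
16M failure region are just the figures from the example:

/*
 * Rough sketch: how many file systems does a contiguous media failure
 * touch when each file system's blocks are laid out contiguously
 * (physical partitions) versus scattered across the disk (logical
 * allocation units handed out after extra drives were added)?
 */
#include <stdio.h>
#include <stdlib.h>

#define NUNITS   256   /* allocation units on the disk (4M each)     */
#define NFS      4     /* file systems sharing the disk              */
#define FAILSPAN 4     /* failure region in units (4 x 4M = 16M)     */
#define TRIALS   100000

static int count_damaged(const int *owner)
{
    int damaged[NFS] = { 0 };
    int start = rand() % (NUNITS - FAILSPAN + 1);
    int i, n = 0;

    for (i = start; i < start + FAILSPAN; i++)
        damaged[owner[i]] = 1;
    for (i = 0; i < NFS; i++)
        n += damaged[i];
    return n;
}

int main(void)
{
    int physical[NUNITS], logical[NUNITS];
    long phys_total = 0, log_total = 0;
    int i, t;

    /* Physical layout: each FS owns one contiguous slice of the disk. */
    for (i = 0; i < NUNITS; i++)
        physical[i] = i / (NUNITS / NFS);

    /* Logical layout: allocation units scattered among the FSes. */
    for (i = 0; i < NUNITS; i++)
        logical[i] = rand() % NFS;

    for (t = 0; t < TRIALS; t++) {
        phys_total += count_damaged(physical);
        log_total  += count_damaged(logical);
    }

    printf("avg file systems hit, contiguous layout: %.2f\n",
           (double)phys_total / TRIALS);
    printf("avg file systems hit, scattered layout:  %.2f\n",
           (double)log_total / TRIALS);
    return 0;
}

With those figures, the contiguous layout averages just over one
damaged file system per failure, while the scattered layout should
come out around 2.7 -- that's the multiplication I'm talking about.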
Actually, I love file system stuff; if you're really interested, I'd
encourage you to look first at what you have at hand and have source
for in FreeBSD.  There are a number of papers under the name "Ficus"
at ftp.cs.ucla.edu; these should probably be added to the
documentation that is considered part of the information in 4.4BSD,
notably John Heidemann's master's thesis.

After that, I'd suggest looking at the IBM publications on JFS; they
are mostly AIX manuals.  I can get the IBM publication numbers from
home if necessary, but any 3.1 manual set would help.  There's also an
IBM "Device Driver Writer's" manual which includes a disk with a
sample GFS implementation (IBM's abstraction for an installable file
system).  This is supplementary documentation, and must be ordered
separately.

There's an internal USL publication called "SVR4 File System Writer's
Guide" that covers the VM and DNLC interfaces much more thoroughly
than "The Magic Garden Explained" (a book I'd recommend, too).  You
can't get it from USL without a source license, but Dell did provide
it to Dell UNIX owners in a slightly edited format at one time.

There are a number of RAID papers at wuarchive.wustl.edu, and nearly
every Usenix file system paper is available online at the
ftp.sage.usenix.org site.  I think this is where I got the RAID II
(RAID the second, not RAID level 2) papers.

There are some documents that you won't find elsewhere at the UK
document archive -- src.ic.ac.uk (I think).  Oh, and I'd recommend
looking at the "Choices" stuff out of the University of Kentucky
(can't remember the FTP address, sorry).

Finally, there is a draft copy of SPEC 1170 and a number of very
detailed papers on file systems at ftp.digibd.com; this is the
majority of the ftp.uiunix.ui.org archives that I pieced together
after UNIX International went under.


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.