From owner-freebsd-current  Thu Feb  1 12:41:41 1996
Return-Path: owner-current
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id MAA27714
          for current-outgoing; Thu, 1 Feb 1996 12:41:41 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id MAA27708
          for <freebsd-current@freebsd.org>; Thu, 1 Feb 1996 12:41:36 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id NAA13606; Thu, 1 Feb 1996 13:37:05 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199602012037.NAA13606@phaeton.artisoft.com>
Subject: Re: invalid primary partition table: no magic
To: julian@ref.tfs.com (Julian Elischer)
Date: Thu, 1 Feb 1996 13:37:04 -0700 (MST)
Cc: terry@lambert.org, bde@zeta.org.au, freebsd-current@freebsd.org,
        j@uriah.heep.sax.de
In-Reply-To: <199602010102.RAA11946@ref.tfs.com> from "Julian Elischer" at Jan 31, 96 05:02:55 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-current@freebsd.org
Precedence: bulk

> consider a BSD partition put at the beginning of a disk..
> 
> at the beginning of the disk we therefore have a fdisk slice AND a disklabel
> slice... which gets to claim the disk? :)
> Possibly the fdisk-slice probe()  method knows that if a BSD  subslice
> starts at 0 then to NOT grab control :)

Well, the fdisk table has to start at 0, so only one can occupy the space
at a time.  My first reaction would be to disallow that arrangement
entirely, or, less optimally, have the BSD subslice code force an *unclaim* by
the FDISK code.

I wasn't really thinking in terms of allowing the claim process to
continue indefinitely; I was thinking of some method of prioritization.

For instance, the difference between a FAT and a VFAT volume is allowing
VFAT to claim the volume first if there is any VFAT data on it already.


> > Because we get range restriction guarantees, FS events in the kernel
> > on the /dev/wc0/t0/p0 device can't screw up the contents of any other
> > slice, period.  No matter what bad calculations the MSDOSFS makes.
> 
> I THINK we might have that now anyhow.. My guess is our problem isn't just
> disk overshoots.

No.  But for development, protection against overshoots is an issue.

Personally, I think overshoots are the current *worst* problem of the
DOSFS code -- there should be no way for it to corrupt an area not
designated as under its control, which is basically what's happening
in the FIPS case.  It's irrelevant that a robust DOSFS would not make
the cluster count cache write mistake that causes that particular
problem: you can't trust the FS's to be robust -- and you *should* not.

> > So now we have the device/slice mess straightened out.
> > 
> > (*) One thing we may want to consider is that the "no claimant" cases
> >     above are nearly the perfect mechanism to cause a callback to the
> >     VFS code to ask each VFS if it wants to "claim" a device -- causing
> >     it to be mounted.
> 
> I don't know about that.. we don't know where to mount it..
> such leaf nodes however would probably be given some sort of descriptive name 
> however, describing what they are...

There are two or three nice approaches to this.


The first would be to cause the fstab to be accessed and the device
mounted based on that information.  This requires a premount of root.

This is somewhat unsatisfactory, in that devices may probe out of
mount dependency order; it would require establishing a shadow frame
under which translucent mounts could take place.

On the other hand, such a frame is useful for establishing a /dev mapping
prior to a root mapping, or allowing a root remapping.  It is also a
useful mechanism for a nomadic computing environment, where mirrored
resources exist at multiple access locations.  Functionally, this would
cause mounts to default to being "union".


The second would use a "drive" paradigm -- similar to DOS.  Each mount
as a result of a callback establishes a "drive mapping".  If you wanted
to, you could set up such a mapping to use "//drivename/..." to access
the per-drive "root".  This has the advantage of not needing a mapping
to a fixed location in the FS hierarchy (implying a mount order that may
be different from the device discovery order).

This would be extremely useful for a nomadic computing system, since an
installed software package could be in "//package-name/" on a remote
resource.  If I took my laptop from a company location in AZ to one in
MA, as long as I could get authenticated to the local net, I would
have access to the package without having it installed locally.

The problem that this has is still there, though: you would have to have
an fstab that mapped by resource name rather than by device and/or FS
type.  The implication of this is clear: the fsck file system recovery
would need to run as part of the kernel for an automount situation.  On
the other hand, there has been little work (other than mine) on an fstyp
type of FS auto-recognition.  An fsck utility could simply be run on all
resources, and call the per-fs-type checker based on the results of a
type identification.  This is moderately SVR4'ish, but is very modular,
allowing for drop in addition of supported file system types.
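
A minimal sketch of that fsck arrangement, assuming a table of per-type
entries.  Everything here is hypothetical: the `struct fstype` layout,
the `fstyp()` and `fsck_any()` names, and in particular the 4-byte
"TOYA"/"TOYB" tags, which are invented stand-ins and not real file
system magic numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-type entry: a recognizer plus the matching checker. */
struct fstype {
    const char *name;
    int (*identify)(const unsigned char *buf, size_t len); /* nonzero: mine */
    int (*check)(const char *resource);                    /* 0: clean */
};

/* Toy recognizers keyed on invented 4-byte tags, not real FS magic. */
static int id_toy_a(const unsigned char *b, size_t n)
{ return n >= 4 && memcmp(b, "TOYA", 4) == 0; }
static int id_toy_b(const unsigned char *b, size_t n)
{ return n >= 4 && memcmp(b, "TOYB", 4) == 0; }

/* Toy checkers: type A always reports clean, type B always reports dirty. */
static int check_toy_a(const char *r) { (void)r; return 0; }
static int check_toy_b(const char *r) { (void)r; return 1; }

/* Supporting a new FS type means adding one row to this table. */
static const struct fstype fstypes[] = {
    { "toya", id_toy_a, check_toy_a },
    { "toyb", id_toy_b, check_toy_b },
};

/* Identify the type of a resource from its leading bytes; NULL if unknown. */
const struct fstype *
fstyp(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < sizeof fstypes / sizeof fstypes[0]; i++)
        if (fstypes[i].identify(buf, len))
            return &fstypes[i];
    return NULL;
}

/* Run the per-type checker chosen by identification; -1 if unidentified. */
int
fsck_any(const char *resource, const unsigned char *buf, size_t len)
{
    const struct fstype *t = fstyp(buf, len);

    return t == NULL ? -1 : t->check(resource);
}
```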

One missing key piece in a user space implementation is forced cleaning
after a boot count has been exceeded.  To accomplish that requires some
of the recent work that has been done for our commercial product:  First,
the sync'er process needs to restore the file system to a "clean" state
after a period of inactivity.  This lets an fssync type operation set
the FS clean, and if the device is locked against user access during
the period, the cleaner can be run against a mounted FS.  Second, root
mounts on unclean FS's are R/O if the clean bit is not set.  The mount
becomes a two-stage "mount, then remount".  If the remount fails, you
leave the FS mounted but read-only.  The consistency check can force a
remount on the FS being marked clean.

At this point the fstab is just a mechanism for specifying resource to
hierarchy mapping and options.  To allow the widest range of possibility,
the remount as read/write should take place when the resource is mapped
to a hierarchy location from the identified resource list -- in the /etc/rc
file's mount of fstab partitions.


A third approach (these three are not the only possible ones!) would be
to use the "last mounted on location" and a bit to tag whether that has
been changed -- requiring an API in the modification of the fstab to
"notify" the underlying FS of changes.


There are additional "enhancements" that can be foreseen -- for instance,
it would be relatively trivial to identify all resources before mount
and sort the mount order based on the implied graph dependency of the
previous mount locations.  This may fail when there is a very complex
mapping -- like a mount of one or more FS's on a vnconfig'ed file on
an existing FS that is not yet mounted -- etc..
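
For the simple case, the ordering can be sketched as a sort on the depth
of the "last mounted on" path, so that "/" comes before "/usr", which
comes before "/usr/local".  The `struct resource` record and the function
names are hypothetical.  This is a depth sort, not a full dependency
graph, so it does not handle the vnconfig case described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one identified resource. */
struct resource {
    const char *name;        /* resource name */
    const char *last_mount;  /* "last mounted on" path recorded in the FS */
};

/* Count path components: "/" is 0, "/usr" is 1, "/usr/local" is 2. */
static int
path_depth(const char *p)
{
    int depth = 0;

    for (; *p != '\0'; p++)
        if (*p == '/' && p[1] != '\0' && p[1] != '/')
            depth++;
    return depth;
}

/* Shallower mount points sort first; ties are broken by path name. */
static int
by_mount_depth(const void *a, const void *b)
{
    const struct resource *ra = a, *rb = b;
    int da = path_depth(ra->last_mount);
    int db = path_depth(rb->last_mount);

    if (da != db)
        return da < db ? -1 : 1;
    return strcmp(ra->last_mount, rb->last_mount);
}

/* Reorder resources so that parents are mounted before their children. */
void
sort_mount_order(struct resource *r, size_t n)
{
    assert(r != NULL || n == 0);
    qsort(r, n, sizeof *r, by_mount_depth);
}
```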


>  The way I've been looking at it 
> is that there are many stackable "disk-like object" drivers that have a bunch
> of methods. The default method is to simply supply an offset and the handle
> to the next layer down, however there are at LEAST the following methods
> available:
> 
> probe
> attach
> doIO   <------ These two are really mutually exclusive
> offset <------ This is a special case of doIO for common simple cases
> parent <------ not used for such things as CCD drivers
> 
> so that if type.doIO is NULL then you simply add type.offset and switch to
> type.parent, which might in turn have a doIO or offset..  etc.
> eventually you hit the methods that were exported by the physical
> device driver. Devices always have a doIO method.

The "non-doIO" case is what I have internally called a logical partition
driver.  This is a simple sector remapping.
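
The traversal described in the quoted text can be sketched like this.
The `struct dlo` layout and the `dlo_io()` name are assumptions made for
the example, and `fake_disk_io()` is a toy stand-in for a physical
driver that only records the sector it was asked for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct dlo;  /* hypothetical "disk-like object" */

typedef int (*doio_fn)(struct dlo *self, uint64_t blkno, void *buf,
    size_t nblk);

struct dlo {
    doio_fn     doIO;    /* NULL for a pure sector-remapping layer */
    uint64_t    offset;  /* added to the block number when doIO is NULL */
    struct dlo *parent;  /* next layer down; NULL at the physical device */
};

/*
 * Walk down the stack: while a layer has no doIO method, add its offset
 * and switch to its parent.  The first layer with a doIO method gets the
 * request with the accumulated block number.
 */
int
dlo_io(struct dlo *d, uint64_t blkno, void *buf, size_t nblk)
{
    assert(d != NULL);
    while (d->doIO == NULL) {
        blkno += d->offset;
        d = d->parent;
        assert(d != NULL);  /* devices always have a doIO method */
    }
    return d->doIO(d, blkno, buf, nblk);
}

/* Toy physical driver: remembers the last sector requested. */
static uint64_t last_blkno;

static int
fake_disk_io(struct dlo *self, uint64_t blkno, void *buf, size_t nblk)
{
    (void)self; (void)buf; (void)nblk;
    last_blkno = blkno;
    return 0;
}
```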

The "doIO" case can be broken into at least three types of drivers: I/O
by intent, I/O by side effect, and I/O by proxy.  The "intent" describes
physical device drivers.  The "side effect" describes multiplex drivers,
like a CDROM or tape changer.  The "proxy" describes volume concatenation,
block level compression, and media perfection drivers. I think "proxy"
drivers need to be given first shot at an "arriving" physical device.


> basically
> 1) when you register a new 'disk-like' object, the 
> 	'disk-object' handler creates a DEVFS entry for it and 
> 	calls the 'probe' method of all known
> 	types until one says "I can handle this".

This defeats your "two types would claim it if they were allowed to"
scenario -- but I agree with it.  I would fix your scenario by fiat
(making it illegal) or by specifying priority, and mandating  that
the author establish an order (and leave a large space between for
binary insertion, like I did with kerninit).
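
A minimal sketch of probing by priority, assuming a small fixed-size
table kept in priority order.  The names, the priority values 1000 and
2000, and the single "tag byte" the toy probes look at are all invented
for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type record; authors pick widely spaced priorities. */
struct dtype {
    const char *name;
    int priority;                             /* lower value probes first */
    int (*probe)(const unsigned char *sec0);  /* nonzero: "I can handle this" */
};

#define MAXTYPES 16
static const struct dtype *types[MAXTYPES];
static size_t ntypes;

/* Insert a type, keeping the table sorted by ascending priority. */
int
dtype_register(const struct dtype *t)
{
    size_t i;

    assert(t != NULL && t->probe != NULL);
    if (ntypes == MAXTYPES)
        return -1;
    for (i = ntypes; i > 0 && types[i - 1]->priority > t->priority; i--)
        types[i] = types[i - 1];
    types[i] = t;
    ntypes++;
    return 0;
}

/* The first type, in priority order, whose probe succeeds claims the object. */
const struct dtype *
dtype_claim(const unsigned char *sec0)
{
    size_t i;

    for (i = 0; i < ntypes; i++)
        if (types[i]->probe(sec0))
            return types[i];
    return NULL;
}

/* Toy probes: "plain" accepts any nonzero tag, "extended" only tag 2. */
static int probe_plain(const unsigned char *s) { return s[0] != 0; }
static int probe_extended(const unsigned char *s) { return s[0] == 2; }

static const struct dtype plain = { "plain", 2000, probe_plain };
static const struct dtype extended = { "extended", 1000, probe_extended };
```

Both toy types would accept tag 2; the more specific one wins only
because it registered with the lower priority value, whatever the
registration order.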

This doesn't necessarily allow support for "host drive + drivespace
drive" both being visible -- something that might be desirable.  The
final fix on that is handled by the "drivespace driver" having a higher
priority and reexporting the host drive as a pseudo-drive (if it's still
allowable) with a tag saying "don't claim" to himself.

> 2) the new method is 'stacked' and its 'attach' method is called.
> 3) The attach method will 'register' any sub-partitions it finds,
> 	(goto 1 for each such sub partition)
> 4) Any sub partition that doesn't have a 'claimant'
> 	still has its devfs entry which becomes the only source of
> 	actions.

I'd add that you could "collapse" a logical device stack to avoid
unnecessary cruft.  Specifically, you'd have to have a "collapsed"
logical device record with a pointer to the original.  Consider:

1)	disk has DOS partition table, claimed by DOS partition driver,
	partition 2 is exported as an offset/length/ptr-to-phys.
2)	partition 2 has BSD slice code, claimed by BSD slice driver,
	slice 'a' is exported as offset/length/ptr-to-P2-part
3)	Reference is "collapsed" to "offset/length/ptr-to-phys", with
	a pointer to the original slice 'a' export.
4)	I/O skips logical placement calculations inherent in stack
	traversal of an "uncollapsed" stack.
5)	Geometry modifications must operate on the uncollapsed stack
	and recalculate the collapsed as necessary.
6)	"proxy" layers limit collapsability.
7)	Stack existence limits the ability to damage structures
	necessary for maintaining currently mounted FS's.
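
Steps 1 through 6 can be sketched as follows.  The `struct layer` and
`struct collapsed` records and the `collapse()` function are hypothetical
names; each layer's offset is taken to be relative to its parent, and a
"proxy" layer stops the collapse because I/O has to pass through it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layer in an uncollapsed stack. */
struct layer {
    uint64_t      offset;    /* start, relative to the parent layer */
    uint64_t      length;    /* size in sectors */
    struct layer *parent;    /* NULL at the physical device */
    int           is_proxy;  /* nonzero: does its own I/O, can't be skipped */
};

/* Collapsed reference: one offset/length/target, plus the original export. */
struct collapsed {
    uint64_t      offset;    /* start, relative to target */
    uint64_t      length;
    struct layer *target;    /* physical device, or the first proxy layer */
    struct layer *origin;    /* the uncollapsed export this was built from */
};

/*
 * Fold the offsets of all pure remapping layers below "top" into a
 * single record.  Must be recomputed whenever geometry changes in the
 * uncollapsed stack.
 */
struct collapsed
collapse(struct layer *top)
{
    struct collapsed c;

    assert(top != NULL);
    c.offset = 0;
    c.length = top->length;
    c.target = top;
    c.origin = top;
    while (!c.target->is_proxy && c.target->parent != NULL) {
        c.offset += c.target->offset;
        c.target = c.target->parent;
    }
    return c;
}
```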

> Notes:
> A 'type' might be a CCD driver, which recognises a label saying
> "part 4 of a 5 part volume"

Definite agreement here.  There must be a recognition mechanism unique
across all type instances... for historical screwups, like FAT vs. VFAT,
you have to punt to ordering and more in-depth analysis than a simple
magic number..

> Every time you register a new 'disk-like' device, a structure is allocated,
> and the 'next' ID is incremented. an entry is put into a hash table so
> that that structure can be easily located, given that ID number.
> The ID number is the minor number..

I think the use of major/minor is not really necessary, except as a
method of exporting devices into the name space at a particular layer
-- that is, it can be maintained by an integer parameter initialized
to zero, the address of which is passed in on each registration by a
layer: it is associated with the export interface structure that
causes the device name to show up.
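
As a rough sketch of that registration, assuming a hypothetical
`export_device()` that is handed the address of the exporting layer's
own counter (the name and the prefix-plus-number naming scheme are
invented here):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Export one device into the name space of a layer.  The caller owns
 * the counter, initialized to zero, and passes its address on every
 * registration.  Returns the unit assigned, or -1 if the name does not
 * fit in the buffer (the counter is left untouched in that case).
 */
int
export_device(const char *prefix, int *next_unit, char *name, size_t len)
{
    int n;

    assert(prefix != NULL && next_unit != NULL && name != NULL);
    n = snprintf(name, len, "%s%d", prefix, *next_unit);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return (*next_unit)++;
}
```

Each layer keeps its own counter, so numbering in one layer says
nothing about numbering in another.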

> This means that there is no encoding of bits in minor numbers.

This is a good idea -- this information should be encoded in the
hierarchy in the devfs anyway, IMO.  One example of this is pty
designation, which I'd like to see set up as a directory of cloning
devices acquired by ioctl() on the controlling device (the directory
their name space export occurs in).

> It also means that minors might be different each time you boot
> (that's why devfs).. 

Programs should nominally ignore minor numbers in any case.

> The whole thing hangs off a NEW major number and might be done in
> parallel with the existing system for a while..

Time to murder mknod, MAKEDEV, etc., IMO.  To hell with them.  8-).

Really, we should consider throwing them out entirely; we've been
in a migration state for some time.  The missing piece is the devfs
/dev and / mount-interaction... and we're discussing that here.


> Because probing can be tricky I plan on passing 'context' hints
> at probe time so that various probe routines are not working totally in
> the dark as to what happened before..
> (e.g. finding an fdisk slice within an fdisk slice is legal but
> should be treated differently (I think block numbers are not absolute in
> extended partitions .. needs confirmation)).

I think this is exposed by the hierarchy in the example I posted; the real
issue is applicability of interfaces at the higher layers.  This can be
handled by physical and logical attribution -- both of which "bleed up".

The physical attribute bits are determined (and assigned) by the device
driver (removable media, read-only media, arrival notification, etc.).

The logical attribute bits are set by each layer (has media perfection,
don't allow more, has compression, don't allow more, etc.).

Finally, there is attribution by identifier, where, for instance, DOS
partitioning can disallow itself by virtue of a predecessor device
having DOS compression already.  This would have to be carefully
handled, since this would disallow vnconfig'ed devices as disks unless
attribution could be changed at vnconfig level (creation of ISOFS
images, etc. would need this).

> Writing a new disk driver gets to be really simple..
> write basic IO routines,
> register a disk-like device.. stand back and await work..

Yes, exactly.  It also provides for the ability to do "media arrival" for
removable devices that do notification, and FS callback for "media
validation" for already mounted FS's on removable media without
notification (I have shot myself in the foot swapping floppies on a
mounted drive on more than one occasion).


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.