From: Terry Lambert
Message-Id: <199602012037.NAA13606@phaeton.artisoft.com>
Subject: Re: invalid primary partition table: no magic
To: julian@ref.tfs.com (Julian Elischer)
Date: Thu, 1 Feb 1996 13:37:04 -0700 (MST)
Cc: terry@lambert.org, bde@zeta.org.au, freebsd-current@freebsd.org, j@uriah.heep.sax.de
In-Reply-To: <199602010102.RAA11946@ref.tfs.com> from "Julian Elischer" at Jan 31, 96 05:02:55 pm

> consider a BSD partition put at the beginning of a disk..
>
> at the beginning of the disk we therefore have a fdisk slice AND a disklabel
> slice... which gets to claim the disk? :)
> Possibly the fdisk-slice probe() method knows that if a BSD subslice
> starts at 0 then to NOT grab control :)

Well, the fdisk table has to start at 0, so only one can occupy the space at a time. My first reaction would be to disallow that arrangement altogether, or, less optimally, have the BSD subslice code force an *unclaim* by the FDISK code.

I wasn't really thinking in terms of allowing the claim process to continue indefinitely; I was thinking of some method of prioritization. For instance, a FAT and a VFAT volume can be told apart by allowing VFAT to claim the volume first if there is any VFAT data on it already.
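
The prioritization idea can be sketched roughly as follows. This is a minimal illustration, not code from any existing kernel: the `claimant` structure, the priority values, and the one-byte "signatures" used by the toy probes are all assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical claimant record. Lower priority values are preferred;
 * the gaps between values leave room to insert new claimants later. */
struct claimant {
    const char *name;
    int priority;
    /* Returns nonzero if this type recognises the data. */
    int (*probe)(const unsigned char *data, size_t len);
};

/* Toy probes: 'F' stands in for "looks like FAT" and a following 'V'
 * for "already contains VFAT data". These are NOT real on-disk formats. */
int probe_fat(const unsigned char *data, size_t len)
{
    return len >= 1 && data[0] == 'F';
}

int probe_vfat(const unsigned char *data, size_t len)
{
    return len >= 2 && data[0] == 'F' && data[1] == 'V';
}

const struct claimant demo_claimants[] = {
    { "fat",  200, probe_fat  },
    { "vfat", 100, probe_vfat },
};

/* Ask every claimant; of those that recognise the data, return the one
 * with the lowest priority value, or NULL if nobody claims it. */
const struct claimant *claim_volume(const struct claimant *tab, size_t n,
                                    const unsigned char *data, size_t len)
{
    const struct claimant *best = NULL;

    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tab[i].probe(data, len))
            continue;
        if (best == NULL || tab[i].priority < best->priority)
            best = &tab[i];
    }
    return best;
}
```

With this table, a volume that both probes recognise goes to "vfat" because of its lower priority value, while a volume with no VFAT marker falls through to "fat".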

> > Because we get range restriction guarantees, FS events in the kernel
> > on the /dev/wc0/t0/p0 device can't screw up the contents of any other
> > slice, period.  No matter what bad calculations the MSDOSFS makes.
>
> I THINK we might have that now anyhow.. My guess our problem isn't just
> disk overshoots.

No. But for development, protection against overshoots is an issue.

Personally, I think overshoots are the current *worst* problem of the DOSFS code -- there should be no way for it to corrupt an area not designated as under its control, which is basically what's happening in the FIPS case. It's irrelevant that a robust DOSFS would not make the cluster count cache write mistake that causes that particular problem: you can't trust the FS's to be robust -- and you *should* not.

> > So now we have the device/slice mess straightened out.
> >
> > (*) One thing we may want to consider is that the "no claimant" cases
> > above are nearly the perfect mechanism to cause a callback to the
> > VFS code to ask each VFS if it wants to "claim" a device -- causing
> > it to be mounted.
>
> I don't know about that.. we don't know where to mount it..
> such leaf nodes however would probably be given some sort of descriptive name
> however, describing what they are...

There are two or three nice approaches to this.

The first would be to cause the fstab to be accessed and the device mounted based on that information. This requires a premount of root. This is somewhat unsatisfactory, in that devices may probe out of mount dependency order; it would require establishing a shadow frame under which translucent mounts could take place. On the other hand, such a frame is useful for establishing a /dev mapping prior to a root mapping, or allowing a root remapping. It is also a useful mechanism for a nomadic computing environment, where mirrored resources exist at multiple access locations. Functionally, this would cause mounts to default to being "union".

The second would use a "drive" paradigm -- similar to DOS. Each mount as a result of a callback establishes a "drive mapping". If you wanted to, you could set up such a mapping to use "//drivename/..." to access the per-drive "root". This has the advantage of not needing a mapping to a fixed location in the FS hierarchy (implying a mount order that may be different from the device discovery order). This would be extremely useful for a nomadic computing system, since an installed software package could be in "//package-name/" on a remote resource. If I took my laptop from a company location in AZ to one in MA, as long as I could get authenticated to the local net, I would have access to the package without having it installed locally.

One problem remains, though: you would have to have an fstab that mapped by resource name rather than by device and/or FS type.

The implication of this is clear: the fsck file system recovery would need to run as part of the kernel for an automount situation.

On the other hand, there has been little work (other than mine) on an fstyp type of FS auto-recognition. An fsck utility could simply be run on all resources, and call the per-fs-type checker based on the results of a type identification. This is moderately SVR4'ish, but is very modular, allowing for drop-in addition of supported file system types.

One missing key piece in a user space implementation is forced cleaning after a boot count has been exceeded. To accomplish that requires some of the recent work that has been done for our commercial product:

First, the sync'er process needs to restore the file system to a "clean" state after a period of inactivity. This lets an fssync type operation set the FS clean, and if the device is locked against user access during the period, the cleaner can be run against a mounted FS.

Second, root mounts of FS's are R/O if the clean bit is not set. The mount becomes a two-stage "mount, then remount".
If the remount fails, you leave the FS mounted but read-only. The consistency check can force a remount when the FS is marked clean.

At this point the fstab is just a mechanism for specifying resource to hierarchy mapping and options. To allow the widest range of possibility, the remount as read/write should take place when the resource is mapped to a hierarchy location from the identified resource list -- in the /etc/rc file's mount of fstab partitions.

A third approach (these three are not the only possible ones!) would be to use the "last mounted on location" and a bit to tag whether that has been changed -- requiring an API in the modification of the fstab to "notify" the underlying FS of changes.

There are additional "enhancements" that can be foreseen -- for instance, it would be relatively trivial to identify all resources before mount and sort the mount order based on the implied graph dependency of the previous mount locations. This may fail when there is a very complex mapping -- like a mount of one or more FS's on a vnconfig'ed file on an existing FS that is not yet mounted -- etc..

> The way I've been looking at it
> is that there are many stackable "disk-like object" drivers that have a bunch
> of methods. The default method is to simply supply an offset and the handle
> to the next layer down, however there are at LEAST the following methods
> available:
>
>	probe
>	attach
>	doIO	<------ These two are really mutually exclusive
>	offset	<------ This is a special case of doIO for common simple cases
>	parent	<------ not used for such things as CCD drivers
>
> so that if type.doIO is NULL then you simply add type.offset and switch to
> type.parent, which might in turn have a doIO or offset.. etc.
> eventually you hit the methods that were exported by the physical
> device driver.

Devices always have a doIO method.

The "non-doIO" case is what I have internally called a logical partition driver. This is a simple sector remapping.
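
The offset/parent walk described in the quoted text, combined with the range restriction guarantee discussed earlier, can be sketched like this. It is only an illustration under stated assumptions: `struct dlayer`, `dlayer_io`, and the toy physical routine `toy_phys_doio` are hypothetical names, and a real driver would carry buffers, flags, and error codes rather than a bare block number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct dlayer;
typedef int (*doio_fn)(struct dlayer *self, uint64_t blkno, uint64_t nblks);

/* Hypothetical "disk-like object" layer. */
struct dlayer {
    const char *name;
    uint64_t offset;        /* start within the parent, in blocks */
    uint64_t length;        /* size of this layer, in blocks */
    struct dlayer *parent;  /* NULL for a physical device */
    doio_fn doio;           /* NULL means "simple sector remapping" */
    uint64_t last_blkno;    /* demo only: last block seen by toy doIO */
};

/* Toy physical driver: just records where the I/O landed. */
int toy_phys_doio(struct dlayer *self, uint64_t blkno, uint64_t nblks)
{
    assert(self != NULL);
    (void)nblks;
    self->last_blkno = blkno;
    return 0;
}

/* Walk down the stack. Each layer first checks that the request stays
 * inside its own range, then either handles it (doIO) or adds its
 * offset and hands it to the parent. Returns -1 on an overshoot or a
 * malformed stack. */
int dlayer_io(struct dlayer *l, uint64_t blkno, uint64_t nblks)
{
    while (l != NULL) {
        if (blkno > l->length || nblks > l->length - blkno)
            return -1;              /* range restriction */
        if (l->doio != NULL)
            return l->doio(l, blkno, nblks);
        blkno += l->offset;         /* logical partition remap */
        l = l->parent;
    }
    return -1;                      /* never reached a doIO method */
}
```

Because every layer checks the request against its own length before passing it on, a miscalculation in a file system sitting on a slice is rejected at that slice instead of landing in a neighbouring one.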

The "doIO" case can be broken into at least three types of drivers: I/O by intent, I/O by side effect, and I/O by proxy.

The "intent" describes physical device drivers.

The "side effect" describes multiplex drivers, like a CDROM or tape changer.

The "proxy" describes volume concatenation, block level compression, and media perfection drivers.

I think "proxy" drivers need to be given first shot at an "arriving" physical device.

> basically
> 1) when you register a new 'disk-like' object, the
> 'disk-object' handler creates a DEVFS entry for it and
> calls the 'probe' method of all known
> types until one says "I can handle this".

This defeats your "two types would claim it if they were allowed to" scenario -- but I agree with it. I would fix your scenario by fiat (making it illegal) or by specifying priority, and mandating that the author establish an order (and leave a large space between for binary insertion, like I did with kerninit).

This doesn't necessarily allow support for "host drive + drivespace drive" both being visible -- something that might be desirable.

The final fix for that is for the "drivespace driver" to have a higher priority and to reexport the host drive as a pseudo-drive (if it's still allowable) with a tag saying "don't claim" to itself.

> 2) the new method is 'stacked' and its 'attach' method is called.
> 3) The attach method will 'register' any sub-partitions it finds,
> (goto 1 for each such sub partition)
> 4) Any sub partition that doesn't have a 'claimant'
> still has its devfs entry which becomes the only source of
> actions.

I'd add that you could "collapse" a logical device stack to avoid unnecessary cruft. Specifically, you'd have to have a "collapsed" logical device record with a pointer to the original. Consider:

1) disk has DOS partition table, claimed by DOS partition driver, partition 2 is exported as an offset/length/ptr-to-phys.
2) partition 2 has BSD slice code, claimed by BSD slice driver, slice 'a' is exported as offset/length/ptr-to-P2-part.
3) Reference is "collapsed" to "offset/length/ptr-to-phys", with a pointer to the original slice 'a' export.
4) I/O skips logical placement calculations inherent in stack traversal of an "uncollapsed" stack.
5) Geometry modifications must operate on the uncollapsed stack and recalculate the collapsed as necessary.
6) "proxy" layers limit collapsibility.
7) Stack existence limits the ability to damage structures necessary for maintaining currently mounted FS's.

> Notes:
> A 'type' might be a CCD driver, which recognises a label saying
> "part 4 of a 5 part volume"

Definite agreement here. There must be a recognition mechanism unique across all type instances... for historical screwups, like FAT vs. VFAT, you have to punt to ordering and more in-depth analysis than a simple magic number..

> Every time you register a new 'disk-like' device, a structure is allocated,
> and the 'next' ID is incremented. An entry is put into a hash table so
> that that structure can be easily located, given that ID number.
> The ID number is the minor number..

I think the use of major/minor is not really necessary, except as a method of exporting devices into the name space at a particular layer -- that is, it can be maintained by an integer parameter initialized to zero, the address of which is passed in on each registration by a layer: it is associated with the export interface structure that causes the device name to show up.

> This means that there is no encoding of bits in minor numbers.

This is a good idea -- this information should be encoded in the hierarchy in the devfs anyway, IMO. One example of this is pty designation, which I'd like to see set up as a directory of cloning devices acquired by ioctl() on the controlling device (the directory their name space export occurs in).
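
The "collapsed" logical device record from the numbered list above can be sketched as follows. This is a minimal illustration with made-up names (`struct lnode`, `struct collapsed`, `collapse`); it assumes each logical layer is a plain offset/length remap and that a "proxy" layer is simply flagged as such.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layer in the uncollapsed stack. */
struct lnode {
    uint64_t offset;        /* start within the parent, in blocks */
    uint64_t length;        /* size in blocks */
    struct lnode *parent;   /* NULL for the physical device */
    int is_proxy;           /* nonzero: I/O must pass through this layer */
};

/* Hypothetical collapsed record: one offset/length pair aimed directly
 * at the lowest layer that can be reached by pure remapping. */
struct collapsed {
    uint64_t offset;        /* accumulated offset into 'target' */
    uint64_t length;        /* length of the original export */
    struct lnode *target;   /* physical device, or first proxy layer */
    struct lnode *origin;   /* pointer back to the original export */
};

/* Sum the offsets of the plain remapping layers, stopping at the
 * physical device or at the first proxy layer (item 6 in the list).
 * After a geometry change on the uncollapsed stack (item 5), the
 * record is rebuilt by calling this again. */
struct collapsed collapse(struct lnode *top)
{
    struct collapsed c;
    struct lnode *l = top;

    assert(top != NULL);
    c.offset = 0;
    c.length = top->length;
    c.origin = top;
    while (l->parent != NULL && !l->is_proxy) {
        c.offset += l->offset;
        l = l->parent;
    }
    c.target = l;
    return c;
}
```

For a slice at offset 50 inside a partition at offset 100, the collapsed record points straight at the physical device with offset 150, so per-request I/O no longer has to traverse the intermediate layers.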

> It also means that minors might be different each time you boot
> (that's why devfs)..

Programs should nominally ignore minor numbers in any case.

> The whole thing hangs off a NEW major number and might be done in
> parallel with the existing system for a while..

Time to murder mknod, MAKEDEV, etc., IMO. To hell with them. 8-).

Really, we should consider throwing them out entirely; we've been in a migration state for some time. The missing piece is the devfs /dev and / mount-interaction... and we're discussing that here.

> Because probing can be tricky I plan on passing 'context' hints
> at probe time so that various probe routines are not working totally in
> the dark as to what happened before..
> (e.g. finding a fdisk slice within an fdisk slice is legal but
> should be treated differently (I think block numbers are not absolute in
> extended partitions .. needs confirmation).

I think this is exposed by the hierarchy in the example I posted; the real issue is applicability of interfaces at the higher layers. This can be handled by physical and logical attribution -- both of which "bleed up". The physical attribute bits are determined (and assigned) by the device driver (removable media, read-only media, arrival notification, etc.). The logical attribute bits are set by each layer (has media perfection, don't allow more, has compression, don't allow more, etc.).

Finally, there is attribution by identifier, where, for instance, DOS partitioning can disallow itself by virtue of a predecessor device having DOS compression already. This would have to be carefully handled, since this would disallow vnconfig'ed devices as disks unless attribution could be changed at vnconfig level (creation of ISOFS images, etc. would need this).

> Writing a new disk driver gets to be really simple..
> write basic IO routines,
> register a disk-like device.. stand back and await work..

Yes, exactly.
It also provides for the ability to do "media arrival" for removable devices that do notification, and FS callback for "media validation" for already mounted FS's on removable media without notification (I have shot myself in the foot swapping floppies on a mounted drive on more than one occasion).


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.