Date: Sat, 13 Dec 1997 03:05:27 -0800 (PST)
From: Julian Elischer <julian@whistle.com>
To: Mike Smith <mike@smith.net.au>
Cc: bgingery@gtcs.com, hackers@FreeBSD.ORG
Subject: Re: blocksize on devfs entries (and related)
Message-ID: <Pine.BSF.3.95.971213021310.29160D-100000@current1.whistle.com>
In-Reply-To: <199712130848.TAA01888@word.smith.net.au>
On Sat, 13 Dec 1997, Mike Smith wrote:
>
> I haven't noticed any commentary on this, Brian, so I thought I should
> raise a few points that you appear to have missed.
>
> > Theoretically, the physical layout of the device should be stored
> > whether or not there's any filesystem on it.
The problem is that on all new devices the layout is both hidden and
not easily describable. Nor can we describe the layouts of media that have
not yet been invented. Track/cyl/sector geometry descriptions cannot be
used to describe modern disks, and the picture is muddied by track buffers
and reverse block write order (for example).
>
> This is a fundamentally flawed approach, and I am glad that Julian's
> new SLICE model (at this stage) completely ignores any incidental
> parametric information associated with an extent.
>
> > To me some answers to these ...
> >
> > 1. physical block/sector size needs to be stored by DEVICE
> > this may or may not match the logical blocksize of any
> > filesystem resident on the device. Optimal transfer blocksize
> > for each of read and write ALSO need to be stored.
>
> Physical blocksize vs. logical blocksize is a problematic issue. On
> one hand, there is the desire to maintain simplicity by mandating a
> single blocksize across all boundaries and forcing translation at the
> lowest practical level. The downside with this is dealing with legal
> logical block operations that result in partial block operations at the
> lowest level.
>
In my slice code I propagate the blocksize up. Each layer can make the
decision as to what blocksize it wishes to export further up. A disk array
would probably export a blocksize equal to the largest blocksize of its
component parts.
I might also propagate up a maximum acceptable chunk, and an optimal one,
but how higher layers should combine these becomes unsolvable without
breaking layering (or by preceding every IO call with a call that asks
"If I were to do IO to location X, how big would you like it to be?").
This is not an answer.
> One approach with plenty of historical precedent is to use a blocksize
> "sufficiently large" that it is a multiple of the likely device
> blocksizes, and make that the 'uniform standard'. Another is to
> cascade blocksizes upwards, where the blocksize at a given point in the
> tree is the lowest common multiple of that of all points below. This
> obviously requires some extra smarts in each layer that consumes
> multiple lower layers.
That's what I'm expecting to do (but have not yet done, as I have not yet
written a multiplexing handler).
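Something like this, perhaps (just a sketch off the top of my head; none
of these names are real SLICE code, since the multiplexing handler doesn't
exist yet):

    #include <sys/types.h>

    /*
     * Hypothetical sketch of the blocksize cascade -- NOT the real
     * SLICE code.  A layer consuming several lower layers exports
     * the lowest common multiple of their blocksizes; for the usual
     * power-of-two sizes this is simply the largest of them.
     */
    static u_int
    gcd(u_int a, u_int b)
    {
        while (b != 0) {
            u_int t = a % b;

            a = b;
            b = t;
        }
        return (a);
    }

    static u_int
    lcm(u_int a, u_int b)
    {
        return (a / gcd(a, b) * b);
    }

    static u_int
    export_blocksize(u_int *child_bs, int nchildren)
    {
        u_int bs = 1;
        int i;

        for (i = 0; i < nchildren; i++)
            bs = lcm(bs, child_bs[i]);
        return (bs);
    }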
>
> > 2. physical layout (sect/track, tracks/cyl) also needs to
> > be stored for any DASD. Also any OTHER known info which
> > may be used to optimize the filesystem building process for
> > the device, such as rotational speed, seek timing .. If
> > this is not stored with driver info in the devfs, then
> > some pointer or common reference point should be made to
> > the "file entry" that contains the info.
>
> Physical layout is a joke, and has been for many years. This
> suggestion costs you a lot of credibility.
Some of this information will be available by asking any slice for a
disklabel struct. Fields such as RPM will be propagated up from the
device(s) where possible, while fields such as drive size are reflections
of that particular slice.
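For example, a user program should be able to get at this through the
existing disklabel ioctl. A quick sketch (minimal error handling; the
fields shown are from <sys/disklabel.h>):

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/disklabel.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        struct disklabel dl;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
            perror("open");
            return (1);
        }
        /* DIOCGDINFO asks the slice for its disklabel. */
        if (ioctl(fd, DIOCGDINFO, &dl) < 0) {
            perror("DIOCGDINFO");
            return (1);
        }
        printf("secsize %lu, sectors %lu, rpm %d\n",
            (u_long)dl.d_secsize, (u_long)dl.d_secperunit, dl.d_rpm);
        return (0);
    }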
>
> Qualitative parametric information may be useful, eg. "this disk is
> slow", presuming that a set of usefully general metrics can be
> established. Unfortunately, obtaining measurements such as this can be
> slow, and the results are often nondeterministic.
In the face of confusion, do nothing...
I will probably punt on most of the complicated issues, trusting that
the drive manufacturers will just do the best they can :)
>
> > 3. If at the controller level it is possible to concatenate
> > or RAID join devices, that information needs to be stored
> > for the device. If this is intrinsic to the device driver
> > or the physical device - no matter.
>
> This is not useful. An upper layer should not care whether the extent
> it is consuming is a concatenation of extents. This is an issue for
> management tools, which should have an OOB technique for recovering
> structure information.
As Mike says, the whole aim of the slice layers and related things is to
HIDE this. In earlier email you indicated that you thought I should make
nodes in the devfs, in parallel to the device nodes, that DESCRIBE
the devices,
e.g. /dev/raid0.description might appear to be a file that says
"An array of 5 drives, each of 12.5GB"
I'm not sure I understood this correctly, but when you say 'stored with'
I believe this may be what you mean. If so, then my answer is "nice, but
no dice": device information will probably not be retrieved in this
manner, though I have not thought it through fully either.
>
> > 6. When a device is opened ro, if the underlying hardware has
> > ANY indication that it's a ro open, then if it is later upgraded
> > there should at least be a hook for it to be notified that it
> > has been upgraded. Current state (ro/rw) should be available
> > to user processes without "testing it by opening a write file"
> > to a filesystem (or even raw device).
>
> The RO->RW upgrade notification is a contentious issue, but one that
> definitely needs thinking through. How would you suggest it be
> handled? Should the standard be to reopen the device, or pass a
> special ioctl, or add a new device entrypoint?
r/w 'openness' is, and should be, propagated up and down. I've already
started work on this. The difficult bit is not in the SLICE code, but
rather in the existing system code, which doesn't notify the slice code
when this happens.
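Roughly the idea (a sketch with made-up names, NOT the actual SLICE
interfaces): each layer remembers the modes it has been opened with and
pushes any upgrade down to the layer below:

    #include <sys/types.h>
    #include <sys/errno.h>
    #include <sys/fcntl.h>

    /* Hypothetical layer structure, purely for illustration. */
    struct layer {
        struct layer *below;    /* NULL at the bottom */
        int     openflags;      /* modes we are currently open with */
        int     rdonly;         /* hardware (or lower layer) is r/o */
    };

    static int
    layer_open(struct layer *lp, int flags)
    {
        /* Refuse a write open if anything below is read-only. */
        if ((flags & FWRITE) && lp->rdonly)
            return (EROFS);
        /* Propagate the (possibly upgraded) mode downwards. */
        if (lp->below != NULL) {
            int error = layer_open(lp->below, flags);

            if (error)
                return (error);
        }
        lp->openflags |= flags;
        return (0);
    }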
>
> > Other thoughts. Especially WRT possible experimental work, and
> > emulators, it will be QUITE convenient to have everything that can
> > be used to optimize the construction of a filesystem (of any of many
> > many kinds) or slice-out and construct a filesystem. As wine, dosemu
> > and bochs (to just name three) expand the emulations supporting other
> > OSs, being free with filesystems for those OSs, other than purely
> > "native" becomes all the more important.
>
> I can't actually parse this; I'm not sure if you're actually trying to
> say anything at all.
I THINK he is saying he wants to be able to partition files as devices.
We can already do this with vn devices, and my SLICE code supports them
fully.
>
> > SoftPC/SoftWindows and Bochs both create internally what amounts to a
> > FAT filesystem within a file - a vnode filesystem, but not using
> > system provisions for it. That pretty well eliminates "device" access
> > to the filesystem and (e.g.) doing a mount_msdos on 'em for other
> > processing and data exchange, without adapting the emulator's code
> > to *parallel* what we've already got with FreeBSD.
>
> Incorrect. It is relatively straightforward to create a vnode disk,
> slice it, build a FAT filesystem in one slice and then pass that slice
> to your favorite PC emulator.
>
> > Yet, why deny these the optimization information which will allow
> > them to map (within the constraints of their architecture) a new
> > filesystem for best throughput, if it's actually available.
>
> Because any "optimisation information" that you could pass them would
> be wrong. Any optimisation attempting to operate based on incorrect
> parameters can only be a pessimisation.
>
A file being exported as a device to an emulator would have access
characteristics that depend on its stored allocations. This would be
almost impossible to pass back to an emulator.
> > Now let me raise some additional questions --
> >
> >
> > Should a DASD be mappable ONLY with horizontal slices?
> > With what we're all doing today, it seems that taking a certain
> > number of cylinders for slices is best - but other access methods
> > may find an underlying physical structure more convenient if
> > a slice specifies a range of heads and cylinders that do NOT
> > presume that all heads/cylinders from starting to ending according
> > to physical layout are part of the same slice. It may be quite
> > convenient to have a cluster of heads across physical devices
> > forming a logical device or slice, without fully dedicating those
> > physical devices to that use.
>
What you are describing is possible, but would behave badly with respect
to locality of reference.
> This is a nonsense question in the context of ZBR and "logical extent"
> devices (eg. SCSI, ATAPI, most ATA devices).
The language is a bit harsh, but the idea is correct. We cannot try to
outguess how the manufacturers have laid out the disk. It's not easily
describable, and in the case of block reallocation, possibly not even
constant over time.
> > And, I'll mention again, DISK formats are not the only
> > random-access mass-storage formats on the horizon! I'm guessing
> > that for speed of inclusion into product lines, all will emulate
> > a disk drive - but that may not be the most efficient way of using
> > them (in fact, probably not). They also can be expected to have
> > "direct access" methods according to their physical architecture,
> > with some form of tree-access the MOST efficient!
>
Most new access methods will either:
1/ be totally random access (we get no gain from using geometry);
2/ emulate a disk (using the apparent geometry may be pessimal in the
face of the underlying REAL geometry); or
3/ have a different geometry characteristic that we cannot guess ahead
of time.
> In most cases, the internal architecture of the device will be
> optimised for two basic operations; retrieval of large contiguous
> extents, and read/write of small randomly scattered regions.
>
> Data access patterns are unlikely to change radically, particularly
> given the momentum that modern systems have. I'll let you work out
> what the two above are, and why they are so common. But trust me, they
> are.
Manufacturers will be making devices optimised to do what we want
to do, so for us to try to do something different may even slow things
down.
> > Finally - one of the most powerful potentials of the devfs is
> > handling non-DASD devices! The connecting or turning-on of a device
> > (nic/fax/printer/external-modem/scanner/parallel-to-parallel conn-
> > ection to another PC, even industrial controls of some kind) SHOULD
> > cause it to "arrive". If its turn-on generates a signal that can be
> > caught by a minimal driver, that may trigger a load of a full driver
> > (arrival event) and its inclusion in the devfs listings. Similarly,
> > killing such a device might trigger an immediate or delayed unloading
> > of the same driver, and removal from the devfs.
>
DEVFS is PRIMARILY for non-DASD devices. I have only added the DASD
support in the last month.
> This is trivially obvious, and forms the basic argument for the use of
> DEVFS. You fail to draw the parallel between system startup and the
> conceptual "massive arrival of devices" which is still the major
> argument for such a system.
>
> mike
>
Startup is, as Mike says, just a special case of the more general
"a device has appeared" case. If we can get this to be true for all the
drivers, then a truly dynamic FreeBSD will be a lot closer.
So far I have participated in the following steps towards this:
[A] making devsw[] an array of pointers and making each driver add its
own entry at the time that it is initialised.
[B] adding an initialisation routine to every driver, called by SYSINIT
(eventually this same routine should also be called by the LKM init code,
which is why some drivers have init routines even though they strictly
don't need them; this is thinking ahead -- see the sketch after this
list).
Eventually the LKM code will do:
    call init
    for each possible such device:
        call probe
        call attach
OR
    call init
    ask all busses (e.g. pci) if they have any of these, and if found,
    call the attach routine
By structuring the device drivers this way now, we are trying to make the
transition easier.
[C] adding DEVFS so that arriving devices become visible to the user.
[D] adding SLICE code to cope with arriving DASD partitions.
[E] writing a generic SCSI system to allow SCSI devices to be attached
independently of the adapter type.
[F] getting the BIOS boot code going so that booting is independent of
device types.
[G] adding at_fork(), at_shutdown() and at_exit() facilities to allow
kernel LKM modules to be notified of these events.
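To illustrate [A] and [B], the registration pattern looks something like
this (a sketch for a made-up driver "foo"; the major number and entry
points are hypothetical, and this is from memory, so check an existing
driver for the exact incantation):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/conf.h>

    #define CDEV_MAJOR 99                   /* hypothetical major number */

    static struct cdevsw foo_cdevsw;        /* entry points filled in elsewhere */

    static void
    foo_drvinit(void *unused)
    {
        static int initted = 0;
        dev_t dev;

        if (!initted) {
            /* Add our own entry to the devsw[] pointer array. */
            dev = makedev(CDEV_MAJOR, 0);
            cdevsw_add(&dev, &foo_cdevsw, NULL);
            initted = 1;
        }
    }

    /* Run at boot via SYSINIT; an LKM init could call foo_drvinit too. */
    SYSINIT(foodev, SI_SUB_DRIVERS, SI_ORDER_MIDDLE + CDEV_MAJOR,
        foo_drvinit, NULL)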
I'm sure I've forgotten some, but you should see a pattern here. Over the
last 5 years I've been constantly working towards a more modular and
dynamic FreeBSD. Many of the things I have added are not really fully
appreciated yet, but will become so as soon as more pieces of the jigsaw
are completed.
As for the SLICE and DEVFS code, most of the comments so far are
"so, what's different?"
The system seems to be exactly as before (which of course is exactly what
I want): the outward appearance of no change, but an internal complete
rebuild to be more modular.
Future goals include:
    complete removal of major numbers
    complete removal of the dev_t type from FreeBSD, at least as we
    know it; it may be replaced by a "device reference" of another type.
julian
