From owner-freebsd-hackers  Tue Dec  9 14:13:08 1997
Return-Path: <owner-freebsd-hackers>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.7/8.8.7) id OAA10842
          for hackers-outgoing; Tue, 9 Dec 1997 14:13:08 -0800 (PST)
          (envelope-from owner-freebsd-hackers)
Received: from home.gtcs.com (home.gtcs.com [206.54.69.238])
          by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id OAA10825
          for <hackers@FreeBSD.ORG>; Tue, 9 Dec 1997 14:12:58 -0800 (PST)
          (envelope-from bruce@gtcs.com)
Received: from gtcs.com (localhost.gtcs.com [127.0.0.1])
	by home.gtcs.com (8.8.5/8.8.5) with ESMTP id PAA07923
	for <hackers@FreeBSD.ORG>; Tue, 9 Dec 1997 15:09:44 -0700 (MST)
Disposition-Notification-To: bgingery@gtcs.com
Message-Id: <199712092209.PAA07923@home.gtcs.com>
Date: Tue, 9 Dec 1997 15:09:42 -0700 (MST)
From: bgingery@gtcs.com
Reply-To: bgingery@gtcs.com
Subject: Re: blocksize on devfs entries (and related)
To: hackers@FreeBSD.ORG
In-Reply-To: <199712082322.PAA27177@hub.freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

Julian Elischer <julian@whistle.com> alerted:
-> In spec_getpages() the size of the device's blocks is incorrectly
-> deduced from the blocksize of the filesystem in which the device
-> resided ... This is so obviously wrong that I'm not worried about
-> whether it SHOULD be fixed, just HOW?
[munch]
-> How does this information GET to this location?
[munch]
-> When a device is 'upgraded' to read-write from read-only, the vnode
-> is consulted, to see if it is permissible, but the device itself is
-> not notified of the change.

Theoretically, the physical layout of the device should be stored
whether or not there's any filesystem on it.  Besides, an optimizing
filesystem may *not* match the parameters of the device.  Slew may be
intrinsic to the device's low-level formatting, or may be applied as
high as the filesystem on it.  Logical blocksize MAY or MAY NOT be a
multiple or sub-multiple of the physical sector (or other block) size.
With read-aheads and large device buffering, the optimal physical
handling size for a device may be quite different from its basic
blocksize, but constraints of a specific filesystem above that may
cause it to need to store values that are NOT directly related to the
device parameters.
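
As a minimal sketch of the multiple/sub-multiple point, the helper
below classifies how a logical block size relates to a physical sector
size.  The names blk_relation and blk_classify are hypothetical and are
not part of any FreeBSD interface.

```c
#include <assert.h>
#include <stddef.h>

/* How a filesystem's logical block relates to the physical sector. */
enum blk_relation {
    BLK_EQUAL,        /* logical == physical */
    BLK_MULTIPLE,     /* logical is a whole number of physical sectors */
    BLK_SUBMULTIPLE,  /* several logical blocks fit in one sector */
    BLK_UNRELATED     /* neither size divides the other evenly */
};

/* Classify the relation; both sizes are in bytes and must be non-zero. */
enum blk_relation
blk_classify(size_t logical, size_t physical)
{
    assert(logical > 0 && physical > 0);
    if (logical == physical)
        return BLK_EQUAL;
    if (logical % physical == 0)
        return BLK_MULTIPLE;
    if (physical % logical == 0)
        return BLK_SUBMULTIPLE;
    return BLK_UNRELATED;
}
```

An 8192-byte filesystem block on 512-byte sectors would classify as
BLK_MULTIPLE; a 512-byte logical block on a 2048-byte sector would
classify as BLK_SUBMULTIPLE.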

Yet, especially with slicing, each slice creation needs hardware info
AND possibly enclosing slice info, both for the best slice arrangement
and for passing through to a filesystem creation routine.

I see this devfs as some departure from previous handling, and a move
in the right direction.  Yet, let's not lose anything that's there in
the move, and let's try to take up anything that's been overlooked in
the past.

To me, some answers to these are:

     1.  physical block/sector size needs to be stored by DEVICE;
        this may or may not match the logical blocksize of any
        filesystem resident on the device.  Optimal transfer blocksize
        for each of read and write ALSO needs to be stored.
        
     2.  physical layout (sect/track, tracks/cyl) also needs to
        be stored for any DASD.  Also any OTHER known info which
        may be used to optimize the filesystem building process for
        the device, such as rotational speed, seek timing ..  If
        this is not stored with driver info in the devfs, then
        some pointer or common reference point should be made to
        the "file entry" that contains the info.
        
     3.  If at the controller level it is possible to concatenate
        or RAID join devices, that information needs to be stored
        for the device.  Whether this is intrinsic to the device
        driver or to the physical device does not matter.
        
     4.  If the device is "virtual", built on a vnode structure with
        variable-sized (or as is more common today, FIXED size)
        underlying file, this needs to be known.  I don't think I've
        seen variable-sized-devices anyplace but on NeXT's old swapfile
        structure, yet. With various emulators, I can see this becoming
        a VERY useful thing.  I've seen times recently when it would
        be handy to have a secondary swap in a variable-sized file,
        as primary swap is on my old NeXTcube!
        
     5.  Some kind of "relative timing" metric should be available
        for the device, and separately for writing and reading.

     6.  When a device is opened ro, if the underlying hardware has
        ANY indication that it's a ro open, then if it is later upgraded
        there should at least be a hook for it to be notified that it
        has been upgraded.  Current state (ro/rw) should be available
        to user processes without "testing it by opening a file for
        write" on a filesystem (or even raw device).
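
One way to picture items 1 through 6 is as a single per-device record
with an upgrade hook.  This is only a sketch under stated assumptions:
struct dev_params, dev_upgrade_rw, dev_is_writable and note_upgrade are
hypothetical names, not existing devfs or driver interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device record; none of these names exist in FreeBSD. */
struct dev_params {
    size_t   phys_block_size;   /* item 1: physical sector size, bytes */
    size_t   opt_read_size;     /* item 1: optimal read transfer, bytes */
    size_t   opt_write_size;    /* item 1: optimal write transfer, bytes */
    unsigned sectors_per_track; /* item 2: 0 if not meaningful */
    unsigned tracks_per_cyl;    /* item 2 */
    unsigned rpm;               /* item 2: rotational speed, 0 if unknown */
    int      is_composite;      /* item 3: concatenated or RAID-joined */
    int      is_file_backed;    /* item 4: built on a vnode/file */
    int      is_variable_size;  /* item 4: backing file may grow */
    unsigned read_cost;         /* item 5: relative timing, unitless */
    unsigned write_cost;        /* item 5 */
    int      writable;          /* item 6: 0 = ro, 1 = rw */
    unsigned upgrades_seen;     /* demo only: counts hook invocations */
    /* item 6: optional hook called when a ro open is upgraded to rw */
    void   (*on_upgrade)(struct dev_params *);
};

/* Example hook: a real driver would reconfigure hardware here. */
void
note_upgrade(struct dev_params *dp)
{
    dp->upgrades_seen++;
}

/* Upgrade ro -> rw and notify the driver.  Returns 1 if state changed. */
int
dev_upgrade_rw(struct dev_params *dp)
{
    assert(dp != NULL);
    if (dp->writable)
        return 0;
    dp->writable = 1;
    if (dp->on_upgrade != NULL)
        dp->on_upgrade(dp);
    return 1;
}

/* Item 6: report the state without attempting a write open. */
int
dev_is_writable(const struct dev_params *dp)
{
    assert(dp != NULL);
    return dp->writable;
}
```

The hook is optional on purpose, so a driver with nothing to do on an
upgrade would simply leave on_upgrade as NULL.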

  Other thoughts.  Especially WRT possible experimental work and
  emulators, it will be QUITE convenient to have available everything
  that can be used to optimize the construction of a filesystem (of any
  of many kinds), or to slice out and construct a filesystem.  As wine,
  dosemu and bochs (to name just three) expand the emulations supporting
  other OSs, being free with filesystems for those OSs, other than
  purely "native" ones, becomes all the more important.

  SoftPC/SoftWindows and Bochs both create internally what amounts to a
  FAT filesystem within a file - a vnode filesystem, but not using
  system provisions for it.  That pretty well eliminates "device" access
  to the filesystem and (e.g.) doing a mount_msdos on 'em for other
  processing and data exchange, without adapting the emulator's code
  to *parallel* what we've already got with FreeBSD.

  Ideally, wine, softpc, dosemu, bochs, mtools, and mount_msdos (etc)
  would have NO idea what device the FAT filesystem resides on.  That
  is not the business of that "layer".  Similarly for MINIX or OS/2
  under bochs, etc, for the filesystem in use.  These
  would not care if it's a whole drive somewhere, a slice of a drive,
  slice of a slice, or virtual filesystem that resides totally within a
  file in a UFS.
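
  The layering idea can be sketched as a small block-store interface
  where the consumer sees only a read operation and never learns what
  backs it.  struct blk_store, mem_read and has_boot_signature are
  hypothetical names for illustration; the memory backend merely stands
  in for a file, slice, or whole drive.  The only outside fact relied
  on is the 0x55 0xAA signature at offset 510 of a boot sector.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical block-store interface: the consumer sees only read(). */
struct blk_store {
    /* Copy len bytes at byte offset off into buf; 0 ok, -1 on error. */
    int  (*read)(struct blk_store *bs, size_t off, void *buf, size_t len);
    void  *ctx;   /* backend-private state */
    size_t size;  /* total size in bytes */
};

/* Memory backend, standing in for a file, slice, or whole drive. */
int
mem_read(struct blk_store *bs, size_t off, void *buf, size_t len)
{
    if (off > bs->size || len > bs->size - off)
        return -1;
    memcpy(buf, (const unsigned char *)bs->ctx + off, len);
    return 0;
}

/* Consumer: checks the 0x55 0xAA boot-sector signature at offset 510. */
int
has_boot_signature(struct blk_store *bs)
{
    unsigned char sig[2];

    assert(bs != NULL && bs->read != NULL);
    if (bs->read(bs, 510, sig, sizeof(sig)) != 0)
        return 0;
    return sig[0] == 0x55 && sig[1] == 0xAA;
}
```

  Swapping mem_read for a backend that reads from a file or a raw
  device would leave has_boot_signature unchanged, which is the point
  of keeping the "layer" ignorant of the device.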

  Yet, why deny these the optimization information which will allow
  them to map (within the constraints of their architecture) a new
  filesystem for best throughput, if it's actually available?

  Now let me raise some additional questions --


       Should a DASD be mappable ONLY with horizontal slices?
   With what we're all doing today, it seems that taking a certain
   number of cylinders for slices is best - but other access methods
   may find an underlying physical structure more convenient if
   a slice specifies a range of heads and cylinders, and does NOT
   presume that all heads/cylinders from start to end of the physical
   layout are part of the same slice.  It may be quite
   convenient to have a cluster of heads across physical devices
   forming a logical device or slice, without fully dedicating those
   physical devices to that use.
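
   A slice given as a cylinder range AND a head range could be sketched
   as below.  struct geom, struct chs_slice and the three functions are
   hypothetical names for illustration; the CHS-to-linear formula is
   the usual one, with sectors numbered from 1.

```c
#include <assert.h>

/* Hypothetical geometry and slice descriptors, for illustration only. */
struct geom {
    unsigned cyls, heads, spt;  /* spt = sectors per track */
};
struct chs_slice {
    unsigned cyl_lo, cyl_hi;    /* inclusive cylinder range */
    unsigned head_lo, head_hi;  /* inclusive head range */
};

/* Usual CHS -> linear block address; sectors are numbered from 1. */
unsigned long
chs_to_lba(const struct geom *g, unsigned c, unsigned h, unsigned s)
{
    assert(c < g->cyls && h < g->heads && s >= 1 && s <= g->spt);
    return ((unsigned long)c * g->heads + h) * g->spt + (s - 1);
}

/* Does the track at cylinder c, head h belong to the slice? */
int
slice_contains(const struct chs_slice *sl, unsigned c, unsigned h)
{
    return c >= sl->cyl_lo && c <= sl->cyl_hi &&
           h >= sl->head_lo && h <= sl->head_hi;
}

/* Number of sectors the slice covers on this geometry. */
unsigned long
slice_sectors(const struct geom *g, const struct chs_slice *sl)
{
    assert(sl->cyl_lo <= sl->cyl_hi && sl->cyl_hi < g->cyls);
    assert(sl->head_lo <= sl->head_hi && sl->head_hi < g->heads);
    return (unsigned long)(sl->cyl_hi - sl->cyl_lo + 1) *
           (sl->head_hi - sl->head_lo + 1) * g->spt;
}
```

   A conventional "horizontal" slice is the special case where the head
   range spans every head; narrowing the head range leaves the other
   heads on the same cylinders free for a different slice.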

       And, I'll mention again, DISK formats are not the only
   random-access mass-storage formats on the horizon!  I'm guessing
   that for speed of inclusion into product lines, all will emulate
   a disk drive - but that may not be the most efficient way of using
   them (in fact, probably not).  They also can be expected to have
   "direct access" methods according to their physical architecture,
   with some form of tree-access the MOST efficient!

       Finally - one of the most powerful potentials of the devfs is
   handling non-DASD devices!  The connecting or turning-on of a device
   (nic/fax/printer/external-modem/scanner/parallel-to-parallel
   connection to another PC, even industrial controls of some kind) SHOULD
   cause it to "arrive".  If its turn-on generates a signal that can be
   caught by a minimal driver, that may trigger a load of a full driver
   (arrival event) and its inclusion in the devfs listings.  Similarly,
   killing such a device might trigger an immediate or delayed unloading
   of the same driver, and removal from the devfs.
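
   The arrival/departure idea can be sketched with a toy table of
   device names standing in for the devfs listing.  struct dev_table,
   dev_arrive, dev_depart and dev_lookup are hypothetical names; a real
   implementation would load or unload a driver where this sketch only
   edits the table.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 8
#define NAME_LEN  16

/* Hypothetical stand-in for the devfs namespace: a fixed name table. */
struct dev_table {
    char names[MAX_NODES][NAME_LEN];  /* empty string = free slot */
};

/* Index of name in the table, or -1 if absent. */
int
dev_lookup(const struct dev_table *t, const char *name)
{
    int i;

    for (i = 0; i < MAX_NODES; i++)
        if (t->names[i][0] != '\0' && strcmp(t->names[i], name) == 0)
            return i;
    return -1;
}

/* Arrival event: add the node.  -1 if duplicate, full, or bad name. */
int
dev_arrive(struct dev_table *t, const char *name)
{
    int i;

    assert(t != NULL && name != NULL);
    if (name[0] == '\0' || strlen(name) >= NAME_LEN ||
        dev_lookup(t, name) >= 0)
        return -1;
    for (i = 0; i < MAX_NODES; i++) {
        if (t->names[i][0] == '\0') {
            strcpy(t->names[i], name);  /* length checked above */
            return 0;
        }
    }
    return -1;
}

/* Departure event: remove the node.  -1 if it was not present. */
int
dev_depart(struct dev_table *t, const char *name)
{
    int i = dev_lookup(t, name);

    if (i < 0)
        return -1;
    t->names[i][0] = '\0';
    return 0;
}
```

   A minimal driver catching the turn-on signal would call something
   like dev_arrive; a delayed unload would call dev_depart after a
   timeout rather than immediately.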

	Bruce Gingery	<bgingery@gtcs.com>