Date:      Sun, 16 Jul 2000 12:29:27 +0200
From:      Poul-Henning Kamp <phk@critter.freebsd.dk>
To:        Robert Watson <rwatson@FreeBSD.ORG>
Cc:        Warner Losh <imp@village.org>, Kelly Yancey <kbyanc@posi.net>, Julian Elischer <julian@elischer.org>, Dan Nelson <dnelson@emsphone.com>, Adrian Chadd <adrian@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject:   DEVFS, the complete picture (Was: Re: SysctlFS)
Message-ID:  <2365.963743367@critter.freebsd.dk>
In-Reply-To: Your message of "Sun, 16 Jul 2000 04:48:16 EDT." <Pine.NEB.3.96L.1000716044526.27475A-100000@fledge.watson.org> 

OK, now I finally have time to sit down and write an email with the
complete picture about devfs.

For a moment, disregard jails and rootmounts and let us just look at
cloning.

Cloning means that a device driver doesn't have to call make_dev()
on all potential instances up front.

This makes the most difference for pseudo-devices (tun, ppp, slip, pty,
md, bpf and so on), but other "actual" drivers like fd could use it
as well to avoid calling make_dev() for every conceivable format
of floppy disk.

Implementing cloning without devfs would be a gross hack: we would
have to magically notice that /dev was searched and nothing found,
and I think we might just as well forget everything about that idea.

Implementing cloning with devfs is simple:

    Device-drivers can call devfs during their initialization and
    register a "clone()" function with devfs.  (They obviously have
    to deregister it again at detach time).

    When devfs::VOP_LOOKUP() fails to find the name it is being told
    to look for, it will call the registered clone() routines
    one after another with the sought-after name as argument.

    Each driver clone routine examines the name, and if it can
    instantiate a device of that name, it does so with make_dev()
    and returns EEXIST.  If it cannot, it returns 0.  If it
    can determine for good that the name should not exist at this
    time, it returns ENOENT.

    If a clone routine returns EEXIST, devfs::VOP_LOOKUP()
    immediately retries the lookup, and returns the result.

    If a clone routine returns ENOENT, devfs::VOP_LOOKUP() fails
    with ENOENT.

    When a clone routine returns 0, devfs::VOP_LOOKUP() calls the
    next clone routine in turn.

    If, once all clone routines have been called, none of them has
    instantiated a device, devfs::VOP_LOOKUP() returns ENOENT.

    The dev_t's created this way are not special in any way; all normal
    rules and rights apply.  The only thing special about this is
    the "lazy creation" of dev_t's.
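
The lookup protocol above can be sketched in userland C.  This is only
an illustration under stated assumptions, not kernel code: the device
table, make_dev_stub(), clone_register(), devfs_lookup() and the
example tun_clone() driver are all hypothetical names invented here,
and the "instantiated" return value is spelled with the standard
EEXIST from <errno.h>.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_DEVS   32
#define MAX_CLONES 8
#define NAME_LEN   32

/* Hypothetical clone callback: receives the name that lookup failed to find. */
typedef int (*clone_fn)(const char *name);

static char devs[MAX_DEVS][NAME_LEN];   /* stand-in for the devfs name table */
static int ndevs;
static clone_fn clones[MAX_CLONES];     /* registered clone routines */
static int nclones;

/* Stand-in for make_dev(): it only records the name. */
static void make_dev_stub(const char *name)
{
    assert(ndevs < MAX_DEVS && strlen(name) < NAME_LEN);
    strcpy(devs[ndevs++], name);
}

static int dev_exists(const char *name)
{
    for (int i = 0; i < ndevs; i++)
        if (strcmp(devs[i], name) == 0)
            return 1;
    return 0;
}

static void clone_register(clone_fn fn)
{
    assert(nclones < MAX_CLONES);
    clones[nclones++] = fn;
}

/*
 * Example driver: claims every name starting with "tun" and creates
 * tun0..tun3 on demand.  Any other "tun..." name is refused for good.
 */
static int tun_clone(const char *name)
{
    if (strncmp(name, "tun", 3) != 0)
        return 0;                       /* not ours: let the next driver try */
    if (name[3] >= '0' && name[3] <= '3' && name[4] == '\0') {
        make_dev_stub(name);
        return EEXIST;                  /* instantiated: retry the lookup */
    }
    return ENOENT;                      /* ours, but must not exist */
}

/* Returns 0 if the name exists (possibly after cloning), else ENOENT. */
static int devfs_lookup(const char *name)
{
    if (dev_exists(name))
        return 0;
    for (int i = 0; i < nclones; i++) {
        int r = clones[i](name);
        if (r == EEXIST)                /* retry once and report the result */
            return dev_exists(name) ? 0 : ENOENT;
        if (r == ENOENT)                /* a driver vetoed the name */
            return ENOENT;
        /* r == 0: ask the next clone routine */
    }
    return ENOENT;                      /* nobody instantiated anything */
}
```

With tun_clone registered through clone_register(), devfs_lookup("tun0")
creates the entry lazily and returns 0, while devfs_lookup("tun9") and
devfs_lookup("null") both return ENOENT without creating anything.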


Next, let us look at the rootfs:

Today when we boot a FreeBSD system, various magic code finds and 
mounts a root filesystem from which we execute /sbin/init (and the
rest becomes history).

Part of this h0h0magic is to take a device name and come up
with a vnode from which we can mount it, despite the fact that we
have no filesystems mounted which can instantiate that vnode.
Rather hackish, all in all.

Other magic code will do similar gyrations to mount an NFS root
filesystem.

This obviously is a chicken-and-egg issue, and there is probably
no solution which is universally acceptable.  My personal preference
is somewhat in the direction of what AIX has done, but with some
slight modifications:

    Kernel initializes, probes devices and all that.

    Kernel mounts a devfs instance on /

    Kernel mounts a preloaded (or compiled in) md(4) instance
    in /bootfs

    Kernel executes /bootfs/init

    /bootfs/init examines the environment to find the kind of desired
    root filesystem.

        nfs: /bootfs/init will initialize a network interface (using
             DHCP for instance) and union mount (not unionfs!) the root
             filesystem on /

        ufs: /bootfs/init will execute "/bootfs/fsck -p $device", and
             afterwards unionmount (still not unionfs!) the device on /

        others: Whatever is needed.

    After mounting the desired root filesystem, /bootfs/init does an
    execl("/sbin/init", "/sbin/init", (char *)NULL); so that the
    "real" init(8) is started as pid==1 as required.
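
As a rough sketch of the decision /bootfs/init would make, the C
fragment below turns a root filesystem kind into a plain-text list of
steps.  It runs nothing; it only prints what the steps above describe.
The names root_kind, parse_root_kind() and plan_boot() are hypothetical,
and the "nfs"/"ufs" strings are assumed values since the message does
not say how the environment encodes the choice.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of the desired root filesystem. */
enum root_kind { ROOT_NFS, ROOT_UFS, ROOT_OTHER };

/* Map an (assumed) environment string to a root kind. */
static enum root_kind parse_root_kind(const char *s)
{
    if (s != NULL && strcmp(s, "nfs") == 0)
        return ROOT_NFS;
    if (s != NULL && strcmp(s, "ufs") == 0)
        return ROOT_UFS;
    return ROOT_OTHER;
}

/*
 * Write the planned steps, one per line, into buf.
 * Returns the snprintf() result, or -1 for kinds this sketch
 * does not cover ("others: whatever is needed").
 */
static int plan_boot(enum root_kind kind, const char *device,
                     char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    switch (kind) {
    case ROOT_NFS:
        return snprintf(buf, len,
            "configure a network interface (e.g. DHCP)\n"
            "union mount the NFS root on /\n"
            "execl /sbin/init\n");
    case ROOT_UFS:
        assert(device != NULL);
        return snprintf(buf, len,
            "/bootfs/fsck -p %s\n"
            "union mount %s on /\n"
            "execl /sbin/init\n", device, device);
    default:
        return -1;
    }
}
```

Calling plan_boot(parse_root_kind("ufs"), "/dev/da0s1a", buf, sizeof buf)
fills buf with the fsck, union mount and execl steps for that device;
the device name is only an example.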

I see many advantages to this scheme; the main one is that a lot
of h0h0magic code moves from the kernel into userland.

The /bootfs md(4) instance can be kept around, it will be very small,
but it can also be unmounted and if our VM system is taught how to,
the RAM can be recycled.

This scheme will also take all the pain out of things like raid-5
rootfs:  No more kernel h0h0magic code needed, just add the vinum
program to /bootfs and DTRT.

/bootfs/init could conveniently be a shell script btw.


Finally, jails:

The only reason there could ever be to mount a devfs in a jail
partition is to get access to the cloning facility, mainly for
ptys.  For the /dev/null, /dev/zero etc. cases, a good old-fashioned
mknod(8) will do just fine.  Remember: the main reason for devfs
is to cater for dynamic devices, and the main thing we don't want to
see pop up in jails is dynamic devices.

So the devfs vs jail issue almost entirely boils down to "what do
we do about ptys in jails" and considering that it actually works
now in "the good old way", I frankly can't see much reason to
not just continue that way.  Few jails are pty intensive anyway.


Summary:

1. Forget about jails in the context of devfs, we don't need it.

2. We can argue if we should unionmount the "real root" over a
   devfs, or if we should mount devfs on /dev.  Both arguments
   have some amount of merit: The former is cleaner, the latter
   is more like it used to be.

3. Cloning, while not strictly a must, is highly desirable.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD coreteam member | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

