Date: Sun, 16 Jul 2000 12:29:27 +0200
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
To: Robert Watson <rwatson@FreeBSD.ORG>
Cc: Warner Losh <imp@village.org>, Kelly Yancey <kbyanc@posi.net>, Julian Elischer <julian@elischer.org>, Dan Nelson <dnelson@emsphone.com>, Adrian Chadd <adrian@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: DEVFS, the complete picture (Was: Re: SysctlFS)
Message-ID: <2365.963743367@critter.freebsd.dk>
In-Reply-To: Your message of "Sun, 16 Jul 2000 04:48:16 EDT." <Pine.NEB.3.96L.1000716044526.27475A-100000@fledge.watson.org>
OK, now I finally have time to sit down and write an email with the complete picture about devfs.

For a moment, disregard jails and rootmounts and let us just look at cloning.

Cloning means that a device driver doesn't have to call make_dev() on all potential instances up front. This makes the most difference for pseudo-devices (tun, ppp, slip, pty, md, bpf and so on), but other "actual" drivers like fd could use it as well, to avoid calling make_dev() for every conceivable floppy disk format.

Implementing cloning without devfs would be a gross hack: we would have to magically notice that /dev was searched and nothing found, and I think we might just as well forget everything about that idea.

Implementing cloning with devfs is simple:

Device drivers can call devfs during their initialization and register a "clone()" function with devfs. (They obviously have to deregister it again at detach time.)

When devfs::VOP_LOOKUP() fails to find the name it is being told to look for, it will call all registered clone() routines successively with the sought-after name as argument.

Each driver clone routine examines the name, and if it can instantiate a device of that name, it does so with make_dev() and returns EEXIST. If it cannot, it returns 0. If it can determine for good that the name should not exist at this time, it returns ENOENT.

If a clone routine returns EEXIST, devfs::VOP_LOOKUP() immediately retries the lookup and returns the result.

If a clone routine returns ENOENT, devfs::VOP_LOOKUP() fails with ENOENT.

When a clone routine returns 0, devfs::VOP_LOOKUP() calls the next clone routine in turn.

If, when all clone routines have been called, none of them has instantiated a device, devfs::VOP_LOOKUP() returns ENOENT.

The dev_t's created this way are not special in any way; all normal rules and rights apply. The only thing special about this is the "lazy creation" of dev_t's.
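The clone protocol above can be sketched in userland C as a toy model. This is only an illustration under stated assumptions, not devfs code: toy_make_dev(), toy_find(), toy_clone_register(), toy_lookup() and the tun_clone() handler are all hypothetical names, the "namespace" is a plain array of strings, and EEXIST is used as the errno constant for "instantiated, retry the lookup".

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES  32
#define MAX_CLONES 8
#define NAME_LEN   16

/* Hypothetical handler type: 0 = not mine, EEXIST = created, ENOENT = veto. */
typedef int (*clone_fn)(const char *name);

char     nodes[MAX_NODES][NAME_LEN];  /* stand-in for the device namespace */
size_t   nnodes;
clone_fn clones[MAX_CLONES];          /* registered clone routines */
size_t   nclones;

/* Stand-in for make_dev(): just record the name. */
void toy_make_dev(const char *name)
{
    assert(nnodes < MAX_NODES);
    snprintf(nodes[nnodes++], NAME_LEN, "%s", name);
}

/* Returns 1 if the name already exists, 0 otherwise. */
int toy_find(const char *name)
{
    for (size_t i = 0; i < nnodes; i++)
        if (strcmp(nodes[i], name) == 0)
            return 1;
    return 0;
}

/* What a driver would call at initialization time. */
void toy_clone_register(clone_fn fn)
{
    assert(nclones < MAX_CLONES);
    clones[nclones++] = fn;
}

/* Example driver handler: creates tun0..tun3 on demand, vetoes other tunN. */
int tun_clone(const char *name)
{
    if (strncmp(name, "tun", 3) != 0 || name[3] < '0' || name[3] > '9')
        return 0;                     /* not a tun unit: let others try */
    if (name[3] > '3' || name[4] != '\0')
        return ENOENT;                /* a tun unit we refuse to create */
    toy_make_dev(name);
    return EEXIST;                    /* created: caller should retry */
}

/* Models the lookup path: 0 on success, ENOENT on failure. */
int toy_lookup(const char *name)
{
    if (toy_find(name))
        return 0;
    for (size_t i = 0; i < nclones; i++) {
        int r = clones[i](name);
        if (r == EEXIST)              /* retry the lookup, return the result */
            return toy_find(name) ? 0 : ENOENT;
        if (r == ENOENT)              /* definitive "no": stop asking */
            return ENOENT;
        /* r == 0: fall through to the next clone routine */
    }
    return ENOENT;                    /* nobody instantiated the name */
}
```

After toy_clone_register(tun_clone), a toy_lookup("tun1") creates the node lazily and succeeds, while toy_lookup("tun9") is vetoed by the handler and toy_lookup("null") fails because no handler claims it.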
Next, let us look at the rootfs:

Today when we boot a FreeBSD system, various magic code finds and mounts a root filesystem from which we execute /sbin/init (and the rest becomes history).

A part of this h0h0magic is to take a device name and come up with a vnode from which we can mount it, despite the fact that we have no filesystems mounted which can instantiate that vnode. Rather hackish, all in all.

Other magic code will do similar gyrations to mount an NFS root filesystem.

This obviously is a chicken-and-egg issue, and there is probably no solution which is universally acceptable.

My personal preference is somewhat in the direction of what AIX has done, but with some slight modifications:

    Kernel initializes, probes devices and all that.

    Kernel mounts a devfs instance on /

    Kernel mounts a preloaded (or compiled in) md(4) instance in /bootfs

    Kernel executes /bootfs/init

    /bootfs/init examines the environment to find the kind of desired root filesystem.

        nfs: /bootfs/init will initialize a network interface (using DHCP for instance) and union mount (not unionfs!) the root filesystem on /

        ufs: /bootfs/init will execute "/bootfs/fsck -p $device", and afterwards union mount (still not unionfs!) the device on /

        others: Whatever is needed.

    After mounting the desired root filesystem, /bootfs/init does an execl("/sbin/init", "/sbin/init", (char *)0); so that the "real" init(8) is started as pid==1 as required.

I see many advantages to this scheme; the main one is that a lot of h0h0magic code moves from the kernel into userland.

The /bootfs md(4) instance can be kept around, since it will be very small, but it can also be unmounted, and if our VM system is taught how to, the RAM can be recycled.

This scheme will also take all the pain out of things like raid-5 rootfs: no more kernel h0h0magic code needed, just add the vinum program to /bootfs and DTRT.

/bootfs/init could conveniently be a shell script btw.
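To make the /bootfs/init decision flow concrete, here is a dry-run sketch in C that only builds the list of steps for a given root type and does not mount or execute anything. Everything in it is an assumption for illustration: the names parse_root_kind() and plan_root(), the step strings, and the idea that the root type arrives as a short string such as "nfs" or "ufs" are not part of the proposal itself.

```c
#include <stdio.h>
#include <string.h>

#define STEP_LEN 96

enum root_kind { ROOT_NFS, ROOT_UFS, ROOT_OTHER };

/* Map a (hypothetical) root-type string to a root_kind. */
enum root_kind parse_root_kind(const char *s)
{
    if (s != NULL && strcmp(s, "nfs") == 0)
        return ROOT_NFS;
    if (s != NULL && strcmp(s, "ufs") == 0)
        return ROOT_UFS;
    return ROOT_OTHER;
}

/*
 * Fill steps[] with the ordered actions /bootfs/init would take.
 * Returns the number of steps written, or -1 if there is no room.
 * This is a dry run: nothing is mounted or executed.
 */
int plan_root(enum root_kind kind, const char *device,
              char steps[][STEP_LEN], size_t max)
{
    size_t n = 0;

    if (max < 3)
        return -1;
    switch (kind) {
    case ROOT_NFS:
        snprintf(steps[n++], STEP_LEN, "configure network interface (e.g. DHCP)");
        snprintf(steps[n++], STEP_LEN, "union mount NFS root on /");
        break;
    case ROOT_UFS:
        snprintf(steps[n++], STEP_LEN, "/bootfs/fsck -p %s", device);
        snprintf(steps[n++], STEP_LEN, "union mount %s on /", device);
        break;
    default:
        snprintf(steps[n++], STEP_LEN, "whatever the root type needs");
        break;
    }
    /* The final step is always the hand-off to the real init(8). */
    snprintf(steps[n++], STEP_LEN, "execl /sbin/init");
    return (int)n;
}
```

Whatever the root type, the plan ends with the hand-off to /sbin/init, which is what keeps the "real" init(8) running as pid 1.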
Finally, jails:

The only reason there could ever be to mount a devfs in a jail partition is to get access to the cloning facility, mainly for ptys.

For the /dev/null, /dev/zero etc. cases, a good old-fashioned mknod(8) will do just fine.

Remember: the main reason for devfs is to cater for dynamic devices, and the main thing we don't want to see pop up in jails is dynamic devices.

So the devfs vs. jail issue almost entirely boils down to "what do we do about ptys in jails", and considering that it actually works now in "the good old way", I frankly can't see much reason not to just continue that way. Few jails are pty-intensive anyway.

Summary:

1. Forget about jails in the context of devfs, we don't need it.

2. We can argue whether we should union mount the "real root" over a devfs, or whether we should mount devfs on /dev. Both arguments have some amount of merit: the former is cleaner, the latter is more like it used to be.

3. Cloning, while not strictly a must, is highly desirable.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD coreteam member | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message