From owner-freebsd-current Sun Mar 30 13:00:47 1997
From: Terry Lambert
Subject: Re: A new Kernel Module System
To: dfr@nlsystems.com (Doug Rabson)
Date: Sun, 30 Mar 1997 13:45:31 -0700 (MST)
Cc: current@freebsd.org
In-Reply-To: from "Doug Rabson" at Mar 30, 97 10:25:38 am

Well, as the architect of both the original LKM code and the original
SYSINIT() code, I can probably be expected to have some comments and
corrections... so here they are.  Let me know if I need to clarify
anything.

> A new Kernel Module System
>
> A proposal to replace the current LKM system with a new
> implementation allowing both static and dynamically loaded
> modules.
>
>
> 1. The current LKM system
>
> 1.1. Description
>
> The current LKM system only supports dynamically loaded modules.
> Each module is either one of a small number of specially supported
> types or is a `catch all' misc module.  The modules are a.out
> object files which are linked against the kernel's symbol table
> using ld(1).  Each module has a single entry point which is called
> when the module is loaded and unloaded.

Or queried for status.  The single entry point is a multiplexed
entry point (a sketch of such an entry point appears below).  This
design requirement was forced on us by the a.out format, which only
has a single externally accessible member offset which can be set to
a symbol offset at link time:

	unsigned long	a_entry;	/* entry point */

The small number of specially supported types of modules is an
artifact of the kernel interfaces between subsystems being badly
defined in the context of the overall kernel architecture.  The
kernel architecture must be corrected to be more orthogonal before
there can be support for such things as "replacement VM system"
modules, or, less ambitiously, "external pagers", and so on.

One of the problems is that it is not possible to differentiate
these subsystems in a loaded kernel.  Again, this design requirement
was forced on us by our a.out format.

> 1.2. Lifecycle of an LKM
>
> The user initiates a module load (either by mounting a filesystem
> or by explicitly calling modload(8)).

This is a hack brought on because the loader is not in kernel space.
This was a policy decision by the FreeBSD core team.
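
Returning to the multiplexed entry point: for concreteness, a `misc'
module under the lkm(4) interface of this era looks roughly like the
sketch below.  This is reconstructed from memory; the macro and
constant names (MOD_MISC, DISPATCH, lkm_nullcmd) should be checked
against <sys/lkm.h> before being taken as gospel.

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/exec.h>
	#include <sys/lkm.h>

	MOD_MISC("example")		/* declare a `misc' type module */

	static int
	example_load(struct lkm_table *lkmtp, int cmd)
	{
		printf("example module loaded\n");
		return 0;		/* non-zero aborts the load */
	}

	static int
	example_unload(struct lkm_table *lkmtp, int cmd)
	{
		/* returning EBUSY here vetoes the unload */
		return 0;
	}

	/*
	 * The single entry point named by a_entry; `cmd' selects
	 * among LKM_E_LOAD, LKM_E_UNLOAD and LKM_E_STAT.
	 */
	int
	example_mod(struct lkm_table *lkmtp, int cmd, int ver)
	{
		DISPATCH(lkmtp, cmd, ver, example_load, example_unload,
		    lkm_nullcmd);
	}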
> The module is loaded in three stages.  First, memory in the kernel
> address space is allocated for the module.  Second, ld(1) is used
> to link the module's object file to run at the address which has
> been allocated for it.  The kernel's symbol table is used to
> resolve external symbol references from the module.  Lastly, the
> relocated module is loaded into the kernel.

Six stages:

o	link the module at load address 0
o	determine the size of the linked module in pages
o	allocate that many contiguous pages in kernel space
o	link the module at load address <base>, where <base> is the
	base of the page allocation space
o	push the module across the user/kernel boundary (this
	limitation is brought on by the inability of the kernel to
	page from a mapped image space, as with executables, because
	paging of modules would result in faults in kernel, not user,
	space.  Alternately, it could be done by forcing a preload of
	the image, but the VM system supports neither forcing an image
	into core nor keeping unmodified pages there once that has
	been done)
o	invoke the LKM control to call the module entry point

> The first thing the kernel does with the new module is to call its
> entry point to inform it that it has been loaded.  If this call
> returns an error (e.g. because a device probe failed), the module
> is discarded.  For syscalls, filesystems, device drivers and exec
> format handlers, common code in the lkm subsystem handles this
> load event.
>
> When the module is no longer needed, it can be unloaded using
> modunload(8).  The module's entry point is called to inform it of
> the event (again this is handled in common code for most modules)
> and the kernel's memory is reclaimed.

A module which believes it is in use may return EBUSY and veto the
unload.  There is currently no mechanism for delayed unload because
there is not sufficiently advanced kernel counting semaphore/mutex
support.

> 1.3. Limitations
>
> Since the link stage is performed outside the kernel, modules can
> only be loaded after the system is fully initialised (or at least
> not until after filesystems have been mounted).  This makes
> automatic module loading during boot hard or impossible.  Kernel
> initiated module loads (e.g. as a result of detecting a PCI device
> which is supported by a driver in a module) are virtually
> impossible.

Yes.

> Statically loaded drivers initialise themselves using SYSINIT(9)
> along with various tables created by config(8) to add their
> entries to the various device switch tables.  Making a statically
> loaded driver into a loadable module requires extra code to mimic
> this process.  As a result, most drivers cannot be built as
> modules.

Sort of.  I actually introduced SYSINIT() specifically to deal with
the issues involved in a fully dynamic kernel configuration.  The
main problem with SYSINIT() is, once again, one of a.out format.

It is possible to modify the modules to use common startup code, and
to have that code reference the non-aggregated SYSINIT() data
directly.  The main issue here is the linking against the kernel
symbol space, since that link will cause the aggregation of the
SYSINIT() values with those of the existing kernel (the SYSINIT()
implementation is by way of linker sets).

Because the SYSINIT() code uses linker sets, there must be a variant
compilation to prevent the linker set from being aggregated.
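
For reference, this is approximately how SYSINIT() is implemented on
top of linker sets today.  The sketch is paraphrased from the
<sys/kernel.h> of the period; field and macro details may differ
slightly from the real headers.

	/*
	 * A linker set is an array of pointers which ld(1) collects
	 * at link time from every object file contributing a
	 * DATA_SET() entry (in a.out, each entry is emitted as a
	 * `set element' symbol which ld(1) gathers into one array).
	 */
	struct linker_set {
		int	ls_length;	/* number of items */
		caddr_t	ls_items[1];	/* really ls_length of them */
	};

	struct sysinit {
		unsigned int	subsystem;	/* subsystem number */
		unsigned int	order;		/* order within subsystem */
		void		(*func)(void *);/* init function */
		void		*udata;		/* its argument */
	};

	/*
	 * Each SYSINIT() expands to a static struct sysinit plus a
	 * DATA_SET() entry filing a pointer to it in the sysinit_set
	 * linker set; init_main.c sorts and walks the set at boot.
	 */
	#define SYSINIT(uniquifier, subsystem, order, func, ident)	\
		static struct sysinit uniquifier ## _sys_init = {	\
			subsystem, order, func, (void *)ident		\
		};							\
		DATA_SET(sysinit_set, uniquifier ## _sys_init)

The aggregation problem follows directly: ld(1) builds exactly one
such array per link, so a module linked against the kernel gets its
entries aggregated with the kernel's existing set, as described
above.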
Moving the LKM loader into the kernel, and defining an "exported
symbol set" for the kernel which was resolvable by the kernel LKM
loader, would be one way of resolving this problem: a kernel linker
could choose to not aggregate the module linker sets with the
existing kernel linker sets.

Ideally, the SYSINIT() information would not be in linker sets at
all.  To accomplish this would require some form of support for
multiple sections in a single executable, and attribution of these
sections so that former linker set data can be identified as such,
and the section treated the same no matter how the load takes place.

In short, it means moving away from a.out.

> 2. A new module system
>
> 2.1. Features
>
> * Support for both statically and dynamically loaded modules.
>
> * Dynamically loaded modules are relocated and linked by the
>   kernel using a built-in kernel symbol table.
>
> * Statically loaded modules are identical to dynamic modules in
>   every way.  To include a static module in a kernel, the module's
>   object file is simply included in the kernel's link.
>
> * Modules initialise and register themselves with the kernel using
>   SYSINIT(9).

SYSINIT() is an image-global link-time configuration mechanism.  It
must be implemented on top of something other than linker sets for
this to be an achievable goal.  Since this was one of the design
considerations for SYSINIT(), this should be relatively trivial to
do.  Once this is done, SYSINIT() becomes an actual function call
reference, not a linker set reference, and the conditional
compilation issues for static vs. dynamic modules simply go away
(a sketch of what that might look like follows below).

> * All device drivers, filesystems and other subsystems are
>   implemented as modules.

This is the subsystem granularity issue again.  The kernel is not
sufficiently abstracted in its interfaces, at this point, to handle
more than the existing subsystems, and perhaps one or two more, as
modular components.

Part of this problem is that there is no HAL -- "Hardware
Abstraction Layer" -- in the kernel.  Another part is that after a
HAL service set has been defined, it is still necessary to allow a
single hardware subsystem to provide one or more HAL interfaces.

It is no accident that the pcaudio driver is not supported on the
NetBSD "Multia" port, even though the proper hardware to implement
it exists on that particular Alpha machine.  It doesn't exist on
other Alpha machines, and NetBSD operates in an LCD -- "least common
denominator" -- mode on a per-processor-architecture basis.

> * Statically loaded modules are informed when the system shuts
>   down.  System shutdown would appear to a statically loaded
>   module as an unload event.  Various drivers use at_shutdown(9)
>   to tidy up device state before rebooting.  This process can
>   happen from the module's unload handler.

Yes.  One issue, however, is enforcing inverse dependency order on
the unload event dispatching.  Module dependencies are not always
defined in terms of symbol space relationships between modules.
Among other issues is a three-module stack, or HAL services provided
by a module.

For example, it's likely that one of the default kernel services
will be a kernel "printf" of some kind.  But a module calling the
kernel "printf" will not have an explicit dependency on the module
which implements the console for the kernel.

For modules which the kernel consumes to provide services to other
modules as if they were kernel services, or for which the kernel
"wraps" the services of the consumed module, the dependency must be
implicit.  This becomes even more difficult when there are cyclic
dependencies (and thus the dependencies cannot be represented simply
as a directed acyclic graph).

One fix for this would be to define two "zones" of services that a
module can provide: services encapsulated by the kernel, and
services aggregated by the kernel.  Clearly, unload order would
remove the aggregated services before removing the encapsulated
services.  Just as the current SYSINIT() code is used in a set order
in init_main.c, so shutdown would have to occur in a set inverse
order in the encapsulated services modules.
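
Returning to the point above about SYSINIT() becoming a function
call reference: a hypothetical sketch of what that might look like.
sysinit_register() and the list it maintains are invented for
illustration and do not exist in the kernel.

	/*
	 * Hypothetical function-call flavour of SYSINIT(): instead
	 * of filing the record in a linker set at link time, the
	 * module's startup code registers it at load time, and the
	 * unload path removes it again.
	 */
	struct sysinit {
		unsigned int	subsystem;	/* SI_SUB_* equivalent */
		unsigned int	order;		/* SI_ORDER_* equivalent */
		void		(*func)(void *);
		void		*udata;
		struct sysinit	*next;		/* list linkage */
	};

	/* Kept sorted by (subsystem, order). */
	static struct sysinit *sysinit_list;

	void
	sysinit_register(struct sysinit *sip)
	{
		struct sysinit **spp;

		for (spp = &sysinit_list; *spp != NULL;
		    spp = &(*spp)->next)
			if ((*spp)->subsystem > sip->subsystem ||
			    ((*spp)->subsystem == sip->subsystem &&
			    (*spp)->order > sip->order))
				break;
		sip->next = *spp;
		*spp = sip;
		/*
		 * For a post-boot (dynamic) load, the init would be
		 * run immediately rather than waiting for the boot
		 * walk in init_main.c.
		 */
	}

Identical code then serves static modules (registered during the
boot walk) and dynamic modules (registered at load time), which is
exactly the conditional compilation issue going away.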
> * A desirable feature would be to support dependencies between
>   modules.  Each module would define a symbol table.  If a module
>   depends upon another, the dependent module's symbol table is
>   used to resolve undefined symbols.

I believe this is a baseline requirement.  It is simply a matter of
having per-module symbol zones, and establishing a reference count
per module based on the symbols in its zone being consumed.  When
the consuming module goes away, the count is decremented.  This
implies a module dependency list per module (a sketch appears at the
end of this section).

> 2.2. Kernel configuration
>
> Statically loaded modules are specified by a kernel configuration
> file, either implicitly by a controller, disk, tape or device
> keyword or explicitly with a new module keyword.

Ideally, you would specify only modules.  This implies that you
remove the data differences between conditionally compiled code (the
only conditional that has been identified as truly being necessary
at compile time is "DEBUG").

The main issue here is promiscuous knowledge of structure layout by
code which iterates the structures.  For example, the proc
structure: if the proc structure is permitted to change size for
debugging purposes, the only code that should be affected is the
code that uses the additional fields, and the code which iterates
the structures and therefore must know the structure size.

It is only because the iteration of the proc structure is frequently
(and incorrectly) done by code outside the scope of the compilation
directive that the structures cannot simply be made to overlay as if
they were opaque pointers (i.e. the debug data would be hidden).

This is an issue of interface abstraction, both internal to the
kernel (where commercial entities who provide binary kernel modules
would be well served by not having dependencies on the size of
structures exported by a kernel service -- one or more HAL
interfaces), and where kernel interfaces are exported as data
abstractions instead of functional abstractions ('w', 'ps', 'route',
'ifconfig', etc.).

An audit of the use of sizeof in all kernel code, and in all code
which uses the kvm header/library, would be a Good Idea in general.

> 2.3. Devices
>
> Several types of device exist.  Currently devices are configured
> into a kernel using various tables built by config(8) and ld(1).
> To make it easier to add devices and drivers to a running kernel,
> I suggest that all drivers use SYSINIT(9) to register themselves
> with the system.

This should be coordinated with call-based interfacing for devfs;
the devfs abstraction should take precedence, such that the values
of device major numbers lose importance in the ability to reference
the devices.  This may mean a change in syntax, and this should be
kept in mind.
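
To make the per-module symbol zones and reference counts mentioned
above concrete, here is a minimal hypothetical sketch.  None of
these structures or functions exist in the kernel; all names are
invented, and the fixed-size dependency array is only for brevity.

	#include <string.h>

	/* One exported symbol in a module's symbol zone. */
	struct mod_symbol {
		const char	*name;
		void		*value;
	};

	struct module {
		const char		*name;
		struct mod_symbol	*symbols;  /* this module's zone */
		int			nsymbols;
		int			refcount;  /* consumers of our symbols */
		struct module		*deps[8];  /* modules we consume from */
		int			ndeps;
	};

	/*
	 * Resolve `sym' for `consumer' out of `provider's zone,
	 * recording the dependency and bumping the provider's
	 * reference count.
	 */
	void *
	module_resolve(struct module *consumer, struct module *provider,
	    const char *sym)
	{
		int i;

		for (i = 0; i < provider->nsymbols; i++) {
			if (strcmp(provider->symbols[i].name, sym) == 0) {
				provider->refcount++;
				consumer->deps[consumer->ndeps++] = provider;
				return (provider->symbols[i].value);
			}
		}
		return (NULL);		/* unresolved */
	}

	/* A module may unload only when nothing consumes its symbols. */
	int
	module_unload_ok(struct module *mp)
	{
		return (mp->refcount == 0);
	}

	/* On unload, release every provider this module consumed. */
	void
	module_release_deps(struct module *mp)
	{
		int i;

		for (i = 0; i < mp->ndeps; i++)
			mp->deps[i]->refcount--;
		mp->ndeps = 0;
	}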
> 2.3.1. ISA devices
>
> Currently ISA devices are included in a kernel by using config(8)
> to generate a list of device instances in ioconf.c which reference
> drivers statically compiled into the kernel.  Few drivers support
> dynamic loading and those that do have hardcoded device instances
> built into the LKM (see sys/i386/isa/joy.c for an example).
>
> ISA drivers will register themselves by name using SYSINIT(9).
> This would happen either at boot time for static drivers or at
> module load time for dynamic drivers.

Within the bounds of whether SYSINIT() is a data or a functional
interface.  It needs to be functional for this to work, and it is
currently data.

> Device instances (struct isa_device) will refer to their driver by
> name rather than by pointer.  The name-to-driver mapping is
> performed and the device is probed and attached as normal.

I'm not clear on why this abstraction is necessary or useful???
(A sketch of what such name-based registration might look like,
assuming a functional SYSINIT(), follows at the end of this
section.)

> Statically configured devices are placed in a table by config(8)
> and modules containing their drivers are added to the kernel
> Makefile.

I would prefer that modules be built as separate single object files
(potentially aggregating multiple object files into one using
"ld -r").  A configuration is then simply a list of modules.

I'm not sure if I like the idea of keeping a "config" around as
anything other than a set of linker directives (in the a.out case),
or, as a vastly preferable alternative, as input to an ELF section
librarian for an aggregate kernel image.

This second option would still leave us configurable at the binary
level, while leaving the future direction open for fallback
(firmware-based) drivers and not yet requiring that the fallback
drivers exist in order to get through the boot stage (if the image
is in a single file, then the single file is generally accessible
without fallback driver support).

> When an ISA device is configured dynamically, first the module
> which contains its driver is loaded if not already present and
> secondly a system call is used to create a new device instance and
> to call the driver to probe and attach the new device.  It is
> probably worth writing a new utility, isaconf(8), which can add
> new ISA device instances to the kernel.

Careful, this is rocky terrain.  ISA devices (non-PnP ones) have a
nasty habit of having probe order dependencies.  In addition, it
might be useful to separate the probe sequence from the device
instance.

There is still no useful VM mechanism for dealing with object
persistence and/or kernel paging of non-paging-critical code and
data within the kernel itself.  One problem here is that the
distinction between "high persistence" and "low persistence" VM
objects is made *ONLY* at the kernel/user separation, with all
kernel objects considered to be high persistence.

With a load mechanism in place, and in common rather than rare use
(unlike the current statically loaded kernel), these "medium
persistence" objects could become a serious issue regarding
fragmentation of the kernel VM space.  Some of the recently
discussed techniques for recovering contiguous memory spaces for
drivers that need them, late in the kernel lifetime, would probably
work, but most of these techniques are very high overhead.

What is needed is (1) kernel paging support and (2) policy
attribution of modular components so that the paging policy can be
modified based on the object persistence.  Obviously, probe code is
typically run only once (exceptions: PCMCIA, laptop pluggable
devices, etc.) and is never needed again.  This gets into issues of
section coloring and ELF section/segment support before it gets any
cleaner.  So a bit of caution is highly recommended.

> A desirable feature for a new module system would be to allow
> drivers to `detach' themselves from device instances, allowing a
> dynamically loaded driver to be unloaded cleanly.

This goes for shutdown of UART FIFOs, for instance, which will not
be correctly reset by most BIOSes.
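
As promised above, a hypothetical sketch of name-based ISA driver
registration, assuming a functional SYSINIT().  The registry
(isa_driver_register(), isa_driver_lookup()) and the trimmed-down
structures are invented for illustration and are not existing kernel
interfaces.

	#include <string.h>

	/* A device instance names its driver instead of pointing
	 * at it. */
	struct isa_device {
		const char	*driver_name;	/* e.g. "ed", "sio" */
		int		iobase;
		int		irq;
	};

	/* Invented registry mapping driver names to entry points. */
	struct isa_driver {
		const char	*name;
		int		(*probe)(struct isa_device *);
		int		(*attach)(struct isa_device *);
		struct isa_driver *next;
	};

	static struct isa_driver *isa_drivers;

	void
	isa_driver_register(struct isa_driver *dp)
	{
		dp->next = isa_drivers;
		isa_drivers = dp;
	}

	/* The name-to-driver mapping, performed at probe/attach. */
	struct isa_driver *
	isa_driver_lookup(const char *name)
	{
		struct isa_driver *dp;

		for (dp = isa_drivers; dp != NULL; dp = dp->next)
			if (strcmp(dp->name, name) == 0)
				return (dp);
		return (NULL);
	}

	/* Stub probe/attach for illustration. */
	static int ed_probe(struct isa_device *idp)  { return (0); }
	static int ed_attach(struct isa_device *idp) { return (0); }

	static struct isa_driver ed_driver = { "ed", ed_probe, ed_attach };

	static void
	ed_init(void *unused)
	{
		isa_driver_register(&ed_driver);
	}

	/*
	 * With a functional SYSINIT(), this is the registration
	 * hook, run at boot for a static module or at load time
	 * for a dynamic one:
	 *
	 *	SYSINIT(ed, SI_SUB_DRIVERS, SI_ORDER_ANY, ed_init, NULL);
	 */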
> If a driver is unloaded, it releases any resources such as
> interrupts allocated for devices attached to it.  These devices
> become unassigned, as if they were not successfully probed.  This
> allows driver developers to repeatedly load and unload modules
> without rebooting.

The only issues are ones of eventual VM space fragmentation, as
subsequent versions of drivers change size... this is something we
will eventually have to address, but it is livable to force the
developer to reboot (IMO) at present.

> Supporting static as well as dynamic modules makes the single
> module per object file paradigm of the existing LKM system
> difficult to maintain.  A better approach is to separate the idea
> of a kernel module (a single kernel subsystem) from the idea of a
> kernel object file.  The boot kernel should be thought of as
> simply a kernel object file which contains the modules that were
> configured statically.  Dependencies between modules are also
> better treated as dependencies between object files (since they
> are typically linking dependencies).

Is this ELF advocacy, or something else?  The object module per LKM
is still a valid approach (ld -r).  Perhaps you are considering a
module that is set up as two distinct (and reusable) components?
If so, I would argue that allowing dependencies and breaking it into
two modules accomplishes much the same thing.

> The new system will use a kernel linker which can load object
> files into the kernel address space.  After loading, sysinits from
> the new object file are run, allowing any modules contained
> therein to register themselves.  The linker will keep track of
> which modules are contained in which object so that when a user
> unloads the object, the modules can be informed of the event.

Other than name, there is no difference between this and the
"_entry" mechanism for identifying entry points, IMO.  The big issue
is, again, the distinction between data based (linker set) and
function call based SYSINIT() mechanisms.

					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.