From owner-freebsd-current Sun Mar 30 13:00:47 1997
From: Terry Lambert
Subject: Re: A new Kernel Module System
To: dfr@nlsystems.com (Doug Rabson)
Date: Sun, 30 Mar 1997 13:45:31 -0700 (MST)
Cc: current@freebsd.org
In-Reply-To: from "Doug Rabson" at Mar 30, 97 10:25:38 am

Well, as the architect of both the original LKM code and the original
SYSINIT() code, I can probably be expected to have some comments and
corrections... so here they are.  Let me know if I need to clarify
anything.

> A new Kernel Module System
>
> A proposal to replace the current LKM system with a new
> implementation allowing both static and dynamically loaded
> modules.
>
>
> 1. The current LKM system
>
> 1.1. Description
>
> The current LKM system only supports dynamically loaded modules.
> Each module is either one of a small number of specially supported
> types or is a `catch all' misc module.  The modules are a.out
> object files which are linked against the kernel's symbol table
> using ld(1).  Each module has a single entry point which is called
> when the module is loaded and unloaded.

Or queried for status.  The single entry point is a multiplexed
entry point (a sketch of such an entry point appears below).  This
design requirement was forced on us by the a.out format, which only
has a single externally accessible member offset which can be set to
a symbol offset at link time:

	unsigned long	a_entry;	/* entry point */

The small number of specially supported types of modules is an
artifact of the kernel interfaces between subsystems being badly
defined in the context of the overall kernel architecture.  The
kernel architecture must be corrected to be more orthogonal before
there can be support for such things as "replacement VM system"
modules, or, less ambitiously, "external pagers", and so on.

One of the problems is that it is not possible to differentiate
these subsystems in a loaded kernel.  Again, this design requirement
was forced on us by our a.out format.

> 1.2. Lifecycle of an LKM
>
> The user initiates a module load (either by mounting a filesystem
> or by explicitly calling modload(8)).

This is a hack brought on because the loader is not in kernel space.
This was a policy decision by the FreeBSD core team.
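
Returning to the multiplexed entry point: for concreteness, a `misc'
module under the lkm(4) interface of this era looks roughly like the
sketch below.  This is reconstructed from memory; the macro and
constant names (MOD_MISC, DISPATCH, lkm_nullcmd) should be checked
against <sys/lkm.h> before being taken as gospel.

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/exec.h>
	#include <sys/lkm.h>

	MOD_MISC("example")		/* declare a `misc' type module */

	static int
	example_load(struct lkm_table *lkmtp, int cmd)
	{
		printf("example module loaded\n");
		return 0;		/* non-zero aborts the load */
	}

	static int
	example_unload(struct lkm_table *lkmtp, int cmd)
	{
		/* returning EBUSY here vetoes the unload */
		return 0;
	}

	/*
	 * The single entry point named by a_entry; `cmd' selects
	 * among LKM_E_LOAD, LKM_E_UNLOAD and LKM_E_STAT.
	 */
	int
	example_mod(struct lkm_table *lkmtp, int cmd, int ver)
	{
		DISPATCH(lkmtp, cmd, ver, example_load, example_unload,
		    lkm_nullcmd);
	}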
> The module is loaded in three stages.  First, memory in the kernel
> address space is allocated for the module.  Second, ld(1) is used
> to link the module's object file to run at the address which has
> been allocated for it.  The kernel's symbol table is used to
> resolve external symbol references from the module.  Lastly, the
> relocated module is loaded into the kernel.

Six stages:

o	link the module at load address 0
o	determine the size of the linked module in pages
o	allocate that many contiguous pages in kernel space
o	link the module at load address <base>, where <base> is the
	base of the page allocation space
o	push the module across the user/kernel boundary (this
	limitation is brought on by the inability of the kernel to
	page from a mapped image space, as with executables, because
	paging of modules would result in faults in kernel, not user,
	space.  Alternately, it could be done by forcing a preload of
	the image, but the VM system supports neither forcing an image
	into core nor keeping unmodified pages there once that has
	been done)
o	invoke the LKM control to call the module entry point

> The first thing the kernel does with the new module is to call its
> entry point to inform it that it has been loaded.  If this call
> returns an error (e.g. because a device probe failed), the module
> is discarded.  For syscalls, filesystems, device drivers and exec
> format handlers, common code in the lkm subsystem handles this
> load event.
>
> When the module is no longer needed, it can be unloaded using
> modunload(8).  The module's entry point is called to inform it of
> the event (again this is handled in common code for most modules)
> and the kernel's memory is reclaimed.

A module which believes it is in use may return EBUSY and veto the
unload.  There is currently no mechanism for delayed unload because
there is not sufficiently advanced kernel counting semaphore/mutex
support.

> 1.3. Limitations
>
> Since the link stage is performed outside the kernel, modules can
> only be loaded after the system is fully initialised (or at least
> not until after filesystems have been mounted).  This makes
> automatic module loading during boot hard or impossible.  Kernel
> initiated module loads (e.g. as a result of detecting a PCI device
> which is supported by a driver in a module) are virtually
> impossible.

Yes.

> Statically loaded drivers initialise themselves using SYSINIT(9)
> along with various tables created by config(8) to add their
> entries to the various device switch tables.  Making a statically
> loaded driver into a loadable module requires extra code to mimic
> this process.  As a result, most drivers cannot be built as
> modules.

Sort of.  I actually introduced SYSINIT() specifically to deal with
the issues involved in a fully dynamic kernel configuration.  The
main problem with SYSINIT() is, once again, one of a.out format.

It is possible to modify the modules to use common startup code, and
to have that code reference the non-aggregated SYSINIT() data
directly.  The main issue here is the linking against the kernel
symbol space, since that link will cause the aggregation of the
SYSINIT() values with those of the existing kernel (the SYSINIT()
implementation is by way of linker sets).

Because the SYSINIT() code uses linker sets, there must be a variant
compilation to prevent the linker set from being aggregated.
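
For reference, this is approximately how SYSINIT() is implemented on
top of linker sets today.  The sketch is paraphrased from the
<sys/kernel.h> of the period; field and macro details may differ
slightly from the real headers.

	/*
	 * A linker set is an array of pointers which ld(1) collects
	 * at link time from every object file contributing a
	 * DATA_SET() entry (in a.out, each entry is emitted as a
	 * `set element' symbol which ld(1) gathers into one array).
	 */
	struct linker_set {
		int	ls_length;	/* number of items */
		caddr_t	ls_items[1];	/* really ls_length of them */
	};

	struct sysinit {
		unsigned int	subsystem;	/* subsystem number */
		unsigned int	order;		/* order within subsystem */
		void		(*func)(void *);/* init function */
		void		*udata;		/* its argument */
	};

	/*
	 * Each SYSINIT() expands to a static struct sysinit plus a
	 * DATA_SET() entry filing a pointer to it in the sysinit_set
	 * linker set; init_main.c sorts and walks the set at boot.
	 */
	#define SYSINIT(uniquifier, subsystem, order, func, ident)	\
		static struct sysinit uniquifier ## _sys_init = {	\
			subsystem, order, func, (void *)ident		\
		};							\
		DATA_SET(sysinit_set, uniquifier ## _sys_init)

The aggregation problem follows directly: ld(1) builds exactly one
such array per link, so a module linked against the kernel gets its
entries aggregated with the kernel's existing set, as described
above.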
Moving the LKM loader into the kernel, and defining an "exported
symbol set" for the kernel which was resolvable by the kernel LKM
loader, would be one way of resolving this problem: a kernel linker
could choose to not aggregate the module linker sets with the
existing kernel linker sets.

Ideally, the SYSINIT() information would not be in linker sets at
all.  To accomplish this would require some form of support for
multiple sections in a single executable, and attribution of these
sections so that former linker set data can be identified as such,
and the section treated the same no matter how the load takes place.

In short, it means moving away from a.out.

> 2. A new module system
>
> 2.1. Features
>
> * Support for both statically and dynamically loaded modules.
>
> * Dynamically loaded modules are relocated and linked by the
>   kernel using a built-in kernel symbol table.
>
> * Statically loaded modules are identical to dynamic modules in
>   every way.  To include a static module in a kernel, the module's
>   object file is simply included in the kernel's link.
>
> * Modules initialise and register themselves with the kernel using
>   SYSINIT(9).

SYSINIT() is an image-global link-time configuration mechanism.  It
must be implemented on top of something other than linker sets for
this to be an achievable goal.  Since this was one of the design
considerations for SYSINIT(), this should be relatively trivial to
do.  Once this is done, SYSINIT() becomes an actual function call
reference, not a linker set reference, and the conditional
compilation issues for static vs. dynamic modules simply go away
(a sketch of what that might look like follows below).

> * All device drivers, filesystems and other subsystems are
>   implemented as modules.

This is the subsystem granularity issue again.  The kernel is not
sufficiently abstracted in its interfaces, at this point, to handle
more than the existing subsystems, and perhaps one or two more, as
modular components.

Part of this problem is that there is no HAL -- "Hardware
Abstraction Layer" -- in the kernel.  Another part is that after a
HAL service set has been defined, it is still necessary to allow a
single hardware subsystem to provide one or more HAL interfaces.

It is no accident that the pcaudio driver is not supported on the
NetBSD "Multia" port, even though the proper hardware to implement
it exists on that particular Alpha machine.  It doesn't exist on
other Alpha machines, and NetBSD operates in an LCD -- "least common
denominator" -- mode on a per-processor-architecture basis.

> * Statically loaded modules are informed when the system shuts
>   down.  System shutdown would appear to a statically loaded
>   module as an unload event.  Various drivers use at_shutdown(9)
>   to tidy up device state before rebooting.  This process can
>   happen from the module's unload handler.

Yes.  One issue, however, is enforcing inverse dependency order on
the unload event dispatching.  Module dependencies are not always
defined in terms of symbol space relationships between modules.
Among other issues is a three-module stack, or HAL services provided
by a module.

For example, it's likely that one of the default kernel services
will be a kernel "printf" of some kind.  But a module calling the
kernel "printf" will not have an explicit dependency on the module
which implements the console for the kernel.

For modules which the kernel consumes to provide services to other
modules as if they were kernel services, or for which the kernel
"wraps" the services of the consumed module, the dependency must be
implicit.  This becomes even more difficult when there are cyclic
dependencies (and thus the dependencies cannot be represented simply
as a directed acyclic graph).

One fix for this would be to define two "zones" of services that a
module can provide: services encapsulated by the kernel, and
services aggregated by the kernel.  Clearly, unload order would
remove the aggregated services before removing the encapsulated
services.  Just as the current SYSINIT() code is used in a set order
in init_main.c, so shutdown would have to occur in a set inverse
order in the encapsulated services modules.
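
Returning to the point above about SYSINIT() becoming a function
call reference: a hypothetical sketch of what that might look like.
sysinit_register() and the list it maintains are invented for
illustration and do not exist in the kernel.

	/*
	 * Hypothetical function-call flavour of SYSINIT(): instead
	 * of filing the record in a linker set at link time, the
	 * module's startup code registers it at load time, and the
	 * unload path removes it again.
	 */
	struct sysinit {
		unsigned int	subsystem;	/* SI_SUB_* equivalent */
		unsigned int	order;		/* SI_ORDER_* equivalent */
		void		(*func)(void *);
		void		*udata;
		struct sysinit	*next;		/* list linkage */
	};

	/* Kept sorted by (subsystem, order). */
	static struct sysinit *sysinit_list;

	void
	sysinit_register(struct sysinit *sip)
	{
		struct sysinit **spp;

		for (spp = &sysinit_list; *spp != NULL;
		    spp = &(*spp)->next)
			if ((*spp)->subsystem > sip->subsystem ||
			    ((*spp)->subsystem == sip->subsystem &&
			    (*spp)->order > sip->order))
				break;
		sip->next = *spp;
		*spp = sip;
		/*
		 * For a post-boot (dynamic) load, the init would be
		 * run immediately rather than waiting for the boot
		 * walk in init_main.c.
		 */
	}

Identical code then serves static modules (registered during the
boot walk) and dynamic modules (registered at load time), which is
exactly the conditional compilation issue going away.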
> * A desirable feature would be to support dependencies between
>   modules.  Each module would define a symbol table.  If a module
>   depends upon another, the dependent module's symbol table is
>   used to resolve undefined symbols.

I believe this is a baseline requirement.  It is simply a matter of
having per-module symbol zones, and establishing a reference count
per module based on the symbols in its zone being consumed.  When
the consuming module goes away, the count is decremented.  This
implies a module dependency list per module (a sketch appears at the
end of this section).

> 2.2. Kernel configuration
>
> Statically loaded modules are specified by a kernel configuration
> file, either implicitly by a controller, disk, tape or device
> keyword or explicitly with a new module keyword.

Ideally, you would specify only modules.  This implies that you
remove the data differences between conditionally compiled code (the
only conditional that has been identified as truly being necessary
at compile time is "DEBUG").

The main issue here is promiscuous knowledge of structure layout by
code which iterates the structures.  For example, the proc
structure: if the proc structure is permitted to change size for
debugging purposes, the only code that should be affected is the
code that uses the additional fields, and the code which iterates
the structures and therefore must know the structure size.

It is only because the iteration of the proc structure is frequently
(and incorrectly) done by code outside the scope of the compilation
directive that the structures cannot simply be made to overlay as if
they were opaque pointers (i.e. the debug data would be hidden).

This is an issue of interface abstraction, both internal to the
kernel (where commercial entities who provide binary kernel modules
would be well served by not having dependencies on the size of
structures exported by a kernel service -- one or more HAL
interfaces), and where kernel interfaces are exported as data
abstractions instead of functional abstractions ('w', 'ps', 'route',
'ifconfig', etc.).

An audit of the use of sizeof in all kernel code, and in all code
which uses the kvm header/library, would be a Good Idea in general.

> 2.3. Devices
>
> Several types of device exist.  Currently devices are configured
> into a kernel using various tables built by config(8) and ld(1).
> To make it easier to add devices and drivers to a running kernel,
> I suggest that all drivers use SYSINIT(9) to register themselves
> with the system.

This should be coordinated with call-based interfacing for devfs;
the devfs abstraction should take precedence, such that the values
of device major numbers lose importance in the ability to reference
the devices.  This may mean a change in syntax, and this should be
kept in mind.
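
To make the per-module symbol zones and reference counts mentioned
above concrete, here is a minimal hypothetical sketch.  None of
these structures or functions exist in the kernel; all names are
invented, and the fixed-size dependency array is only for brevity.

	#include <string.h>

	/* One exported symbol in a module's symbol zone. */
	struct mod_symbol {
		const char	*name;
		void		*value;
	};

	struct module {
		const char		*name;
		struct mod_symbol	*symbols;  /* this module's zone */
		int			nsymbols;
		int			refcount;  /* consumers of our symbols */
		struct module		*deps[8];  /* modules we consume from */
		int			ndeps;
	};

	/*
	 * Resolve `sym' for `consumer' out of `provider's zone,
	 * recording the dependency and bumping the provider's
	 * reference count.
	 */
	void *
	module_resolve(struct module *consumer, struct module *provider,
	    const char *sym)
	{
		int i;

		for (i = 0; i < provider->nsymbols; i++) {
			if (strcmp(provider->symbols[i].name, sym) == 0) {
				provider->refcount++;
				consumer->deps[consumer->ndeps++] = provider;
				return (provider->symbols[i].value);
			}
		}
		return (NULL);		/* unresolved */
	}

	/* A module may unload only when nothing consumes its symbols. */
	int
	module_unload_ok(struct module *mp)
	{
		return (mp->refcount == 0);
	}

	/* On unload, release every provider this module consumed. */
	void
	module_release_deps(struct module *mp)
	{
		int i;

		for (i = 0; i < mp->ndeps; i++)
			mp->deps[i]->refcount--;
		mp->ndeps = 0;
	}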
> 2.3.1. ISA devices
>
> Currently ISA devices are included in a kernel by using config(8)
> to generate a list of device instances in ioconf.c which reference
> drivers statically compiled into the kernel.  Few drivers support
> dynamic loading and those that do have hardcoded device instances
> built into the LKM (see sys/i386/isa/joy.c for an example).
>
> ISA drivers will register themselves by name using SYSINIT(9).
> This would happen either at boot time for static drivers or at
> module load time for dynamic drivers.

Within the bounds of whether SYSINIT() is a data or a functional
interface.  It needs to be functional for this to work, and it is
currently data.

> Device instances (struct isa_device) will refer to their driver by
> name rather than by pointer.  The name-to-driver mapping is
> performed and the device is probed and attached as normal.

I'm not clear on why this abstraction is necessary or useful???
(A sketch of what such name-based registration might look like,
assuming a functional SYSINIT(), follows at the end of this
section.)

> Statically configured devices are placed in a table by config(8)
> and modules containing their drivers are added to the kernel
> Makefile.

I would prefer that modules be built as separate single object files
(potentially aggregating multiple object files into one using
"ld -r").  A configuration is then simply a list of modules.

I'm not sure if I like the idea of keeping a "config" around as
anything other than a set of linker directives (in the a.out case),
or, as a vastly preferable alternative, as input to an ELF section
librarian for an aggregate kernel image.

This second option would still leave us configurable at the binary
level, while leaving the future direction open for fallback
(firmware-based) drivers and not yet requiring that the fallback
drivers exist in order to get through the boot stage (if the image
is in a single file, then the single file is generally accessible
without fallback driver support).

> When an ISA device is configured dynamically, first the module
> which contains its driver is loaded if not already present and
> secondly a system call is used to create a new device instance and
> to call the driver to probe and attach the new device.  It is
> probably worth writing a new utility, isaconf(8), which can add
> new ISA device instances to the kernel.

Careful, this is rocky terrain.  ISA devices (non-PnP ones) have a
nasty habit of having probe order dependencies.  In addition, it
might be useful to separate the probe sequence from the device
instance.

There is still no useful VM mechanism for dealing with object
persistence and/or kernel paging of non-paging-critical code and
data within the kernel itself.  One problem here is that the
distinction between "high persistence" and "low persistence" VM
objects is made *ONLY* at the kernel/user separation, with all
kernel objects considered to be high persistence.

With a load mechanism in place, and in common rather than rare use
(unlike the current statically loaded kernel), these "medium
persistence" objects could become a serious issue regarding
fragmentation of the kernel VM space.  Some of the recently
discussed techniques for recovering contiguous memory spaces for
drivers that need them, late in the kernel lifetime, would probably
work, but most of these techniques are very high overhead.

What is needed is (1) kernel paging support and (2) policy
attribution of modular components so that the paging policy can be
modified based on the object persistence.  Obviously, probe code is
typically run only once (exceptions: PCMCIA, laptop pluggable
devices, etc.) and is never needed again.  This gets into issues of
section coloring and ELF section/segment support before it gets any
cleaner.  So a bit of caution is highly recommended.

> A desirable feature for a new module system would be to allow
> drivers to `detach' themselves from device instances, allowing a
> dynamically loaded driver to be unloaded cleanly.

This goes for shutdown of UART FIFOs, for instance, which will not
be correctly reset by most BIOSes.
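
As promised above, a hypothetical sketch of name-based ISA driver
registration, assuming a functional SYSINIT().  The registry
(isa_driver_register(), isa_driver_lookup()) and the trimmed-down
structures are invented for illustration and are not existing kernel
interfaces.

	#include <string.h>

	/* A device instance names its driver instead of pointing
	 * at it. */
	struct isa_device {
		const char	*driver_name;	/* e.g. "ed", "sio" */
		int		iobase;
		int		irq;
	};

	/* Invented registry mapping driver names to entry points. */
	struct isa_driver {
		const char	*name;
		int		(*probe)(struct isa_device *);
		int		(*attach)(struct isa_device *);
		struct isa_driver *next;
	};

	static struct isa_driver *isa_drivers;

	void
	isa_driver_register(struct isa_driver *dp)
	{
		dp->next = isa_drivers;
		isa_drivers = dp;
	}

	/* The name-to-driver mapping, performed at probe/attach. */
	struct isa_driver *
	isa_driver_lookup(const char *name)
	{
		struct isa_driver *dp;

		for (dp = isa_drivers; dp != NULL; dp = dp->next)
			if (strcmp(dp->name, name) == 0)
				return (dp);
		return (NULL);
	}

	/* Stub probe/attach for illustration. */
	static int ed_probe(struct isa_device *idp)  { return (0); }
	static int ed_attach(struct isa_device *idp) { return (0); }

	static struct isa_driver ed_driver = { "ed", ed_probe, ed_attach };

	static void
	ed_init(void *unused)
	{
		isa_driver_register(&ed_driver);
	}

	/*
	 * With a functional SYSINIT(), this is the registration
	 * hook, run at boot for a static module or at load time
	 * for a dynamic one:
	 *
	 *	SYSINIT(ed, SI_SUB_DRIVERS, SI_ORDER_ANY, ed_init, NULL);
	 */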
> If a driver is unloaded, it releases any resources such as
> interrupts allocated for devices attached to it.  These devices
> become unassigned, as if they were not successfully probed.  This
> allows driver developers to repeatedly load and unload modules
> without rebooting.

The only issues are ones of eventual VM space fragmentation, as
subsequent versions of drivers change size... this is something we
will eventually have to address, but it is livable to force the
developer to reboot (IMO) at present.

> Supporting static as well as dynamic modules makes the single
> module per object file paradigm of the existing LKM system
> difficult to maintain.  A better approach is to separate the idea
> of a kernel module (a single kernel subsystem) from the idea of a
> kernel object file.  The boot kernel should be thought of as
> simply a kernel object file which contains the modules that were
> configured statically.  Dependencies between modules are also
> better treated as dependencies between object files (since they
> are typically linking dependencies).

Is this ELF advocacy, or something else?  The object module per LKM
is still a valid approach (ld -r).  Perhaps you are considering a
module that is set up as two distinct (and reusable) components?
If so, I would argue that allowing dependencies and breaking it into
two modules accomplishes much the same thing.

> The new system will use a kernel linker which can load object
> files into the kernel address space.  After loading, sysinits from
> the new object file are run, allowing any modules contained
> therein to register themselves.  The linker will keep track of
> which modules are contained in which object so that when a user
> unloads the object, the modules can be informed of the event.

Other than name, there is no difference between this and the
"_entry" mechanism for identifying entry points, IMO.  The big issue
is, again, the distinction between data based (linker set) and
function call based SYSINIT() mechanisms.

					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.