Date: Fri, 21 Aug 2009 14:10:35 +0200
From: Fabien Thomas <fabien.thomas@netasq.com>
To: Julian Elischer <julian@elischer.org>
Cc: FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: pf and vimage
Message-ID: <1321ED43-81C5-4507-AFC0-4B2DEE71BB78@netasq.com>
In-Reply-To: <4A8D76FE.7040302@elischer.org>
References: <4A8CFDAF.1000309@delphij.net> <200908201108.39177.max@love2party.net> <4A8D76FE.7040302@elischer.org>

Thanks, very useful!
Do you have an "official" page where updates will be posted?
What do you think of putting it on the FreeBSD Wiki?
Fabien
On 20 Aug 2009, at 18:17, Julian Elischer wrote:
> There were some people looking at adding vnet support to pf.
> Since we discussed it last, the rules of the game have
> significantly changed for the better. With the addition
> of some new facilities in FreeBSD, the work needed to virtualize
> a module has significantly decreased.
>
>
> The following doc gives the new rules.
>
>
> August 17 2009
> Julian Elischer
>
> ===================
> Vimage: what is it?
> ===================
>
> Vimage is a framework in the BSD kernel which allows a co-operating
> module to operate on multiple independent instances of its state so
> that it can participate in a virtual machine / virtual environment
> scenario. It refers to a part of the Jail infrastructure in FreeBSD.
> For historical reasons "Virtual network stack enabled jails"(1) are
> also known as "vimage enabled jails"(2) or "vnet enabled jails"(3).
> The currently correct term is the last of these, which is a
> contraction of the first. In the future other parts of the system may
> be virtualized using the same technology and the term to cover all
> such components would be VIMAGE enhanced modules.
>
> The implementation approach taken by the vimage framework is a
> redefinition of selected global state variables to evaluate to
> constructs that allow for the virtualized state to be stored and
> resolved in appropriate instances of 'jail' specific container storage
> regions. The code operating on virtualized state has to conform to a
> set of rules described further below, among other things in order to
> allow for all the changes to be conditionally compilable, i.e.
> permitting the virtualized code to fall back to operation on global
> state.
>
> The rest of this document will discuss NETWORK virtualization
> though the concepts may be true in the future for other parts of the
> system.
>
> The most visible change throughout the existing code is typically
> replacement of direct references to global variables with macros;
> foo_bar thus becomes V_foo_bar. V_foo_bar macros will resolve back to
> the foo_bar global in default kernel builds, and alternatively to the
> logical equivalent of some_base_pointer->_foo_bar for "options VIMAGE"
> kernel configs.
>
> Prepending of "V_" prefixes to variable references helps in
> visual discrimination between global and virtualized state.
> It is also possible to use an alternative syntax, of VNET(foo_bar) to
> achieve the same thing. The developers felt that V_foo_bar was less
> visually distracting while still providing enough clues to the reader
> that the variable is virtualized. In fact the V_foo_bar macro is
> locally defined near the definition of foo_bar to be an alias for
> VNET(foo_bar) so the two are not only equivalent, they are the same.
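A minimal userland sketch of the macro indirection described above. It is a model only: MODEL_VIMAGE, MODEL_VNET, struct model_vnet and model_curvnet are hypothetical names invented here, and the real <net/vnet.h> resolves names through an offset into per-vnet storage rather than a plain struct member.

```c
/* Userland model only; nothing here is the kernel's <net/vnet.h>. */
#include <assert.h>
#include <stddef.h>

#define MODEL_VIMAGE 1		/* remove to model a default kernel build */

/* Hypothetical stand-in for one instance of the per-vnet storage. */
struct model_vnet {
	int foo_bar;
};

#ifdef MODEL_VIMAGE
/* Virtualized build: resolve the name through the current instance. */
static struct model_vnet *model_curvnet;
#define MODEL_VNET(n)	(model_curvnet->n)
#else
/* Default build: the macro collapses to the one global variable. */
static int foo_bar;
#define MODEL_VNET(n)	(n)
#endif

/* Defined next to foo_bar, as the text describes. */
#define V_foo_bar	MODEL_VNET(foo_bar)

/* Code written against V_foo_bar compiles unchanged in both builds. */
static int
bump_foo_bar(void)
{
#ifdef MODEL_VIMAGE
	assert(model_curvnet != NULL);	/* models the fault on a NULL context */
#endif
	return (++V_foo_bar);
}
```

With two struct model_vnet instances, pointing model_curvnet at one and then the other makes bump_foo_bar() update each instance's own counter, which is the behaviour the V_ prefix is meant to signal to the reader.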
>
> The framework also extends the sysctl infrastructure to support
> access to virtualized state through introduction of the SYSCTL_VNET
> family of macros; those also automatically fall back to their standard
> SYSCTL counterparts in default kernel builds.
>
> Transparent libkvm(3) lookups are provided to virtualized variables
> which permits userland binaries such as netstat to operate unmodified
> on "options VIMAGE" kernels, though this may have some security
> implications.
>
> Vnets are associated with jails. In 8.0, every process is associated
> with a jail, usually the default (null) jail, and jails currently hang
> off of a process's ucred. This relationship defines a process's
> administrative affinity to a vnet and thus indirectly to all of its
> state. All network interfaces and sockets hold pointers back to their
> associated vnets. This relationship is obviously entirely independent
> from proc->ucred->jail bindings. Hence, when a process opens a socket,
> the socket will get bound to a vnet instance hanging off of
> proc->ucred->jail->vnet, but once such a socket->vnet binding gets
> established, it cannot be changed for the entire socket lifetime.
>
> The mapping from a thread to a vnet should always be done via the
> TD_TO_VNET macro, as the path may change in the future as we get more
> experience with using the system.
>
> Certain classes of network interfaces (Ethernet in particular) can be
> reassigned from one vnet to another at any time. By definition all
> vnets are independent and can communicate only if they are explicitly
> provided with communication paths. Currently mainly netgraph is used
> to establish inter-vnet datapaths, though other paths are being
> explored such as the 'epair' back-to-back virtual interface pair, in
> which the different sides may exist in different jails.
>
> In network traffic processing the vnet affinity is defined either by
> the inbound interface or by the socket / pcb -> vnet binding. However,
> there are many functions in the network stack that cannot implicitly
> fetch the vnet context from their standard arguments. Instead of
> explicitly extending argument lists of such functions with a
> struct vnet *, the concept of a "current vnet", a per-thread variable,
> was introduced, which can be fetched efficiently via the curvnet
> macro. The correct network context has to be set on entry to the
> network stack (socket operations, packet reception, or timer-driven
> functions) and cleared on exit. This must be done via the provided
> CURVNET_SET() / CURVNET_RESTORE() family of macros, which allow for
> "stacking" of curvnet context setting and provide additional debugging
> info in INVARIANTS kernel configs. In most cases however a developer
> writing virtualized code will not have to set / restore the curvnet
> context unless the code would include timer-driven events, given that
> those are inherently vnet-contextless on entry.
>
> The current rule is that when not in networking code, the 'curvnet'
> macro will return NULL and evaluating a V_xxx (or VNET(xxx)) macro
> will result in a kernel page-fault error. While this is not strictly
> necessary, it aids in debugging and assurance of program correctness.
> Note this does NOT mean that TD_TO_VNET(curthread) is invalid.
> A thread is always associated with a vnet; only the efficient
> "curvnet" access method is disabled, along with the ability to resolve
> virtualized symbols.
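The "stacking" of the current-vnet context can be pictured with a small userland model. This is a sketch under assumptions: the MODEL_ macros, struct model_vnet and model_curvnet are hypothetical names, and the model uses a plain static pointer where the kernel uses a per-thread value. Each "set" saves the previous context in a local variable and each "restore" puts it back, so nested entry points unwind cleanly and the context is NULL again once the outermost one returns.

```c
/* Userland model only; the MODEL_ names are invented for illustration. */
#include <assert.h>
#include <stddef.h>

struct model_vnet {
	int hits;
};

/* NULL whenever we are not inside "networking code". */
static struct model_vnet *model_curvnet;

/* Save the caller's context in a local, then switch to the new one. */
#define MODEL_CURVNET_SET(v)					\
	struct model_vnet *model_saved_vnet = model_curvnet;	\
	model_curvnet = (v)
/* Put back whatever context was active before the matching SET. */
#define MODEL_CURVNET_RESTORE()	model_curvnet = model_saved_vnet

#define V_hits	(model_curvnet->hits)

/* A nested entry point that works on a different instance. */
static void
inner_work(struct model_vnet *other)
{
	MODEL_CURVNET_SET(other);
	V_hits++;
	MODEL_CURVNET_RESTORE();
}

static void
entry_point(struct model_vnet *mine, struct model_vnet *other)
{
	MODEL_CURVNET_SET(mine);
	V_hits++;
	inner_work(other);
	assert(model_curvnet == mine);	/* inner restore brought us back */
	V_hits++;
	MODEL_CURVNET_RESTORE();
}
```

After entry_point(&a, &b) the first instance has been touched twice, the second once, and model_curvnet is NULL again, mirroring the rule that the context is only valid between set and restore.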
>
>
> Converting / virtualizing existing code
> =======================================
>
> There are several steps needed in virtualisation.
>
> 1/ Decide whether the module needs to be virtualised.
>
> If the module is a driver for specific hardware, it makes sense that
> there be only one instance of the driver as there is only one piece of
> physical hardware. There are changes in the networking code to allow
> physical (or virtual) interfaces to be moved between vnets. This
> generally requires NO changes to the network drivers of the classes
> covered (e.g. ethernet). Currently if your module does not have any
> networking facet, the answer is "no" by default.
>
> 2/ If the module is to be virtualised, decide which attributes of the
> module should be virtualised.
>
> For example, it may make sense that there be a single central pool
> of "struct foo" and a single uma zone for them to come from, with a
> single lock guarding it. It might also make sense if the "foo_debug"
> sysctl controls all the instances at once, while on the other hand,
> the "foo_mode" sysctl might make better sense if it were controllable
> on a virtual system by virtual system basis.
>
> 3/ Work out what global variables and structures are to be virtualised
> to achieve the behaviour required for part #2.
>
> 4/ Work out for all the code paths through the module, how the thread
> entering the module can divine which virtual environment it is on.
>
> Some examples:
> * Since interfaces are all assigned to one vnet or another, an
> incoming packet has a pointer to the receive interface, which in turn
> has a pointer back to the vnet. Often "curvnet" will already have been
> set by the time your code is called anyhow.
> * Similarly, on any request from outside the kernel (direct or
> indirect), the current thread has a way to get to the current virtual
> environment instance via TD_TO_VNET(curthread). For existing sockets
> the vnet context must be taken from so->so_vnet, since the thread's
> vnet might change after socket creation.
> * Timer initiated actions usually have a (void *) argument which
> points to some private structure for the module. It should be possible
> to add a pointer to the appropriate module instance into whatever
> structure that points to.
> * Sometimes an action (timer triggered, or triggered by module load or
> unload) simply has to check all the vimage or module instances.
> There are macro (pairs) for this which will iterate through all the
> vnet instances. (see sample code below).
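The timer case in the list above can be sketched in userland as follows. Everything here is a model under assumptions: struct mymod_timer_arg, its ta_vnet field, model_curvnet and the function names are hypothetical, and the callback is called directly rather than by a real kernel timer. The point is only that the private structure handed to the callback carries the instance pointer, so a contextless callback can recover its vnet.

```c
/* Userland model only; names are invented for illustration. */
#include <assert.h>
#include <stddef.h>

struct model_vnet {
	int expired;
};

static struct model_vnet *model_curvnet;	/* NULL when contextless */
#define V_expired	(model_curvnet->expired)

/* Hypothetical private per-instance state handed to the timer. */
struct mymod_timer_arg {
	struct model_vnet *ta_vnet;	/* instance stashed when arming */
	int ta_ticks;
};

/* Module code that touches virtualized state. */
static void
mymod_expire(void)
{
	assert(model_curvnet != NULL);	/* only valid while a context is set */
	V_expired++;
}

/* Timer-style callback: entered without a context, recovers it from arg. */
static void
mymod_timeout(void *arg)
{
	struct mymod_timer_arg *ta = arg;
	struct model_vnet *saved = model_curvnet;

	model_curvnet = ta->ta_vnet;
	mymod_expire();
	model_curvnet = saved;
}
```

Calling mymod_timeout() with the argument structure of one instance bumps only that instance's counter and leaves the context NULL afterwards.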
>
> This covers most of the cases, however in some cases it may still be
> required for the module to stash away the virtual environment instance
> somewhere, and make associated changes in the code.
>
> 5/ Decide which parts of the initialization and teardown are per jail
> and which parts are global, and separate out the code accordingly.
> Global initialization is done using the SYSINIT facility.
> Per jail initialization is done using VNET_SYSINIT().
> Per jail teardown is done using VNET_SYSUNINIT().
> Global teardown is done using SYSUNINIT().
> In addition, the modevent handler is called with various event types
> before any of these are called. The modevent handler may veto load or
> teardown. On shutdown, only the modevent handler is called, so it may
> have to simulate the calling of the other handlers if clean shutdown
> is a requirement of your module. (see sample code below). Don't forget
> to unregister event handlers, and destroy locks and condition
> variables.
>
> 6/ Add the code described below to the files that make up the module.
>
> Details: (VNET implementation details)
>
> Firstly the file <net/vnet.h> must be included. Depending on what
> code you use you may find you also need one or more of: <sys/proc.h>,
> <sys/ucred.h> and <sys/jail.h>. These requirements may change slightly
> as the ABI settles.
>
> Having decided which variables need to be virtualized, the definition
> of those variables needs to be modified to use the VNET_DEFINE() macro.
> For example:
>
> static int foo = 3;
> struct bar thebar = { 1,2,3 };
>
> would become:
>
> static VNET_DEFINE(int, foo) = 3;
> VNET_DEFINE(struct bar, thebar) = { 1,2,3 };
>
> extern int foo;
> in an include file might become:
> VNET_DECLARE(int, foo);
>
> Normal rules regarding 'static/extern' apply. The initial values that
> you give in this way will be stored and used as the initial values for
> EACH NEW INSTANCE of these variables as new jails/vnets are created.
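One way to picture "initial values for each new instance" is a template that is copied whenever an instance is created. The following userland sketch assumes exactly that; struct model_vnet_data, model_defaults and model_vnet_alloc() are hypothetical names, and struct bar is given three int members purely so the { 1,2,3 } initializer from the example has somewhere to go.

```c
/* Userland model only; not the kernel's allocation code. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct bar {
	int a, b, c;
};

/* Hypothetical layout of the virtualized variables of one module. */
struct model_vnet_data {
	int foo;
	struct bar thebar;
};

/* Template holding the initializers written at the definitions. */
static const struct model_vnet_data model_defaults = {
	3,		/* foo = 3 */
	{ 1, 2, 3 }	/* thebar = { 1,2,3 } */
};

/* Every new instance starts life as a copy of the template. */
static struct model_vnet_data *
model_vnet_alloc(void)
{
	struct model_vnet_data *vd = malloc(sizeof(*vd));

	assert(vd != NULL);	/* model only: real code would handle failure */
	memcpy(vd, &model_defaults, sizeof(*vd));
	return (vd);
}
```

Changing foo in one instance does not affect another, and an instance created later still starts from the original initial values rather than from any instance's current state.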
>
> As mentioned above, accesses to virtualized symbols are achieved via
> macros, which generally are of the same name as the original symbol
> but with a "V_" prepended; thus the head of the interface list, called
> 'ifnet', is replaced wherever used with "V_ifnet". We do this by
> adding the following lines after the definitions above:
>
> #define V_foo VNET(foo)
> #define V_thebar VNET(thebar)
>
> --- side-note ---
> In SCTP, because the code is shared with other OS's, they are replaced
> with a macro MODULE_GLOBAL(modulename, symbol).
> (this may simplify in light of recent changes).
> --------------
>
> In addition, should any of your values need to be changed or viewed
> via sysctl, the following SYSCTL definitions would be needed:
>
> SYSCTL_VNET_PROC(_net_inet, OID_AUTO, thebar,
> CTLTYPE_?? | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(thebar), 0,
> thebar, "?", "the bar is open");
> {[XXX] robert fix this if possible ^^^}
> SYSCTL_VNET_INT(_net_inet, OID_AUTO, foo,
> CTLFLAG_RW, &VNET_NAME(foo), 0, "size of foo");
>
>
> In the current version of vimage, when VIMAGE is not compiled into
> the kernel, the macros evaluate to a direct reference to the one and
> only symbol/variable, so that there is no speed penalty for those not
> using vnets.
>
> When VIMAGE is compiled in, the macro will evaluate to an access to an
> offset into a data structure that is accessed on a per-vnet basis. The
> vnet used for this is always curvnet. For this reason an attempt to
> access such a variable while curvnet is not valid will result in an
> exception.
>
> To ensure that curvnet has a valid value when needed one needs to
> add the following code on all entry code paths into the networking
> code:
>
> int
> my_func(int arg)
> {
>     CURVNET_SET(TD_TO_VNET(curthread));
>     do_my_network_stuff(arg);
>     CURVNET_RESTORE();
>     return (0);
> }
>
> The initial value is usually something like TD_TO_VNET(curthread),
> which in turn is a macro that derives the vnet affinity from the
> current thread. It could also be (m->m_ifp->if_vnet) if we were
> receiving an mbuf, or so->so_vnet if we had a socket involved.
>
> Usually, when a packet enters the system it is carried through the
> processing path via a single thread, and that thread will set its
> virtual environment reference to that indicated by the packet on
> picking up that new packet. This means that in the normal inbound
> processing path as well as the outgoing process path the current
> thread can be used to indicate the current virtual environment, and
> curvnet will always be valid once most user supplied code is reached.
> In timer events, it is sometimes necessary to add an "outer loop" to
> iterate through all the possible vnets if there is just one timer for
> all instances.
>
> When a new loadable module is virtualised the module definitions
> and initializers need to be examined. The following example illustrates
> what is needed in the case that you are not loading a new protocol
> or domain (for that see later).
>
> ============= sample skeleton code ==========
>
> /* init on boot or module load */
> static int
> mymod_init(void)
> {
>     int error = 0;
>
>     /* one-time, global setup goes here; set error on failure */
>     return (error);
> }
>
> /****************
>  * Stuff that must be initialized for every instance
>  * (including the first of course).
>  */
> static int
> mymod_vnet_init(const void *unused)
> {
>     return (0);
> }
>
> /**********************
>  * Called for the removal of the last instance only on module unload.
>  */
> static void
> mymod_uninit(void)
> {
> }
>
> /***********************
>  * Called for the removal of each instance.
>  */
> static int
> mymod_vnet_uninit(const void *unused)
> {
>     return (0);
> }
>
> static int
> mymod_modevent(module_t mod, int type, void *unused)
> {
>     int err = 0;
>
>     switch (type) {
>     case MOD_LOAD:
>         /* check that loading is ok */
>         break;
>
>     case MOD_UNLOAD:
>         /* check that unloading is ok */
>         break;
>
>     case MOD_QUIESCE:
>         /* warning: try to stop processing */
>         /* maybe sleep 1 ms or something to let threads get out */
>         break;
>
>     case MOD_SHUTDOWN:
>         /*
>          * This is called once but you may want to shut down
>          * things in each jail, or something global.
>          * In that case it's up to us to simulate the SYSUNINIT()
>          * or the VNET_SYSUNINIT().
>          */
>         {
>             VNET_ITERATOR_DECL(vnet_iter);
>
>             VNET_LIST_RLOCK();
>             VNET_FOREACH(vnet_iter) {
>                 CURVNET_SET(vnet_iter);
>                 mymod_vnet_uninit(NULL);
>                 CURVNET_RESTORE();
>             }
>             VNET_LIST_RUNLOCK();
>         }
>         /* you may need to shutdown something global. */
>         mymod_uninit();
>         break;
>
>     default:
>         err = EOPNOTSUPP;
>         break;
>     }
>     return (err);
> }
>
> static moduledata_t mymodmod = {
> "mymod",
> mymod_modevent,
> 0
> };
>
> /* define execution order using constants from /sys/sys/kernel.h */
> #define MYMOD_MAJOR_ORDER SI_SUB_PROTO_BEGIN /* for example */
> #define MYMOD_MODULE_ORDER (SI_ORDER_ANY + 64) /* not fussy */
> #define MYMOD_SYSINIT_ORDER (MYMOD_MODULE_ORDER + 1) /* a bit later */
> #define MYMOD_VNET_ORDER (MYMOD_MODULE_ORDER + 2) /* later still */
>
> DECLARE_MODULE(mymod, mymodmod, MYMOD_MAJOR_ORDER,
> MYMOD_MODULE_ORDER);
> /* depend on ipfw version (exactly) 2 */
> MODULE_DEPEND(mymod, ipfw, 2, 2, 2);
> MODULE_VERSION(mymod, 1);
>
> SYSINIT(mymod_init, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
> mymod_init, NULL);
> SYSUNINIT(mymod_uninit, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
> mymod_uninit, NULL);
>
> VNET_SYSINIT(mymod_vnet_init, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
> mymod_vnet_init, NULL);
> VNET_SYSUNINIT(mymod_vnet_uninit, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
> mymod_vnet_uninit, NULL);
>
>
> ========== end sample code =======
>
> On BOOT, the order of evaluation will be:
> In a NON-VIMAGE kernel where the module is compiled in:
> MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by
> their order declarations. {good foot shooting material if you get it
> wrong!}
>
> In a VIMAGE kernel where the module is compiled in:
> MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by
> their order declarations. AND in addition, the VNET_SYSINIT is
> repeated once for every existing or new jail/vnet.
>
> On loading a vnet enabled kernel module after boot:
> MODEVENT("event = load");
> SYSINIT()
> VNET_SYSINIT() for every existing jail
> AND in addition, VNET_SYSINIT being called for each new jail
> created.
>
> On unloading of module:
> MODEVENT("event = MOD_QUIESCE")
> MODEVENT("event = MOD_UNLOAD")
> VNET_SYSUNINIT called for every jail/vnet
> SYSUNINIT
>
> On system shutdown:
> MODEVENT(shutdown)
>
> NOTICE that while the order of the SYSINIT and VNET_SYSINIT is
> reversed from that of SYSUNINIT and VNET_SYSUNINIT, MODEVENTS do not
> follow this rule and thus it is dangerous to initialise and
> uninitialise things which are order dependent using MODEVENTs.
>
> Or, put another way,
> since MODEVENT is called first during module load, it would, by the
> assumption that everything is reversed, be easy to assume that
> MODEVENT is called AFTER the SYSUNINITS during unload. This is in fact
> not the case. (and I have the scars to prove it).
>
> It might make some sense if the "QUIESCE" was called before the
> SYSINIT/SYSUNINIT and the UNLOAD called after.. with a millisecond
> sleep between them, but this is not the case either.
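The load and unload sequences listed above can be replayed with a small userland model that only records the order of events. This is a sketch, not kernel behaviour captured from a running system: MODEL_NVNETS, model_note(), model_load() and model_unload() are hypothetical, and the sequences are simply those stated in this document for a module loaded after boot with two existing vnets.

```c
/* Userland model only; it records the order stated in the text. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_NVNETS	2	/* assumed number of existing vnets */

static char model_log[256];

/* Append one event name to the log, separated by single spaces. */
static void
model_note(const char *what)
{
	size_t used = strlen(model_log);

	assert(used + strlen(what) + 2 <= sizeof(model_log));
	snprintf(model_log + used, sizeof(model_log) - used, "%s%s",
	    used ? " " : "", what);
}

/* Module load after boot: modevent first, then global, then per vnet. */
static const char *
model_load(void)
{
	int i;

	model_log[0] = '\0';
	model_note("MOD_LOAD");
	model_note("SYSINIT");
	for (i = 0; i < MODEL_NVNETS; i++)
		model_note("VNET_SYSINIT");
	return (model_log);
}

/* Module unload: the modevents still come FIRST, not last. */
static const char *
model_unload(void)
{
	int i;

	model_log[0] = '\0';
	model_note("MOD_QUIESCE");
	model_note("MOD_UNLOAD");
	for (i = 0; i < MODEL_NVNETS; i++)
		model_note("VNET_SYSUNINIT");
	model_note("SYSUNINIT");
	return (model_log);
}
```

Comparing the two logs shows the asymmetry: the SYSINIT/VNET_SYSINIT part is mirrored on unload, but the modevent stays at the front in both directions.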
>
> Since initial values are copied into the virtualized variables
> on each new instantiation, it is quite possible to have modules for
> which some of the above methods are not needed, and they may be left
> out (but not the modevent).
>
> Sometimes there is a need to iterate through the vnets.
> See the modevent shutdown handler (above) for an example of how to
> do this.
> Don't forget the locks.
>
> In the case where you are loading a new protocol, or domain (protocol
> family), there are some "shortcuts" that are in place to allow you to
> maintain a bit more source compatibility with older revisions of
> FreeBSD. It must be added that the sample code above works just fine
> for protocols; however, protocols also have an additional
> initialization vector, which is via the protocol structure, which has
> a pr_init() entry.
> When a protocol is registered using pf_proto_register(), the pr_init()
> for the protocol is called once for every existing vnet. In addition,
> it will be called for each new vnet. The pr_destroy() method will be
> called as well on vnet teardown. The pf_proto_register() function can
> be called either from a modevent handler or from the SYSINIT() if you
> have one, and the pf_proto_unregister() called from the SYSUNINIT or
> the unload modevent handler.
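The "once for every existing vnet, and again for each new vnet" rule can be modelled in userland as below. This is only an illustration of the calling pattern described above, not the implementation of pf_proto_register(): model_proto_register(), model_vnet_create(), the fixed-size tables and the integer vnet ids are all hypothetical.

```c
/* Userland model only; not the kernel's protocol registration code. */
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX	8	/* arbitrary table size for the model */

/* Stand-in for a protocol's pr_init(), told which vnet it runs for. */
typedef void (*model_pr_init_t)(int vnet_id);

static model_pr_init_t model_protos[MODEL_MAX];
static int model_nprotos;
static int model_nvnets;
static int model_init_calls[MODEL_MAX];	/* init calls seen per vnet */

/* Example initializer: just counts how often each vnet was initialized. */
static void
model_count_init(int vnet_id)
{
	model_init_calls[vnet_id]++;
}

/* Registration: run the initializer once for every vnet that exists. */
static void
model_proto_register(model_pr_init_t init)
{
	int v;

	assert(model_nprotos < MODEL_MAX);
	model_protos[model_nprotos++] = init;
	for (v = 0; v < model_nvnets; v++)
		init(v);
}

/* vnet creation: run every registered initializer for the newcomer. */
static int
model_vnet_create(void)
{
	int p, v;

	assert(model_nvnets < MODEL_MAX);
	v = model_nvnets++;
	for (p = 0; p < model_nprotos; p++)
		model_protos[p](v);
	return (v);
}
```

Creating two vnets, registering the protocol, and then creating a third leaves every vnet initialized exactly once, which is why the initializer must tolerate being called multiple times.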
>
> If you are adding a whole new protocol domain (protocol family), then
> you should add the VNET_DOMAIN_SET(domainname) (e.g. inet, inet6)
> macro. These use VNET_SYSINIT internally to indirectly call the
> dom_init() and pr_init() functions for each vnet (and the equivalent
> for teardown). In this case one needs to be absolutely sure that both
> your domain and protocol initializers can be called multiple times,
> once for each vnet. One can still add SYSINITs for once only
> initialization, or use the modevent handler. I prefer to do as much
> explicitly in the SYSINITS and VNET_SYSINITS as then you have no
> surprises.
>
> Finally:
> The command to make a new jail with a new vnet:
> jail -c host.hostname=test path=/ vnet command=/bin/tcsh
> jail -c host.hostname=test path=/ children.max=4 vnet command=/bin/tcsh
> (children.max allows hierarchical jail creation).
> Note that the command must come last.
>
>
