Date:      Fri, 21 Aug 2009 14:10:35 +0200
From:      Fabien Thomas <fabien.thomas@netasq.com>
To:        Julian Elischer <julian@elischer.org>
Cc:        FreeBSD Net <freebsd-net@freebsd.org>
Subject:   Re: pf and vimage 
Message-ID:  <1321ED43-81C5-4507-AFC0-4B2DEE71BB78@netasq.com>
In-Reply-To: <4A8D76FE.7040302@elischer.org>
References:  <4A8CFDAF.1000309@delphij.net> <200908201108.39177.max@love2party.net> <4A8D76FE.7040302@elischer.org>

Thanks, very useful!
Do you have an "official" page where we can look for updates?
What do you think of putting it on the FreeBSD Wiki?

Fabien

On 20 Aug 2009, at 18:17, Julian Elischer wrote:

> There were some people looking at adding vnet support to pf.
> Since we last discussed it, the rules of the game have
> changed significantly for the better. With the addition
> of some new facilities in FreeBSD, the work needed to virtualize
> a module has decreased significantly.
>
>
> The following doc gives the new rules.
>
>
> August 17 2009
> Julian Elischer
>
> ===================
> Vimage: what is it?
> ===================
>
> Vimage is a framework in the BSD kernel which allows a co-operating module
> to operate on multiple independent instances of its state so that it can
> participate in a virtual machine / virtual environment scenario. It refers
> to a part of the Jail infrastructure in FreeBSD. For historical reasons
> "Virtual network stack enabled jails"(1) are also known as "vimage enabled
> jails"(2) or "vnet enabled jails"(3).  The currently correct term is the
> last, which is a contraction of the first. In the future other parts of
> the system may be virtualized using the same technology and the term to
> cover all such components would be VIMAGE enhanced modules.
>
> The implementation approach taken by the vimage framework is a redefinition
> of selected global state variables to evaluate to constructs that allow for
> the virtualized state to be stored and resolved in appropriate instances of
> 'jail' specific container storage regions.  The code operating on virtualized
> state has to conform to a set of rules described further below. Among other
> things, these rules allow all the changes to be conditionally compilable,
> i.e. they permit the virtualized code to fall back to operation on global
> state.
>
> The rest of this document will discuss NETWORK virtualization
> though the concepts may be true in the future for other parts of the
> system.
>
> The most visible change throughout the existing code is typically replacement
> of direct references to global variables with macros; foo_bar thus becomes
> V_foo_bar.  V_foo_bar macros will resolve back to the foo_bar global in
> default kernel builds, and alternatively to the logical equivalent of
> some_base_pointer->_foo_bar for "options VIMAGE" kernel configs.
>
> Prepending of "V_" prefixes to variable references helps in
> visual discrimination between global and virtualized state.
> It is also possible to use an alternative syntax, of VNET(foo_bar) to
> achieve the same thing. The developers felt that V_foo_bar was less
> visually distracting while still providing enough clues to the reader
> that the variable is virtualized. In fact the V_foo_bar macro is
> locally defined near the definition of foo_bar to be an alias for
> VNET(foo_bar) so the two are not only equivalent, they are the same.
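
To make the V_foo_bar resolution concrete for myself, here is a small userland sketch of the idea. It is only a simulation: struct vnet_sim, curvnet_sim, VNET_SIM() and VIMAGE_SIM are names I made up and are not the kernel's definitions, and the real per-vnet storage layout is different.

```c
#include <assert.h>
#include <stddef.h>

/* Set to 0 to model a default (non-VIMAGE) build. */
#define VIMAGE_SIM 1

/* Hypothetical per-vnet storage region; the real layout differs. */
struct vnet_sim {
	int _foo_bar;
};

/* Stand-in for the per-thread "current vnet" pointer. */
static struct vnet_sim *curvnet_sim;

#if VIMAGE_SIM
/* Virtualized build: resolve through the current instance. */
#define VNET_SIM(n)	(curvnet_sim->_##n)
#else
/* Default build: resolve straight back to the single global. */
static int foo_bar;
#define VNET_SIM(n)	(n)
#endif

/* Defined next to foo_bar, as the quoted text describes. */
#define V_foo_bar	VNET_SIM(foo_bar)

/* Increment foo_bar in the given instance and return the new value. */
int
bump_foo_bar(struct vnet_sim *vnet)
{
	struct vnet_sim *saved = curvnet_sim;
	int value;

	assert(vnet != NULL);
	curvnet_sim = vnet;
	value = ++V_foo_bar;
	curvnet_sim = saved;
	return (value);
}
```

With VIMAGE_SIM set to 1, two struct vnet_sim instances keep separate counters; with 0, every caller shares the one global, which is the fallback behaviour described above.
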
>
> The framework also extends the sysctl infrastructure to support access to
> virtualized state through introduction of the SYSCTL_VNET family of macros;
> those also automatically fall back to their standard SYSCTL counterparts
> in default kernel builds.
>
> Transparent libkvm(3) lookups of virtualized variables are provided,
> which permits userland binaries such as netstat to operate unmodified
> on "options VIMAGE" kernels, though this may have some security implications.
>
> Vnets are associated with jails.  In 8.0, every process is associated with
> a jail, usually the default (null) jail, and jails currently hang off of
> a process's ucred.  This relationship defines a process's administrative
> affinity to a vnet and thus indirectly to all of its state. All network
> interfaces and sockets hold pointers back to their associated vnets.
> This relationship is obviously entirely independent from proc->ucred->jail
> bindings.  Hence, when a process opens a socket, the socket will get bound
> to a vnet instance hanging off of proc->ucred->jail->vnet, but once such a
> socket->vnet binding gets established, it cannot be changed for the entire
> socket lifetime.
>
> The mapping from a thread to a vnet should always be done via the
> TD_TO_VNET macro as the path may change in the future as we get more
> experience with using the system.
>
> Certain classes of network interfaces (Ethernet in particular) can be
> reassigned from one vnet to another at any time.  By definition all vnets
> are independent and can communicate only if they are explicitly
> provided with communication paths. Currently mainly netgraph is used to
> establish inter-vnet datapaths, though other paths are being explored
> such as the 'epair' back-to-back virtual interface pair, in which
> the different sides may exist in different jails.
>
> In network traffic processing the vnet affinity is defined either by the
> inbound interface or by the socket / pcb -> vnet binding.  However, there
> are many functions in the network stack that cannot implicitly fetch
> the vnet context from their standard arguments.  Instead of explicitly
> extending argument lists of such functions with a struct vnet *,
> the concept of a "current vnet", a per-thread variable, was introduced,
> which can be fetched efficiently via the curvnet macro.  The correct
> network context has to be set on entry to the network stack (socket
> operations, packet reception, or timer-driven functions) and cleared on exit.
> This must be done via the provided CURVNET_SET() / CURVNET_RESTORE() family of
> macros, which allow for "stacking" of curvnet context setting and provide
> additional debugging info in INVARIANTS kernel configs.  In most cases
> however a developer writing virtualized code will not have to set /
> restore the curvnet context unless the code would include timer-driven
> events, given that those are inherently vnet-contextless on entry.
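
The "stacking" behaviour is easier to see in a toy version. The following is a userland sketch under my own assumptions: CURVNET_SET_SIM(), CURVNET_RESTORE_SIM(), curvnet_sim and struct vnet_sim are invented stand-ins that save and restore a plain pointer, not the kernel macros themselves, and there is no INVARIANTS style checking.

```c
#include <assert.h>
#include <stddef.h>

struct vnet_sim {
	int id;
};

/* Stand-in for curvnet: NULL while outside "network code". */
static struct vnet_sim *curvnet_sim;

/*
 * Simulated set/restore pair. SET saves the previous context in a local,
 * so it has to sit at the end of the declarations in a function.
 */
#define CURVNET_SET_SIM(v)						\
	struct vnet_sim *saved_vnet_sim = curvnet_sim;			\
	curvnet_sim = (v)
#define CURVNET_RESTORE_SIM()						\
	curvnet_sim = saved_vnet_sim

/* Nested entry point: sets its own context and restores the caller's. */
int
inner_entry_sim(struct vnet_sim *vnet)
{
	int seen;

	assert(vnet != NULL);
	CURVNET_SET_SIM(vnet);
	seen = curvnet_sim->id;
	CURVNET_RESTORE_SIM();
	return (seen);
}

/* Outer entry point, e.g. a timer handler that starts with no context. */
int
outer_entry_sim(struct vnet_sim *outer, struct vnet_sim *inner)
{
	int inner_seen, after;

	CURVNET_SET_SIM(outer);
	inner_seen = inner_entry_sim(inner);	/* stacked set */
	after = curvnet_sim->id;		/* outer context is back */
	CURVNET_RESTORE_SIM();
	return (inner_seen * 100 + after);
}

/* Reports whether the context has been cleared again. */
int
context_is_clear_sim(void)
{
	return (curvnet_sim == NULL);
}
```

Each level restores exactly what it found, so the outer function sees its own context again after the nested call and the pointer is back to NULL once the outermost level returns.
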
>
> The current rule is that when not in networking code, the 'curvnet'
> macro will evaluate to NULL and evaluating a V_xxx (or VNET(xxx))
> macro will result in a kernel page-fault error. While this is not strictly
> necessary, it aids in debugging and assurance of program correctness.
> Note this does NOT mean that TD_TO_VNET(curthread) is invalid.
> A thread is always associated with a vnet, but just the efficient
> "curvnet" access method is disabled along with the ability to resolve
> virtualized symbols.
>
>
> Converting / virtualizing existing code
> =======================================
>
> There are several steps needed in virtualisation.
>
> 1/ Decide whether the module needs to be virtualised.
>
>   If the module is a driver for specific hardware, it makes sense that
>   there be only one instance of the driver as there is only one piece of
>   physical hardware.  There are changes in the networking code to allow
>   physical (or virtual) interfaces to be moved between vnets.  This
>   generally requires NO changes to the network drivers of the classes
>   covered (e.g. ethernet). Currently if your module does not have any
>   networking facet, the answer is "no" by default.
>
> 2/ If the module is to be virtualised, decide which attributes of the
>   module should be virtualised.
>
>   For example, it may make sense that there be a single central pool
>   of "struct foo" and a single uma zone for them to come from, with a single
>   lock guarding it. It might also make sense if the "foo_debug" sysctl
>   controls all the instances at once, while on the other hand, the
>   "foo_mode" sysctl might make better sense if it were controllable
>   on a virtual system by virtual system basis.
>
> 3/ Work out what global variables and structures are to be virtualised to
>   achieve the behaviour required for part #2.
>
> 4/ Work out for all the code paths through the module, how the thread entering
>   the module can divine which virtual environment it is on.
>
>   Some examples:
>   * Since interfaces are all assigned to one vnet or another, an incoming
>     packet has a pointer to the receive interface, which in turn has a
>     pointer back to the vnet. Often "curvnet" will already have been set
>     by the time your code is called anyhow.
>   * Similarly, on any request from outside the kernel, (direct or indirect)
>     the current thread has a way to get to the current virtual environment
>     instance via TD_TO_VNET(curthread).  For existing sockets the vnet
>     context must be used via so->so_vnet since the thread's vnet might
>     change after socket creation.
>   * Timer initiated actions usually have a (void *) argument which points to
>     some private structure for the module. It should be possible to add
>     a pointer to the appropriate module instance into whatever structure
>     that points to.
>   * Sometimes an action (timer triggered or triggered by module load or
>     unload) simply has to check all the vimage or module instances.
>     There are macro (pairs) for this which will iterate through all the
>     VNET instances (see sample code below).
>
>   This covers most of the cases, however in some cases it may still be
>   required for the module to stash away the virtual environment instance
>   somewhere, and make associated changes in the code.
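
For the timer case in step 4, this is how I picture stashing the instance in the private structure that the (void *) argument points to. It is a userland sketch only: struct mymod_timer_sim, its fields and the *_sim functions are hypothetical names, and a real module would use the callout API and CURVNET_SET() / CURVNET_RESTORE() instead of the plain pointer swap shown here.

```c
#include <assert.h>
#include <stddef.h>

struct vnet_sim {
	int ticks;		/* a "virtualized" counter */
};

static struct vnet_sim *curvnet_sim;
#define V_ticks_sim	(curvnet_sim->ticks)

/* Hypothetical private structure handed to the timer as its argument. */
struct mymod_timer_sim {
	struct vnet_sim *mt_vnet;	/* instance stashed when armed */
	int mt_period;
};

/* Record the instance at arm time, while the context is still known. */
void
mymod_timer_arm_sim(struct mymod_timer_sim *mt, struct vnet_sim *vnet,
    int period)
{
	mt->mt_vnet = vnet;
	mt->mt_period = period;
}

/* Timer handler: enters with no vnet context and recovers it from arg. */
void
mymod_timer_fire_sim(void *arg)
{
	struct mymod_timer_sim *mt = arg;
	struct vnet_sim *saved = curvnet_sim;

	assert(mt->mt_vnet != NULL);
	curvnet_sim = mt->mt_vnet;
	V_ticks_sim += mt->mt_period;
	curvnet_sim = saved;
}
```

The point is only that the handler never guesses its context: whatever armed the timer knew which instance it was working for and wrote that down.
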
>
> 5/ Decide which parts of the initialization and teardown are per jail and
>   which parts are global, and separate out the code accordingly.
>   Global initialization is done using the SYSINIT facility.
>   Per jail initialization is done using VNET_SYSINIT().
>   Per jail teardown is done using VNET_SYSUNINIT().
>   Global teardown is done using SYSUNINIT().
>   In addition, the modevent handler is called with various event types before
>   any of these are called. The modevent handler may veto load or teardown.
>   On shutdown, only the modevent handler is called so it may have to simulate
>   the calling of the other handlers if clean shutdown is a requirement
>   of your module (see sample code below). Don't forget to unregister
>   event handlers, and destroy locks and condition variables.
>
> 6/ Add the code described below to the files that make up the module.
>
> Details:  (VNET implementation details)
>
> Firstly the file <net/vnet.h> must be included. Depending on what
> code you use you may find you also need one or more of: <sys/proc.h>,
> <sys/ucred.h> and <sys/jail.h>. These requirements may change slightly
> as the ABI settles.
>
> Having decided which variables need to be virtualized, the definition
> of those variables needs to be modified to use the VNET_DEFINE() macro.
> For example:
>
> static int foo = 3;
> struct bar thebar = { 1,2,3 };
>
> would become:
>
> static VNET_DEFINE(int, foo) = 3;
> VNET_DEFINE(struct bar, thebar) = { 1,2,3 };
>
> extern int foo;
> in an include file might become:
> VNET_DECLARE(int, foo);
>
> Normal rules regarding 'static/extern' apply. The initial values that you
> give in this way will be stored and used as the initial values for
> EACH NEW INSTANCE of these variables as new jails/vnets are created.
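
My reading of "initial values for EACH NEW INSTANCE" is that the initializers act as a template which is copied whenever a vnet is created. A userland sketch of that idea follows; struct mymod_vars_sim, mymod_template_sim and mymod_vars_alloc_sim() are hypothetical names, and grouping the variables in one struct with malloc() and memcpy() is my simplification, not how the kernel lays out its per-vnet data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct bar_sim {
	int a, b, c;
};

/* The virtualized variables of a hypothetical module, grouped together. */
struct mymod_vars_sim {
	int foo;
	struct bar_sim thebar;
};

/* Initial values, playing the role of the VNET_DEFINE() initializers. */
static const struct mymod_vars_sim mymod_template_sim = {
	.foo = 3,
	.thebar = { 1, 2, 3 },
};

/* Create the storage for a new instance by copying the template. */
struct mymod_vars_sim *
mymod_vars_alloc_sim(void)
{
	struct mymod_vars_sim *vars;

	vars = malloc(sizeof(*vars));
	if (vars != NULL)
		memcpy(vars, &mymod_template_sim, sizeof(*vars));
	return (vars);
}
```

Changing foo in one instance leaves both the template and every other instance at 3, so an instance created later still starts from the declared initial values.
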
>
> As mentioned above, accesses to virtualized symbols are achieved via macros,
> which generally are of the same name as the original symbol but with a "V_"
> prepended, thus the head of the interface list, called 'ifnet' is replaced
> wherever used with "V_ifnet".  We do this by adding the following
> lines after the definitions above:
>
> #define V_foo			VNET(foo)
> #define V_thebar		VNET(thebar)
>
> --- side-note ---
> In SCTP, because the code is shared with
> other OS's they are replaced with a macro MODULE_GLOBAL(modulename, symbol).
> (this may simplify in light of recent changes).
> --------------
>
> In addition, should any of your values need to be changed  or viewed
> via sysctl, the following SYSCTL definitions would be needed:
>
> SYSCTL_VNET_PROC(_net_inet, OID_AUTO, thebar,
>    CTLTYPE_?? | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(thebar), 0,
>    thebar, "?", "the bar is open");
> {[XXX] robert fix this is possible ^^^}
> SYSCTL_VNET_INT(_net_inet, OID_AUTO, foo,
>    CTLFLAG_RW, &VNET_NAME(foo), 0, "size of foo");
>
>
> In the current version of vimage, when VIMAGE is not compiled into
> the kernel, the macros evaluate to a direct reference to the one and only
> symbol/variable, so that there is no speed penalty for those not using vnets.
>
> When VIMAGE is compiled in, the macro will evaluate to an access to an offset
> into a data structure that is accessed on a per-vnet basis. The vnet
> used for this is always curvnet. For this reason an attempt to access
> such a variable while curvnet is not valid will result in an exception.
>
> To ensure that curvnet has a valid value when needed one needs to
> add the following code on all entry code paths into the networking code:
>
> int
> my_func(int arg)
> {
>        CURVNET_SET(TD_TO_VNET(curthread));
>        do_my_network_stuff(arg);
>        CURVNET_RESTORE();
>        return (0);
> }
>
> The initial value is usually something like "TD_TO_VNET(curthread)",
> which in turn is a macro that derives the vnet affinity from the current
> thread.  It could also be (m->m_pkthdr.rcvif->if_vnet) if we were receiving
> an mbuf, or so->so_vnet if we had a socket involved.
>
> Usually, when a packet enters the system it is carried through the processing
> path via a single thread, and that thread will set its virtual environment
> reference to that indicated by the packet on picking up that new packet.
> This means that in the normal inbound processing path as well as the
> outgoing process path the current thread can be used to indicate the
> current virtual environment and curvnet will always be valid once most
> user supplied code is reached. In timer events, it is sometimes
> necessary to add an "outer loop" to iterate through all the possible vnets
> if there is just one timer for all instances.
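
The "outer loop" for a single shared timer could look roughly like this. Again a userland sketch with invented names (vnet_head_sim, vnet_attach_sim(), mymod_global_timer_sim()); a plain linked list stands in for the kernel's vnet list, and the locking is only indicated by a comment, whereas real code would use the VNET_LIST_RLOCK() / VNET_FOREACH() pattern from the shutdown handler in the quoted sample code.

```c
#include <assert.h>
#include <stddef.h>

struct vnet_sim {
	struct vnet_sim *next;
	int expired;		/* a "virtualized" counter */
};

static struct vnet_sim *vnet_head_sim;	/* list of all instances */
static struct vnet_sim *curvnet_sim;
#define V_expired_sim	(curvnet_sim->expired)

/* Add an instance to the list. */
void
vnet_attach_sim(struct vnet_sim *vnet)
{
	vnet->next = vnet_head_sim;
	vnet_head_sim = vnet;
}

/* Per-instance work; relies on the context set by its caller. */
static void
mymod_expire_sim(void)
{
	assert(curvnet_sim != NULL);
	V_expired_sim++;
}

/* The single timer handler; returns how many instances it visited. */
int
mymod_global_timer_sim(void)
{
	struct vnet_sim *iter;
	int visited = 0;

	/* Real code would hold the vnet list lock around this loop. */
	for (iter = vnet_head_sim; iter != NULL; iter = iter->next) {
		struct vnet_sim *saved = curvnet_sim;

		curvnet_sim = iter;
		mymod_expire_sim();
		curvnet_sim = saved;
		visited++;
	}
	return (visited);
}
```

The per-instance function stays free of any iteration logic, so the same code can also be reached from paths where the context was set in some other way.
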
>
> When a new loadable module is virtualised the module definitions
> and initializers need to be examined. The following example illustrates
> what is needed in the case that you are not loading a new protocol or domain
> (for that see later).
>
> ============= sample skeleton code ==========
>
> /* init on boot or module load */
> static int
> mymod_init(void)
> {
>        /* global, once-only setup goes here */
>        return (0);
> }
>
> /****************
> * Stuff that must be initialized for every instance
> * (including the first of course).
> */
> static int
> mymod_vnet_init(const void *unused)
> {
>        return (0);
> }
>
> /**********************
> * Called for the removal of the last instance only on module unload.
> */
> static void
> mymod_uninit(void)
> {
> }
>
> /***********************
> * Called for the removal of each instance.
> */
> static int
> mymod_vnet_uninit(const void *unused)
> {
>        return (0);
> }
>
> static int
> mymod_modevent(module_t mod, int type, void *unused)
> {
>        int err = 0;
>
>        switch (type) {
>        case MOD_LOAD:
> 		/* check that loading is ok */
>                break;
>
>        case MOD_UNLOAD:
> 		/* check that unloading is ok */
>                break;
>
>        case MOD_QUIESCE:
> 		/* warning: try stop processing */
> 		/* maybe sleep 1 mSec or something to let threads get out */
>                break;
>
>        case MOD_SHUTDOWN:
> 		/*
> 		 * this is called once but you may want to shut down
> 		 * things in each jail, or something global.
> 		 * In that case it's up to us to simulate the SYSUNINIT()
> 		 * or the VNET_SYSUNINIT()
> 		 */
> 		{
> 			VNET_ITERATOR_DECL(vnet_iter);
> 			VNET_LIST_RLOCK();
> 			VNET_FOREACH(vnet_iter) {
> 				CURVNET_SET(vnet_iter);
> 				mymod_vnet_uninit(NULL);
> 				CURVNET_RESTORE();
> 			}
> 			VNET_LIST_RUNLOCK();
> 		}
> 		/* you may need to shutdown something global. */
> 		mymod_uninit();
>                break;
>
>        default:
>                err = EOPNOTSUPP;
>                break;
>        }
>        return err;
> }
>
> static moduledata_t mymodmod = {
>        "mymod",
>        mymod_modevent,
>        0
> };
>
> /* define execution order using constants from /sys/sys/kernel.h */
> #define MYMOD_MAJOR_ORDER      SI_SUB_PROTO_BEGIN         /* for example */
> #define MYMOD_MODULE_ORDER     (SI_ORDER_ANY + 64)        /* not fussy */
> #define MYMOD_SYSINIT_ORDER    (MYMOD_MODULE_ORDER + 1)   /* a bit later */
> #define MYMOD_VNET_ORDER       (MYMOD_MODULE_ORDER + 2)   /* later still */
>
> DECLARE_MODULE(mymod, mymodmod, MYMOD_MAJOR_ORDER, MYMOD_MODULE_ORDER);
> MODULE_DEPEND(mymod, ipfw, 2, 2, 2); /* depend on ipfw version (exactly) 2 */
> MODULE_VERSION(mymod, 1);
>
> SYSINIT(mymod_init, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
>   mymod_init, NULL);
> SYSUNINIT(mymod_uninit, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
>   mymod_uninit, NULL);
>
> VNET_SYSINIT(mymod_vnet_init, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
>   mymod_vnet_init, NULL);
> VNET_SYSUNINIT(mymod_vnet_uninit, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
>   mymod_vnet_uninit, NULL);
>
>
> ========== end sample code =======
>
> On BOOT, the order of evaluation will be:
>  In a NON-VIMAGE kernel where the module is compiled in:
>     MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by their
>     order declarations. {good foot shooting material if you get it wrong!}
>
>  In a VIMAGE kernel where the module is compiled in:
>     MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by their
>     order declarations.  AND in addition, the VNET_SYSINIT is
>     repeated once for every existing or new jail/vnet.
>
> On loading a vnet enabled kernel module after boot:
>      MODEVENT("event = load");
>      SYSINIT()
>      VNET_SYSINIT() for every existing jail
>        AND in addition, VNET_SYSINIT being called for each new jail created.
>
> On unloading of module:
>      MODEVENT("event = MOD_QUIESCE")
>      MODEVENT("event = MOD_UNLOAD")
>      VNET_SYSUNINIT called for every jail/vnet
>      SYSUNINIT
>
> On system shutdown:
>      MODEVENT(shutdown)
>
> NOTICE that while the order of the SYSINIT and VNET_SYSINIT is reversed from
> that of SYSUNINIT and VNET_SYSUNINIT, MODEVENTS do not follow
> this rule and thus it is dangerous to initialise and uninitialise
> things which are order dependent using MODEVENTs.
>
> Or, put another way,
> Since MODEVENT is called first during module load, it would, by the
> assumption that everything is reversed, be easy to assume that MODEVENT
> is called AFTER the SYSUNINITs during unload.  This is in fact not
> the case. (and I have the scars to prove it).
>
> It might make some sense if the "QUIESCE" was called before the
> SYSINIT/SYSUNINIT and the UNLOAD called after, with a millisecond
> sleep between them, but this is not the case either.
>
> Since initial values are copied into the virtualized variables
> on each new instantiation, it is quite possible to have modules for which
> some of the above methods are not needed, and they may be left out
> (but not the modevent).
>
> Sometimes there is a need to iterate through the vnets.
> See the modevent shutdown handler (above) for an example of how to do this.
> Don't forget the locks.
>
> In the case where you are loading a new protocol, or domain (protocol family)
> there are some "shortcuts" that are in place to allow you to maintain a bit
> more source compatibility with older revisions of FreeBSD. It must be
> added that the sample code above works just fine for protocols, however
> protocols also have an additional initialization vector which is via the
> protocol structure, which has a pr_init() entry.
> When a protocol is registered using pf_proto_register(), the pr_init()
> for the protocol is called once for every existing vnet. In addition,
> it will be called for each new vnet. The pr_destroy() method will be called
> as well on vnet teardown. The pf_proto_register() function can be called
> either from a modevent handler or from the SYSINIT() if you have one, and
> the pf_proto_unregister() called from the SYSUNINIT or the unload
> modevent handler.
>
> If you are adding a whole new protocol domain, (protocol family) then
> you should add the VNET_DOMAIN_SET(domainname) (e.g. inet, inet6)
> macro. These use VNET_SYSINIT internally to indirectly call the
> dom_init() and pr_init() functions for each vnet (and the equivalent for
> teardown).  In this case one needs to be absolutely sure that both your
> domain and protocol initializers can be called multiple times, once for
> each vnet. One can still add SYSINITs for once only initialization,
> or use the modevent handler. I prefer to do as much explicitly
> in the SYSINITS and VNET_SYSINITS as then you have no surprises.
>
> Finally:
> The command to make a new jail with a new vnet:
> jail -c host.hostname=test path=/ vnet command=/bin/tcsh
> jail -c host.hostname=test path=/ children.max=4 vnet command=/bin/tcsh
> (children.max allows hierarchical jail creation).
> Note that the command must come last.
>
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"



