From owner-freebsd-net@FreeBSD.ORG Fri Aug 21 12:42:04 2009
From: Fabien Thomas
To: Julian Elischer
Cc: FreeBSD Net
Reply-To: fabient@freebsd.org
Subject: Re: pf and vimage
Date: Fri, 21 Aug 2009 14:10:35 +0200

Thanks, very useful!
Do you have an "official" page to look for updates?
What do you think of putting it on the FreeBSD Wiki?

Fabien

On 20 Aug 09 at 18:17, Julian Elischer wrote:

> there were some people looking at adding vnet support to pf.
> Since we discussed it last, the rules of the game have
> significantly changed for the better.
> With the addition of some new facilities in FreeBSD, the work needed
> to virtualize a module has significantly decreased.
>
>
> The following doc gives the new rules.
>
>
> August 17 2009
> Julian Elischer
>
> ===================
> Vimage: what is it?
> ===================
>
> Vimage is a framework in the BSD kernel which allows a co-operating
> module to operate on multiple independent instances of its state so
> that it can participate in a virtual machine / virtual environment
> scenario. It refers to a part of the Jail infrastructure in FreeBSD.
> For historical reasons "Virtual network stack enabled jails"(1) are
> also known as "vimage enabled jails"(2) or "vnet enabled jails"(3).
> The currently correct term is the last of these, which is a
> contraction of the first. In the future other parts of the system may
> be virtualized using the same technology and the term to cover all
> such components would be VIMAGE enhanced modules.
>
> The implementation approach taken by the vimage framework is a
> redefinition of selected global state variables to evaluate to
> constructs that allow for the virtualized state to be stored and
> resolved in appropriate instances of 'jail' specific container
> storage regions. The code operating on virtualized state has to
> conform to a set of rules described further below, among other
> things so that all the changes are conditionally compilable, i.e.
> the virtualized code can fall back to operating on global state.
>
> The rest of this document will discuss NETWORK virtualization,
> though the concepts may be true in the future for other parts of the
> system.
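
A small userland sketch may help picture the "conditionally compilable"
redefinition described above. It is not kernel code: struct model_vnet,
model_cur and the MODEL_VIMAGE switch are names invented here for
illustration, and the kernel's real container layout is different. The
point is only that code written against the V_ name builds unchanged
with or without the switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-instance container; the kernel's real layout differs. */
struct model_vnet {
	int _foo_bar;
};

int foo_bar = 3;			/* the ordinary global */
struct model_vnet *model_cur = NULL;	/* stand-in for "the current instance" */

#ifdef MODEL_VIMAGE
/* Virtualized build: resolve through the current instance. */
#define V_foo_bar	(model_cur->_foo_bar)
#else
/* Default build: the macro is just another name for the global. */
#define V_foo_bar	foo_bar
#endif

/* Code written against V_foo_bar compiles unchanged in both builds. */
int
bump_foo_bar(void)
{
	return (++V_foo_bar);
}
```

Built without -DMODEL_VIMAGE, bump_foo_bar() simply increments the
global; built with it, model_cur must point at an instance first.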
>
> The most visible change throughout the existing code is typically
> replacement of direct references to global variables with macros;
> foo_bar thus becomes V_foo_bar. V_foo_bar macros will resolve back to
> the foo_bar global in default kernel builds, and alternatively to the
> logical equivalent of some_base_pointer->_foo_bar for
> "options VIMAGE" kernel configs.
>
> Prepending of "V_" prefixes to variable references helps in
> visual discrimination between global and virtualized state.
> It is also possible to use an alternative syntax, VNET(foo_bar), to
> achieve the same thing. The developers felt that V_foo_bar was less
> visually distracting while still providing enough clues to the reader
> that the variable is virtualized. In fact the V_foo_bar macro is
> locally defined near the definition of foo_bar to be an alias for
> VNET(foo_bar) so the two are not only equivalent, they are the same.
>
> The framework also extends the sysctl infrastructure to support
> access to virtualized state through introduction of the SYSCTL_VNET
> family of macros; those also automatically fall back to their
> standard SYSCTL counterparts in default kernel builds.
>
> Transparent libkvm(3) lookups are provided to virtualized variables,
> which permits userland binaries such as netstat to operate unmodified
> on "options VIMAGE" kernels, though this may have some security
> implications.
>
> Vnets are associated with jails. In 8.0, every process is associated
> with a jail, usually the default (null) jail, and jails currently
> hang off of a process's ucred. This relationship defines a process's
> administrative affinity to a vnet and thus indirectly to all of its
> state. All network interfaces and sockets hold pointers back to their
> associated vnets. This relationship is obviously entirely independent
> from proc->ucred->jail bindings.
> Hence, when a process opens a socket, the socket will get bound to a
> vnet instance hanging off of proc->ucred->jail->vnet, but once such a
> socket->vnet binding gets established, it cannot be changed for the
> entire socket lifetime.
>
> The mapping from a thread to a vnet should always be done via the
> TD_TO_VNET macro as the path may change in the future as we get more
> experience with using the system.
>
> Certain classes of network interfaces (Ethernet in particular) can be
> reassigned from one vnet to another at any time. By definition all
> vnets are independent and can communicate only if they are explicitly
> provided with communication paths. Currently mainly netgraph is used
> to establish inter-vnet datapaths, though other paths are being
> explored, such as the 'epair' back-to-back virtual interface pair, in
> which the different sides may exist in different jails.
>
> In network traffic processing the vnet affinity is defined either by
> the inbound interface or by the socket / pcb -> vnet binding. However,
> there are many functions in the network stack that cannot implicitly
> fetch the vnet context from their standard arguments. Instead of
> explicitly extending argument lists of such functions with a
> struct vnet *, the concept of a "current vnet", a per-thread variable,
> was introduced, which can be fetched efficiently via the curvnet
> macro. The correct network context has to be set on entry to the
> network stack (socket operations, packet reception, or timer-driven
> functions) and cleared on exit. This must be done via the provided
> CURVNET_SET() / CURVNET_RESTORE() family of macros, which allow for
> "stacking" of curvnet context setting and provide additional
> debugging info in INVARIANTS kernel configs.
> In most cases, however, a developer writing virtualized code will not
> have to set / restore the curvnet context unless the code includes
> timer-driven events, given that those are inherently
> vnet-contextless on entry.
>
> The current rule is that when not in networking code, the 'curvnet'
> macro will return NULL and evaluating a V_xxx (or VNET(xxx)) macro
> will result in a kernel page-fault error. While this is not strictly
> necessary, it aids in debugging and assurance of program correctness.
> Note this does NOT mean that TD_TO_VNET(curthread) is invalid.
> A thread is always associated with a vnet; only the efficient
> "curvnet" access method is disabled, along with the ability to
> resolve virtualized symbols.
>
>
> Converting / virtualizing existing code
> =======================================
>
> There are several steps needed in virtualisation.
>
> 1/ Decide whether the module needs to be virtualised.
>
>    If the module is a driver for specific hardware, it makes sense
>    that there be only one instance of the driver as there is only one
>    piece of physical hardware. There are changes in the networking
>    code to allow physical (or virtual) interfaces to be moved between
>    vnets. This generally requires NO changes to the network drivers of
>    the classes covered (e.g. ethernet). Currently if your module does
>    not have any networking facet, the answer is "no" by default.
>
> 2/ If the module is to be virtualised, decide which attributes of the
>    module should be virtualised.
>
>    For example, it may make sense that there be a single central pool
>    of "struct foo" and a single uma zone for them to come from, with
>    a single lock guarding it.
>    It might also make sense if the "foo_debug" sysctl controls all
>    the instances at once, while on the other hand, the "foo_mode"
>    sysctl might make better sense if it were controllable on a
>    virtual system by virtual system basis.
>
> 3/ Work out what global variables and structures are to be
>    virtualised to achieve the behaviour required for part #2.
>
> 4/ Work out, for all the code paths through the module, how the
>    thread entering the module can divine which virtual environment
>    it is on.
>
>    Some examples:
>    * Since interfaces are all assigned to one vnet or another, an
>      incoming packet has a pointer to the receive interface, which in
>      turn has a pointer back to the vnet. Often "curvnet" will already
>      have been set by the time your code is called anyhow.
>    * Similarly, on any request from outside the kernel (direct or
>      indirect), the current thread has a way to get to the current
>      virtual environment instance via TD_TO_VNET(curthread). For
>      existing sockets the vnet context must be obtained via
>      so->so_vnet since the thread's vnet might change after socket
>      creation.
>    * Timer initiated actions usually have a (void *) argument which
>      points to some private structure for the module. It should be
>      possible to add a pointer to the appropriate module instance
>      into whatever structure that points to.
>    * Sometimes an action (timer triggered, or triggered by module
>      load or unload) simply has to check all the vimage or module
>      instances. There are macro (pairs) for this which will iterate
>      through all the VNET instances (see sample code below).
>
>    This covers most of the cases; however, in some cases it may still
>    be required for the module to stash away the virtual environment
>    instance somewhere, and make associated changes in the code.
>
> 5/ Decide which parts of the initialization and teardown are per jail
>    and which parts are global, and separate out the code accordingly.
>    Global initialization is done using the SYSINIT facility.
>    Per jail initialization is done using VNET_SYSINIT().
>    Per jail teardown is done using VNET_SYSUNINIT().
>    Global teardown is done using SYSUNINIT().
>    In addition, the modevent handler is called with various event
>    types before any of these are called. The modevent handler may
>    veto load or teardown. On shutdown, only the modevent handler is
>    called, so it may have to simulate the calling of the other
>    handlers if clean shutdown is a requirement of your module (see
>    sample code below). Don't forget to unregister event handlers, and
>    destroy locks and condition variables.
>
> 6/ Add the code described below to the files that make up the module.
>
> Details: (VNET implementation details)
>
> Firstly the file must be included. Depending on what
> code you use you may find you also need one or more of: ,
> and . These requirements may change slightly
> as the ABI settles.
>
> Having decided which variables need to be virtualized, the definition
> of those variables needs to be modified to use the VNET_DEFINE() macro.
> For example:
>
>   static int foo = 3;
>   struct bar thebar = { 1,2,3 };
>
> would become:
>
>   static VNET_DEFINE(int, foo) = 3;
>   VNET_DEFINE(struct bar, thebar) = { 1,2,3 };
>
>   extern int foo;
> in an include file might become:
>   VNET_DECLARE(int, foo);
>
> Normal rules regarding 'static/extern' apply. The initial values that
> you give in this way will be stored and used as the initial values
> for EACH NEW INSTANCE of these variables as new jails/vnets are
> created.
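
The last point, initial values being reused for every new instance,
can be modelled in a few lines of userland C. This is only a sketch
under stated assumptions: model_defaults plays the role of the
initialised template, and struct model_vnet_data and
model_vnet_alloc() are hypothetical names; the kernel's actual
allocation code is different.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct bar {
	int a, b, c;
};

/* All "virtualized" variables of the module, gathered in one region. */
struct model_vnet_data {
	int foo;
	struct bar thebar;
};

/* The template: initial values as written at the definition site. */
static const struct model_vnet_data model_defaults = {
	.foo = 3,
	.thebar = { 1, 2, 3 },
};

/* Hypothetical helper: every new instance starts as a copy of the template. */
struct model_vnet_data *
model_vnet_alloc(void)
{
	struct model_vnet_data *d = malloc(sizeof(*d));

	if (d != NULL)
		memcpy(d, &model_defaults, sizeof(*d));
	return (d);
}
```

Changing foo in one instance leaves the template, and therefore any
instance created later, untouched.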
>
> As mentioned above, accesses to virtualized symbols are achieved via
> macros, which generally are of the same name as the original symbol
> but with a "V_" prepended; thus the head of the interface list, called
> 'ifnet', is replaced wherever used with "V_ifnet". We do this by
> adding the following lines after the definitions above:
>
>   #define V_foo VNET(foo)
>   #define V_thebar VNET(thebar)
>
> --- side-note ---
> In SCTP, because the code is shared with other OS's, they are
> replaced with a macro MODULE_GLOBAL(modulename, symbol).
> (this may simplify in light of recent changes).
> --------------
>
> In addition, should any of your values need to be changed or viewed
> via sysctl, the following SYSCTL definitions would be needed:
>
>   SYSCTL_VNET_PROC(_net_inet, OID_AUTO, thebar,
>       CTLTYPE_?? | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(thebar), 0,
>       thebar, "?", "the bar is open");
>   {[XXX] robert fix this if possible ^^^}
>   SYSCTL_VNET_INT(_net_inet, OID_AUTO, foo,
>       CTLFLAG_RW, &VNET_NAME(foo), 0, "size of foo");
>
>
> In the current version of vimage, when VIMAGE is not compiled into
> the kernel, the macros evaluate to a direct reference to the one and
> only symbol/variable, so that there is no speed penalty for those not
> using vnets.
>
> When VIMAGE is compiled in, the macro will evaluate to an access to an
> offset into a data structure that is accessed on a per-vnet basis.
> The vnet used for this is always curvnet. For this reason an attempt
> to access such a variable while curvnet is not valid will result in
> an exception.
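
The "offset into a per-vnet data structure" idea can be sketched in
userland C as follows. Everything prefixed model_ or MODEL_ is invented
for illustration and restricted to int variables to keep it short; the
real VNET() macro and struct vnet differ in detail. The assert stands
in for the exception raised when curvnet is not valid.

```c
#include <assert.h>
#include <stddef.h>

/* Layout of the virtualized variables; offsets are taken from this. */
struct model_layout {
	int foo;
	int nbars;
};

/* Stand-in for struct vnet: each instance owns a private data region. */
struct model_vnet {
	struct model_layout data;
};

/* Stand-in for the per-thread "current vnet"; NULL outside network code. */
struct model_vnet *model_curvnet = NULL;

/* Resolve a variable as base-of-current-instance plus its offset. */
void *
model_resolve(size_t offset)
{
	assert(model_curvnet != NULL);	/* models the fault on a NULL curvnet */
	return ((char *)&model_curvnet->data + offset);
}

#define MODEL_VNET_INT(name) \
	(*(int *)model_resolve(offsetof(struct model_layout, name)))
#define V_foo	MODEL_VNET_INT(foo)

/* Module code only ever mentions V_foo, never an instance pointer. */
int
foo_increment(void)
{
	return (++V_foo);
}
```

Which instance foo_increment() touches depends solely on what
model_curvnet points to at the time of the call.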
>
> To ensure that curvnet has a valid value when needed one needs to
> add the following code on all entry code paths into the networking
> code:
>
>   int
>   my_func(int arg)
>   {
>           CURVNET_SET(TD_TO_VNET(curthread));
>           do_my_network_stuff(arg);
>           CURVNET_RESTORE();
>           return (0);
>   }
>
> The initial value is usually something like "TD_TO_VNET(curthread)",
> which in turn is a macro that derives the vnet affinity from the
> current thread. It could also be (m->m_ifp->if_vnet) if we were
> receiving an mbuf, or so->so_vnet if we had a socket involved.
>
> Usually, when a packet enters the system it is carried through the
> processing path via a single thread, and that thread will set its
> virtual environment reference to that indicated by the packet on
> picking up that new packet. This means that in the normal inbound
> processing path as well as the outgoing process path the current
> thread can be used to indicate the current virtual environment and
> curvnet will always be valid once most user supplied code is reached.
> In timer events, it is sometimes necessary to add an "outer loop" to
> iterate through all the possible vnets if there is just one timer for
> all instances.
>
> When a new loadable module is virtualised the module definitions
> and initializers need to be examined. The following example
> illustrates what is needed in the case that you are not loading a new
> protocol or domain (for that see later).
>
> ============= sample skeleton code ==========
>
> /* init on boot or module load */
> static int
> mymod_init(void)
> {
>         int error = 0;
>
>         return (error);
> }
>
> /****************
>  * Stuff that must be initialized for every instance
>  * (including the first of course).
>  */
> static int
> mymod_vnet_init(const void *unused)
> {
>         return (0);
> }
>
> /**********************
>  * Called for the removal of the last instance only on module unload.
>  */
> static void
> mymod_uninit(void)
> {
> }
>
> /***********************
>  * Called for the removal of each instance.
>  */
> static int
> mymod_vnet_uninit(const void *unused)
> {
>         return (0);
> }
>
> static int
> mymod_modevent(module_t mod, int type, void *unused)
> {
>         int err = 0;
>
>         switch (type) {
>         case MOD_LOAD:
>                 /* check that loading is ok */
>                 break;
>
>         case MOD_UNLOAD:
>                 /* check that unloading is ok */
>                 break;
>
>         case MOD_QUIESCE:
>                 /* warning: try stop processing */
>                 /* maybe sleep 1 mSec or something to let threads get out */
>                 break;
>
>         case MOD_SHUTDOWN:
>                 /*
>                  * this is called once but you may want to shut down
>                  * things in each jail, or something global.
>                  * In that case it's up to us to simulate the SYSUNINIT()
>                  * or the VNET_SYSUNINIT()
>                  */
>                 {
>                         VNET_ITERATOR_DECL(vnet_iter);
>                         VNET_LIST_RLOCK();
>                         VNET_FOREACH(vnet_iter) {
>                                 CURVNET_SET(vnet_iter);
>                                 mymod_vnet_uninit(NULL);
>                                 CURVNET_RESTORE();
>                         }
>                         VNET_LIST_RUNLOCK();
>                 }
>                 /* you may need to shutdown something global.
>                  */
>                 mymod_uninit();
>                 break;
>
>         default:
>                 err = EOPNOTSUPP;
>                 break;
>         }
>         return (err);
> }
>
> static moduledata_t mymodmod = {
>         "mymod",
>         mymod_modevent,
>         0
> };
>
> /* define execution order using constants from /sys/sys/kernel.h */
> #define MYMOD_MAJOR_ORDER    SI_SUB_PROTO_BEGIN        /* for example */
> #define MYMOD_MODULE_ORDER   (SI_ORDER_ANY + 64)       /* not fussy */
> #define MYMOD_SYSINIT_ORDER  (MYMOD_MODULE_ORDER + 1)  /* a bit later */
> #define MYMOD_VNET_ORDER     (MYMOD_MODULE_ORDER + 2)  /* later still */
>
> DECLARE_MODULE(mymod, mymodmod, MYMOD_MAJOR_ORDER, MYMOD_MODULE_ORDER);
> MODULE_DEPEND(mymod, ipfw, 2, 2, 2); /* depend on ipfw version (exactly) 2 */
> MODULE_VERSION(mymod, 1);
>
> SYSINIT(mymod_init, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
>     mymod_init, NULL);
> SYSUNINIT(mymod_uninit, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
>     mymod_uninit, NULL);
>
> VNET_SYSINIT(mymod_vnet_init, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
>     mymod_vnet_init, NULL);
> VNET_SYSUNINIT(mymod_vnet_uninit, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
>     mymod_vnet_uninit, NULL);
>
>
> ========== end sample code =======
>
> On BOOT, the order of evaluation will be:
>   In a NON-VIMAGE kernel where the module is compiled in:
>     MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by
>     their order declarations. {good foot shooting material if you get
>     it wrong!}
>
>   In a VIMAGE kernel where the module is compiled in:
>     MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by
>     their order declarations, AND in addition, the VNET_SYSINIT is
>     repeated once for every existing or new jail/vnet.
>
> On loading a vnet enabled kernel module after boot:
>     MODEVENT("event = load");
>     SYSINIT()
>     VNET_SYSINIT() for every existing jail
>     AND in addition, VNET_SYSINIT being called for each new jail
>     created.
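
To see what "order defined by their order declarations" amounts to,
here is a userland model, assuming (as the major/minor order constants
in the sample suggest) that entries are sequenced by subsystem first
and by order within the subsystem second. struct model_sysinit and
model_sysinit_sort() are made-up names, and the numbers used with them
are placeholders, not the real SI_SUB_* / SI_ORDER_* values.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One registered init entry, reduced to what matters for ordering. */
struct model_sysinit {
	unsigned int subsystem;	/* major order */
	unsigned int order;	/* order within the subsystem */
	const char *name;
};

/* Compare by subsystem first, then by order within it. */
static int
model_sysinit_cmp(const void *pa, const void *pb)
{
	const struct model_sysinit *a = pa, *b = pb;

	if (a->subsystem != b->subsystem)
		return (a->subsystem < b->subsystem ? -1 : 1);
	if (a->order != b->order)
		return (a->order < b->order ? -1 : 1);
	return (0);
}

/* Sort the table into the sequence in which the entries would run. */
void
model_sysinit_sort(struct model_sysinit *tab, size_t n)
{
	qsort(tab, n, sizeof(tab[0]), model_sysinit_cmp);
}
```

Feeding it three entries in the same subsystem with orders that mirror
MYMOD_MODULE_ORDER, MYMOD_SYSINIT_ORDER and MYMOD_VNET_ORDER puts the
module event first, the SYSINIT second and the VNET_SYSINIT last,
whatever order they were registered in.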
>
> On unloading of module:
>     MODEVENT("event = MOD_QUIESCE")
>     MODEVENT("event = MOD_UNLOAD")
>     VNET_SYSUNINIT called for every jail/vnet
>     SYSUNINIT
>
> On system shutdown:
>     MODEVENT(shutdown)
>
> NOTICE that while the order of the SYSINIT and VNET_SYSINIT is
> reversed from that of SYSUNINIT and VNET_SYSUNINIT, MODEVENTS do not
> follow this rule and thus it is dangerous to initialise and
> uninitialise things which are order dependent using MODEVENTs.
>
> Or, put another way:
> since MODEVENT is called first during module load, it would, by the
> assumption that everything is reversed, be easy to assume that
> MODEVENT is called AFTER the SYSUNINITS during unload. This is in fact
> not the case. (and I have the scars to prove it).
>
> It might make some sense if the "QUIESCE" was called before the
> SYSINIT/SYSUNINIT and the UNLOAD called after, with a millisecond
> sleep between them, but this is not the case either.
>
> Since initial values are copied into the virtualized variables
> on each new instantiation, it is quite possible to have modules for
> which some of the above methods are not needed, and they may be left
> out (but not the modevent).
>
> Sometimes there is a need to iterate through the vnets.
> See the modevent shutdown handler (above) for an example of how to do
> this. Don't forget the locks.
>
> In the case where you are loading a new protocol, or domain (protocol
> family), there are some "shortcuts" that are in place to allow you to
> maintain a bit more source compatibility with older revisions of
> FreeBSD. It must be added that the sample code above works just fine
> for protocols; however, protocols also have an additional
> initialization vector which is via the protocol structure, which has
> a pr_init() entry. When a protocol is registered using
> pf_proto_register(), the pr_init() for the protocol is called once
> for every existing vnet. In addition, it will be called for each new
> vnet.
> The pr_destroy() method will be called as well on vnet teardown. The
> pf_proto_register() function can be called either from a modevent
> handler or from the SYSINIT() if you have one, and the
> pf_proto_unregister() called from the SYSUNINIT or the unload
> modevent handler.
>
> If you are adding a whole new protocol domain (protocol family), then
> you should add the VNET_DOMAIN_SET(domainname) (e.g. inet, inet6)
> macro. These use VNET_SYSINIT internally to indirectly call the
> dom_init() and pr_init() functions for each vnet (and the equivalent
> for teardown). In this case one needs to be absolutely sure that both
> your domain and protocol initializers can be called multiple times,
> once for each vnet. One can still add SYSINITs for once only
> initialization, or use the modevent handler. I prefer to do as much
> as possible explicitly in the SYSINITS and VNET_SYSINITS as then you
> have no surprises.
>
> finally:
> The command to make a new jail with a new vnet:
>   jail -c host.hostname=test path=/ vnet command=/bin/tcsh
>   jail -c host.hostname=test path=/ children.max=4 vnet command=/bin/tcsh
> (children.max allows hierarchical jail creation).
> Note that the command must come last.
>
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"