From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 15:15:05 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 50FC216A419 for ; Fri, 28 Dec 2007 15:15:05 +0000 (UTC) (envelope-from ivo.vachkov@gmail.com) Received: from wx-out-0506.google.com (wx-out-0506.google.com [66.249.82.236]) by mx1.freebsd.org (Postfix) with ESMTP id BAD5913C44B for ; Fri, 28 Dec 2007 15:15:04 +0000 (UTC) (envelope-from ivo.vachkov@gmail.com) Received: by wx-out-0506.google.com with SMTP id i29so990613wxd.7 for ; Fri, 28 Dec 2007 07:15:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=Xgce59oOh55fxeTG2J6YoIG8nu/AgUUjXBJ/AWMokiQ=; b=fzgYaAoedMzuUszJkE+GNfe/ea4DqL7/0IEoBi5AP0yY3yaII2mMzfLKQU5nYMg36cPsoN4mmbrrPfFAL24SH2o0PfDd4ZRdjLR9Ro1YHMlLm4ooQQKe8sBU9shqfzVISxuZ7QkSZYosPB5cWncR+GKj1iONsVSiFTLHjvKB7Zw= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=h+hJojK1RXsDr/WZwnFxT9GJjnvQLycMaqF82K6zCN/ip4cM35z2JSdhkaVJnnZe1cB+7Xs5Ea5MwZVvyiB1RWwedWe8TTQbY8+wRUCNK9hqVY+i0poM+MKuM5KszSeKGeBgDOaUow4eVcuAiCPfiepyUaNhmAnFp+5+1jB5mu0= Received: by 10.150.197.8 with SMTP id u8mr2605151ybf.131.1198854903741; Fri, 28 Dec 2007 07:15:03 -0800 (PST) Received: by 10.150.219.5 with HTTP; Fri, 28 Dec 2007 07:15:03 -0800 (PST) Message-ID: Date: Fri, 28 Dec 2007 17:15:03 +0200 From: "Ivo Vachkov" To: "Julian Elischer" In-Reply-To: <477416CC.4090906@elischer.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <4772F123.5030303@elischer.org> <477416CC.4090906@elischer.org> Cc: FreeBSD Net , Robert Watson , Qing Li , arch@freebsd.org Subject: Re: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 15:15:05 -0000 On Dec 27, 2007 11:19 PM, Julian Elischer wrote: > > Ivo Vachkov wrote: > > On Dec 27, 2007 2:26 AM, Julian Elischer wrote: > >> Resending as my mailer made a dog's breakfast of the first one > >> with all sorts of wierd line breaks... hopefully this will be better. > >> (I haven't sent it yet so I'm hoping).. > >> > >> > >> ------------------------------------------- > >> > >> > >> > >> On thing where FreeBSD has been falling behind, and which by chance I > >> have some time to work on is "policy based routing", which allows > >> different > >> packet streams to be routed by more than just the destination address. > >> > >> Constraints: > >> ------------ > >> > >> I want to make some form of this available in the 6.x tree > >> (and by extension 7.x) , but FreeBSD in general needs it so I might as > >> well > >> do it in -current and back port the portions I need. > >> > >> One of the ways that this can be done is to have the ability to > >> instantiate multiple kernel routing tables (which I will now > >> refer to as "Forwarding Information Bases" or "FIBs" for political > >> correctness reasons. Which FIB a particular packet uses to make > >> the next hop decision can be decided by a number of mechanisms. > >> The policies these mechanisms implement are the "Policies" referred > >> to in "Policy based routing". > >> > >> One of the constraints I have if I try to back port this work to > >> 6.x is that it must be implemented as a EXTENSION to the existing > >> ABIs in 6.x so that third party applications do not need to be > >> recompiled in timespan of the branch. > >> > >> Implementation method, (part 1) > >> ------------------------------- > >> For this reason I have implemented a "sufficient subset" of a > >> multiple routing table solution in Perforce, and back-ported it > >> to 6.x. (also in Perforce though not yet caught up with what I > >> have done in -current/P4). The subset allows a number of FIBs > >> to be defined at compile time (sufficient for my purposes in 6.x) and > >> implements the changes needed to allow IPV4 to use them. I have not done > >> the changes for ipv6 simply because I do not need it, and I do not > >> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. > > By the way, I might add that in the 6.x compat. version I may end up > limiting the feature to 8 tables. This is because I need to store some > stuff in an efficient way in the mbuf, and in a compatible manner this > is easiest done by stealing the top 4 bits in the mbuf dlags word > and defining them as: > > #define M_HAVEFIB 0x10000000 > #define M_FIBMASK 0x07 > #define M_FIBNUM 0xe0000000 > #define M_FIBSHIFT 29 > #define m_getfib(_m, _default) ((m->m_flags & M_HAVE_FIBNUM) ? > ((m->m_flags >> M_FIBSHIFT) & M_FIBMASK) : _default) > #M_SETFIB(_m, _fib) do { \ > _m->m_flags &= ~M_FIBNUM; \ > _m->m_flags |= (M_HAVEFIB|((_fib & M_FIBMASK) << M_FIBSHIFT));\ > } while (0) > > This then becomes very easy to change to use a tag or > whatever is needed in later versions , and the number can > be expanded past 8 predefined FIBs at that time.. > > >> > >> Other protocol families are left untouched and should there be > >> users with proprietary protocol families, they should continue to work > >> and be oblivious to the existence of the extra FIBs. > >> > >> To understand how this is done, one must know that the current FIB > >> code starts everything off with a single dimensional array of > >> pointers to FIB head structures (One per protocol family), each of > >> which in turn points to the trie of routes available to that family. > >> > >> The basic change in the ABI compatible version of the change is to > >> extent that array to be a 2 dimensional array, so that > >> instead of protocol family X looking at rt_tables[X] for the > >> table it needs, it looks at rt_tables[Y][X] when for all > >> protocol families except ipv4 Y is always 0. > >> Code that is unaware of the change always just sees the first row > >> of the table, which of course looks just like the one dimensional > >> array that existed before. > > > > Pretty much like the OpenBSD approach :) > > well, I did look at the code briefly, but I didn't base it on it.. > > > > > >> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() > >> are all maintained, but refer only to the first row of the array, > >> so that existing callers in proprietary protocols can continue to > >> do the "right thing". > >> Some new entry points are added, for the exclusive use of ipv4 code > >> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), > >> which have an extra argument which refers the code to the correct row. > >> > >> In addition, there are some new entry points (currently called > >> dom_rtalloc() and friends) that check the Address family being > >> looked up and call either rtalloc() (and friends) if the protocol > >> is not IPv4 forcing the action to row 0 or to the appropriate row > >> if it IS IPv4 (and that info is available). These are for calling > >> from code that is not specific to any particular protocol. The way > >> these are implemented would change in the non ABI preserving code > >> to be added later. > >> > >> One feature of the first version of the code is that for ipv4, > >> the interface routes show up automatically on all the FIBs, so > >> that no matter what FIB you select you always have the basic > >> direct attached hosts available to you. (rtinit() does this > >> automatically). > >> You CAN delete an interface route from one FIB should you want > >> to but by default it's there. ARP information is also available > >> in each FIB. It's assumed that the same machine would have the > >> same MAC address, regardless of which FIB you are using to get > >> to it. > >> > >> > >> This brings us as to how the correct FIB is selected for an outgoing > >> IPV4 packet. > >> > >> Packets fall into one of a number of classes. > >> 1/ locally generated packets, coming from a socket/PCB. > >> Such packets select a FIB from a number associated with the > >> socket/PCB. This in turn is inherited from the process, > >> but can be changed by a socket option. The process in turn > >> inherits it on fork. I have written a utility call setfib > >> that acts a bit like nice.. > >> > >> setfib -n 3 ping target.example.com # will use fib 3 for ping. > >> > >> 2/ packets received on an interface for forwarding. > >> By default these packets would use table 0, > >> (or possibly a number settable in a sysctl(not yet)). > >> but prior to routing the firewall can inspect them (see below). > >> > >> 3/ packets inspected by a packet classifier, which can arbitrarily > >> associate a fib with it on a packet by packet basis. > >> A fib assigned to a packet by a packet classifier > >> (such as ipfw) would over-ride a fib associated by > >> a more default source. (such as cases 1 or 2). > > > > For the 2/ and 3/ cases I added (in a personal work i've been doing > > lately) additional field in struct mbuf which can be set by a packet > > filter or other application upon receiving which points the right > > table to use for the lookup. This way a simple "marking" can be used > > to divide different flows and create policy based routing. > > This would be the final way but I want to really minimise problems > in the compat versions, so I'll avoid doing that for now. > > Do you have this work available? I have it. However, I'll break a NDA if I 'open' it. > And have you looked at mi diffs below? I plan to look at your code asap. > > > > >> Routing messages would be associated with their > >> process, and thus select one FIB or another. > >> > >> In addition Netstat has been edited to be able to cope with the > >> fact that the array is now 2 dimensional. (It looks in system > >> memory using libkvm (!)). > >> > >> In addition two sysctls are added to give: > >> a) the number of FIBs compiled in (active) > >> b) the default FIB of the calling process. > >> > >> Early testing experience: > >> ------------------------- > >> > >> Basically our (IronPort's) appliance does this functionality already > >> using ipfw fwd but that method has some drawbacks. > >> > >> For example, > >> It can't fully simulate a routing table because it can't influence the > >> socket's choice of local address when a connect() is done. > >> > >> > >> Testing during the generating of these changes has been > >> remarkably smooth so far. Multiple tables have co-existed > >> with no notable side effects, and packets have been routes > >> accordingly. > >> > >> I have not yet added the changes to ipfw. > >> pf has some similar changes already but they seem to rely on > >> the various FIBs having symbolic names. Which I do not plan to support > >> in the first version of these changes. > >> > >> SCTP has interestingly enough built in support for this, called VRFs > >> in Cisco parlance. it will be interesting to see how that handles it > >> when it suddenly actually does something. > >> > >> I have not redone my testing since my last edits, but will be > >> retesting with the current code asap. > >> > >> > >> Where to next: > >> -------------------- > >> > >> After committing the ABI compatible version and MFCing it, I'd > >> like to proceed in a forward direction in -current. this will > >> result in some roto-tilling in the routing code. > >> > >> Firstly: the current code's idea of having a separate tree per > >> protocol family, all of the same format, and pointed to by the > >> 1 dimensional array is a bit silly. Especially when one considers that > >> there > >> is code that makes assumptions about every protocol having the same > >> internal structures there. Some protocols don't WANT that > >> sort of structure. (for example the whole idea of a netmask is foreign > >> to appletalk). This needs to be made opaque to the external code. > >> > >> My suggested first change is to add routing method pointers to the > >> 'domain' structure, along with information pointing the data. > >> instead of having an array of pointers to uniform structures, > >> there would be an array pointing to the 'domain' structures > >> for each protocol address domain (protocol family), > >> and the methods this reached would be called. The methods would have > >> an argument that gives FIB number, but the protocol would be free > >> to ignore it. > >> > >> Interaction with the ARP layer/ LL layer would need to be > >> revisited as well. Qing Li has been working on this already. > >> > >> > >> diffs > >> for those with p4 access: > >> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121 > >> //depot/user/julian/routing/src/sys/... > >> > >> for those with the makediff perl script: > >> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 > >> //depot/user/julian/routing/src/sys/... > >> > >> for those with neither: > >> > >> http://people.freebsd.org/~julian/mrt2.diff > >> > >> I just put the userland utility in usr.sbin/setfib/ in p4. > >> and changes to netstat in usr.bin/netstat/ > >> > >> see: > >> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO > >> > >> > >> > >> > >> I'd like to get comments on this (compat) version, so that I can > >> commit it, > >> get general testing under way to start the clock for MFC, and then get > >> moving on the fuller implementation (that breaks ABIs) and other > >> routing issues. > >> > >> > >> Julian > >> > >> > >> > >> > >> _______________________________________________ > >> freebsd-arch@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-arch > >> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > >> > > -- "UNIX is basically a simple operating system, but you have to be a genius to understand the simplicity." Dennis Ritchie