From owner-freebsd-net@FreeBSD.ORG Thu Dec 27 01:28:04 2007
Date: Thu, 27 Dec 2007 03:28:01 +0200
From: "Ivo Vachkov"
To: "Julian Elischer"
In-Reply-To: <4772F123.5030303@elischer.org>
References: <4772F123.5030303@elischer.org>
Cc: FreeBSD
 Net, Robert Watson, Qing Li, arch@freebsd.org
Subject: Re: resend: multiple routing table roadmap (format fix)
List-Id: Networking and TCP/IP with FreeBSD

On Dec 27, 2007 2:26 AM, Julian Elischer wrote:
> Resending as my mailer made a dog's breakfast of the first one
> with all sorts of weird line breaks... hopefully this will be better.
> (I haven't sent it yet so I'm hoping)..
>
> -------------------------------------------
>
> One thing where FreeBSD has been falling behind, and which by chance I
> have some time to work on, is "policy based routing", which allows
> different packet streams to be routed by more than just the destination
> address.
>
> Constraints:
> ------------
>
> I want to make some form of this available in the 6.x tree
> (and by extension 7.x), but FreeBSD in general needs it so I might as
> well do it in -current and back port the portions I need.
>
> One of the ways that this can be done is to have the ability to
> instantiate multiple kernel routing tables (which I will now
> refer to as "Forwarding Information Bases" or "FIBs" for political
> correctness reasons). Which FIB a particular packet uses to make
> the next hop decision can be decided by a number of mechanisms.
> The policies these mechanisms implement are the "Policies" referred
> to in "Policy based routing".
>
> One of the constraints I have if I try to back port this work to
> 6.x is that it must be implemented as an EXTENSION to the existing
> ABIs in 6.x so that third party applications do not need to be
> recompiled within the timespan of the branch.
>
> Implementation method, (part 1)
> -------------------------------
> For this reason I have implemented a "sufficient subset" of a
> multiple routing table solution in Perforce, and back-ported it
> to 6.x.
> (also in Perforce, though not yet caught up with what I
> have done in -current/P4). The subset allows a number of FIBs
> to be defined at compile time (sufficient for my purposes in 6.x) and
> implements the changes needed to allow IPv4 to use them. I have not done
> the changes for IPv6 simply because I do not need it, and I do not
> have enough knowledge of IPv6 (e.g. neighbor discovery) to do it.
>
> Other protocol families are left untouched and, should there be
> users with proprietary protocol families, they should continue to work
> and be oblivious to the existence of the extra FIBs.
>
> To understand how this is done, one must know that the current FIB
> code starts everything off with a single dimensional array of
> pointers to FIB head structures (one per protocol family), each of
> which in turn points to the trie of routes available to that family.
>
> The basic change in the ABI compatible version of the change is to
> extend that array to be a 2 dimensional array, so that
> instead of protocol family X looking at rt_tables[X] for the
> table it needs, it looks at rt_tables[Y][X], where for all
> protocol families except IPv4 Y is always 0.
> Code that is unaware of the change always just sees the first row
> of the table, which of course looks just like the one dimensional
> array that existed before.

Pretty much like the OpenBSD approach :)

> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
> are all maintained, but refer only to the first row of the array,
> so that existing callers in proprietary protocols can continue to
> do the "right thing".
> Some new entry points are added, for the exclusive use of IPv4 code,
> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
> which take an extra argument that refers the code to the correct row.
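
A minimal sketch of the two dimensional layout described above, in C. Every identifier here (sketch_rt_tables, SKETCH_NUMFIBS, sketch_table(), sketch_in_table() and so on) is a hypothetical stand-in chosen for illustration, not a name or signature from the actual patch, and the table sizes are assumptions. It only tries to show why legacy callers that index row 0 keep working while IPv4 callers can pass an extra row number.

```c
#include <assert.h>
#include <stddef.h>

/* All names and sizes below are illustrative assumptions. */
#define SKETCH_NUMFIBS 4  /* number of FIBs, fixed at compile time */
#define SKETCH_AF_MAX  8  /* stand-in for the number of protocol families */
#define SKETCH_AF_INET 2  /* stand-in for the IPv4 family index */

/* Stand-in for a FIB head structure (the root of one routing trie). */
struct sketch_rnh {
    int fib;
    int family;
};

static struct sketch_rnh sketch_heads[SKETCH_NUMFIBS][SKETCH_AF_MAX];

/* rt_tables[Y][X]: row Y selects the FIB, column X the protocol family. */
static struct sketch_rnh *sketch_rt_tables[SKETCH_NUMFIBS][SKETCH_AF_MAX];

/* Populate row 0 for every family, and rows 1..N-1 for IPv4 only. */
static void sketch_init(void)
{
    for (int fib = 0; fib < SKETCH_NUMFIBS; fib++) {
        for (int af = 0; af < SKETCH_AF_MAX; af++) {
            sketch_heads[fib][af].fib = fib;
            sketch_heads[fib][af].family = af;
            sketch_rt_tables[fib][af] =
                (fib == 0 || af == SKETCH_AF_INET) ? &sketch_heads[fib][af] : NULL;
        }
    }
}

/* Legacy-style entry point: unaware of FIBs, so it only ever sees row 0. */
static struct sketch_rnh *sketch_table(int family)
{
    assert(family >= 0 && family < SKETCH_AF_MAX);
    return sketch_rt_tables[0][family];
}

/* IPv4-only entry point: the extra argument picks the row. */
static struct sketch_rnh *sketch_in_table(int fib)
{
    assert(fib >= 0 && fib < SKETCH_NUMFIBS);
    return sketch_rt_tables[fib][SKETCH_AF_INET];
}
```

After sketch_init(), sketch_table(SKETCH_AF_INET) and sketch_in_table(0) return the same head, which is the property that keeps FIB-unaware callers working unchanged.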
>
> In addition, there are some new entry points (currently called
> dom_rtalloc() and friends) that check the address family being
> looked up and either call rtalloc() (and friends) if the protocol
> is not IPv4, forcing the action to row 0, or direct it to the
> appropriate row if it IS IPv4 (and that info is available). These are
> for calling from code that is not specific to any particular protocol.
> The way these are implemented would change in the non ABI preserving
> code to be added later.
>
> One feature of the first version of the code is that for IPv4,
> the interface routes show up automatically on all the FIBs, so
> that no matter what FIB you select you always have the basic
> direct attached hosts available to you. (rtinit() does this
> automatically).
> You CAN delete an interface route from one FIB should you want
> to, but by default it's there. ARP information is also available
> in each FIB. It's assumed that the same machine would have the
> same MAC address, regardless of which FIB you are using to get
> to it.
>
>
> This brings us to how the correct FIB is selected for an outgoing
> IPv4 packet.
>
> Packets fall into one of a number of classes.
> 1/ locally generated packets, coming from a socket/PCB.
>    Such packets select a FIB from a number associated with the
>    socket/PCB. This in turn is inherited from the process,
>    but can be changed by a socket option. The process in turn
>    inherits it on fork. I have written a utility called setfib
>    that acts a bit like nice..
>
>    setfib -n 3 ping target.example.com # will use fib 3 for ping.
>
> 2/ packets received on an interface for forwarding.
>    By default these packets would use table 0,
>    (or possibly a number settable in a sysctl (not yet)),
>    but prior to routing the firewall can inspect them (see below).
>
> 3/ packets inspected by a packet classifier, which can arbitrarily
>    associate a fib with it on a packet by packet basis.
>    A fib assigned to a packet by a packet classifier
>    (such as ipfw) would override a fib associated by
>    a more default source (such as cases 1 or 2).

For the 2/ and 3/ cases I added (in some personal work I've been doing
lately) an additional field to struct mbuf, which a packet filter or
other application can set when the packet is received and which
indicates the right table to use for the lookup. This way a simple
"marking" can be used to separate different flows and implement
policy based routing.

> Routing messages would be associated with their
> process, and thus select one FIB or another.
>
> In addition Netstat has been edited to be able to cope with the
> fact that the array is now 2 dimensional. (It looks in system
> memory using libkvm (!)).
>
> In addition two sysctls are added to give:
> a) the number of FIBs compiled in (active)
> b) the default FIB of the calling process.
>
> Early testing experience:
> -------------------------
>
> Basically our (IronPort's) appliance does this functionality already
> using ipfw fwd but that method has some drawbacks.
>
> For example,
> it can't fully simulate a routing table because it can't influence the
> socket's choice of local address when a connect() is done.
>
>
> Testing during the generating of these changes has been
> remarkably smooth so far. Multiple tables have co-existed
> with no notable side effects, and packets have been routed
> accordingly.
>
> I have not yet added the changes to ipfw.
> pf has some similar changes already but they seem to rely on
> the various FIBs having symbolic names, which I do not plan to support
> in the first version of these changes.
>
> SCTP has, interestingly enough, built in support for this, called VRFs
> in Cisco parlance. It will be interesting to see how that handles it
> when it suddenly actually does something.
>
> I have not redone my testing since my last edits, but will be
> retesting with the current code asap.
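
To make the selection rules in cases 1/ to 3/ concrete, here is a small C sketch of the precedence they describe: a FIB set by a packet classifier wins, otherwise the socket's FIB is used for locally generated packets, otherwise the default table (0 in the text) applies to forwarded packets. The struct, its fields, the SKETCH_FIB_UNSET sentinel and sketch_select_fib() are all hypothetical names invented for this illustration; they are not taken from the patch or from struct mbuf.

```c
#include <assert.h>

/* Hypothetical sentinel meaning "no FIB has been assigned here". */
#define SKETCH_FIB_UNSET (-1)

/* Illustrative stand-in for the per-packet state relevant to FIB choice. */
struct sketch_pkt {
    int classifier_fib; /* set by a packet classifier (case 3), or UNSET */
    int socket_fib;     /* from the originating socket (case 1), or UNSET
                           for packets received for forwarding (case 2) */
};

/*
 * Pick the FIB for an outgoing packet.
 * Precedence: classifier mark, then socket FIB, then the default table.
 */
static int sketch_select_fib(const struct sketch_pkt *p, int default_fib)
{
    if (p->classifier_fib != SKETCH_FIB_UNSET)
        return p->classifier_fib;
    if (p->socket_fib != SKETCH_FIB_UNSET)
        return p->socket_fib;
    return default_fib;
}
```

With a default of 0, a packet from a socket bound to FIB 3 resolves to 3, a forwarded packet with no mark resolves to 0, and either kind resolves to the classifier's value once it has been marked.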
>
>
> Where to next:
> --------------------
>
> After committing the ABI compatible version and MFCing it, I'd
> like to proceed in a forward direction in -current. This will
> result in some roto-tilling in the routing code.
>
> Firstly: the current code's idea of having a separate tree per
> protocol family, all of the same format, and pointed to by the
> 1 dimensional array is a bit silly. Especially when one considers that
> there is code that makes assumptions about every protocol having the
> same internal structures there. Some protocols don't WANT that
> sort of structure. (For example, the whole idea of a netmask is foreign
> to appletalk). This needs to be made opaque to the external code.
>
> My suggested first change is to add routing method pointers to the
> 'domain' structure, along with information pointing to the data.
> Instead of having an array of pointers to uniform structures,
> there would be an array pointing to the 'domain' structures
> for each protocol address domain (protocol family),
> and the methods reached this way would be called. The methods would
> have an argument that gives the FIB number, but the protocol would be
> free to ignore it.
>
> Interaction with the ARP layer/ LL layer would need to be
> revisited as well. Qing Li has been working on this already.
>
>
> diffs
> for those with p4 access:
> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
> //depot/user/julian/routing/src/sys/...
>
> for those with the makediff perl script:
> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121
> //depot/user/julian/routing/src/sys/...
>
> for those with neither:
>
> http://people.freebsd.org/~julian/mrt2.diff
>
> I just put the userland utility in usr.sbin/setfib/ in p4.
> and changes to netstat in usr.bin/netstat/
>
> see:
> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
>
>
> I'd like to get comments on this (compat) version, so that I can
> commit it, get general testing under way to start the clock for MFC,
> and then get moving on the fuller implementation (that breaks ABIs)
> and other routing issues.
>
>
> Julian