From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 06:33:04 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2D7B516A417; Sun, 23 Dec 2007 06:33:04 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id E43D513C447; Sun, 23 Dec 2007 06:33:03 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id lBN6Ww4a017018; Sun, 23 Dec 2007 01:32:59 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Sat, 22 Dec 2007 20:34:16 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Andre Oppermann In-Reply-To: <476A4DCC.4040206@freebsd.org> Message-ID: <20071222203120.A899@desktop> References: <20071219211025.T899@desktop> <476A4DCC.4040206@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org Subject: Re: Linux compatible setaffinity. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Dec 2007 06:33:04 -0000 On Thu, 20 Dec 2007, Andre Oppermann wrote: > Jeff Roberson wrote: >> I have implemented a linux compatible sched_setaffinity() call which is >> somewhat crippled. This allows a userspace process to supply a bitmask of >> processors which it will run on. I have copied the linux interface such >> that it should be api compatible because I believe it is a sensible >> interface and they beat us to it by 3 years. 
> > The Linux (and Solaris) style setaffinity is rather low level and > any user of it has to make many assumptions based on incomplete > knowledge of the underlying hardware and its architecture (buses, > caches, latency between cores, etc). > > In practical use I'd rather have a function to bind myself to the > current CPU or CPU number X, and then to specify that new threads > or forked processes should emerge on another, but not this CPU. > Pepper that with a few hints like latency and cache affinity (important > or not important) the kernel can act on appropriately and it becomes > much more powerful and simpler to use. Taking it even further an > application may want to specify that it would like to run on a number > X of cores that are close (latency/cache) together, be permanently > bound to it and to repel any other such requests. This way I can > run my database server on socket 1 cores 1-4, and the webserver on > socket 2 cores 5-8 more or less automagically. sched_setaffinity > requires a lot of operator involvement and architecture knowledge > to make that happen. > > Not that I'm against a Linux compatible sched_setaffinity(), it's > just not as practical to use as other constructs. > > Food for thought. Well my hope is that the kernel scheduler has all of the required information about the processor to make these kinds of decisions for the general case. Right now we need better topology information in the kernel, but I think userspace only uses setaffinity in very special cases. I'd hate for it to become the norm in applications to start looking at cpu topology and making decisions based on that. Not that I would argue if someone were to implement this. I just want us to get it right often enough in the scheduler that it's not necessary. The uses for setaffinity that I have seen so far have been very special purpose. Or quite often just spawning one thread per cpu and pinning it in place for various purposes. 
Jeff > > -- > Andre > From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 06:35:16 2007 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B9A4F16A419; Sun, 23 Dec 2007 06:35:16 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 7CE7513C448; Sun, 23 Dec 2007 06:35:16 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id lBN6ZEmS017221; Sun, 23 Dec 2007 01:35:15 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Sat, 22 Dec 2007 20:36:32 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: David Xu In-Reply-To: <476B1973.6070902@freebsd.org> Message-ID: <20071222203443.U899@desktop> References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: Linux compatible setaffinity. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Dec 2007 06:35:16 -0000 On Fri, 21 Dec 2007, David Xu wrote: > Jeff Roberson wrote: >> I have implemented a linux compatible sched_setaffinity() call which is >> somewhat crippled. This allows a userspace process to supply a bitmask of >> processors which it will run on. I have copied the linux interface such >> that it should be api compatible because I believe it is a sensible >> interface and they beat us to it by 3 years. >> >> My implementation is crippled in that it supports binding by curthread only >> and to a single cpu only. 
Neither of the schedulers presently support
>> binding to multiple cpus or binding a non-curthread thread.  This property
>> is not inherited by forked threads and does not effect other threads in the
>> same process.  These two limitations can gradually be weakened without
>> effecting the syscall api.
>>
>> The linux api is:
>> int sched_setaffinity(pid_t pid, unsigned int cpusetsize, cpu_set_t
>> *mask);
>>
>> The cpu_set_t is the same as a fdset for select.  The cpusetsize argument
>> is used to determine the size of the array in mask.
>>
>> I'm mostly interested in feedback on how best to reduce the namespace
>> pollution and avoid pulling the sched.h file into the generated syscall
>> files (sysproto.h, etc).  Anyone who feels this is a terrible interface for
>> such a thing should speak up now.
>>
>> I also feel that in the medium term we will have to deal with machines with
>> more cores than bits in their native word.  Using these CPU_SET, CPU_CLR
>> macros is a fine way to deal with this issue.
>>
>> I also have a primitive 'taskset', although I don't like the name, it
>> allows you to run arbitrary programs bound to a single cpu.
>>
>> Thanks,
>> Jeff
>>
>
> I don't say no to these interfaces, but there is a need to tell
> user which cpus are sharing cache, or memory distance is closest enough,
> and which cpus are servicing interrupts, e.g, network interrupt and
> disks etc, etc, otherwise, blindly setting cpu affinity mask only
> can shoot itself in the foot.

I don't disagree with you; however, I think in most cases the affinity mask
is used by very special-purpose applications.  In the cases I have observed,
anyhow, the application is tailored to the particular machine, so hopefully
the programmer knows these things.  I would prefer that it not crop up as a
general interface that normal applications use to try to improve performance.
We should hope that we can improve the schedulers to do these things
automatically.
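
For readers who have not used an fd_set-style CPU mask, here is a minimal
sketch of the idea described in the quoted text: an array of words, so the
mask can name more CPUs than there are bits in a native word.  Everything
named demo_* or DEMO_* is hypothetical and invented for this illustration,
including the 256-CPU limit; the real Linux interface uses cpu_set_t with
CPU_SET, CPU_CLR and CPU_ISSET from <sched.h>.  The demo_single_cpu() helper
only mirrors the "single cpu only" restriction mentioned in the proposal and
is not taken from the actual patch.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit, chosen only for this sketch. */
#define DEMO_CPU_SETSIZE   256
#define DEMO_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS \
    ((DEMO_CPU_SETSIZE + DEMO_BITS_PER_WORD - 1) / DEMO_BITS_PER_WORD)

/* An array of words, like fd_set: not limited to one native word. */
typedef struct demo_cpuset {
    unsigned long bits[DEMO_WORDS];
} demo_cpuset;

#define DEMO_CPU_ZERO(set) memset((set), 0, sizeof(*(set)))
#define DEMO_CPU_SET(cpu, set) \
    ((set)->bits[(cpu) / DEMO_BITS_PER_WORD] |= \
        1UL << ((cpu) % DEMO_BITS_PER_WORD))
#define DEMO_CPU_CLR(cpu, set) \
    ((set)->bits[(cpu) / DEMO_BITS_PER_WORD] &= \
        ~(1UL << ((cpu) % DEMO_BITS_PER_WORD)))
#define DEMO_CPU_ISSET(cpu, set) \
    ((((set)->bits[(cpu) / DEMO_BITS_PER_WORD] >> \
        ((cpu) % DEMO_BITS_PER_WORD)) & 1UL) != 0)

/*
 * Return the CPU number if the mask names exactly one CPU, else -1.
 * A kernel that can only bind to a single CPU could use a check like
 * this to reject wider masks.
 */
static int
demo_single_cpu(const demo_cpuset *set)
{
    int found = -1;

    for (int cpu = 0; cpu < DEMO_CPU_SETSIZE; cpu++) {
        if (DEMO_CPU_ISSET(cpu, set)) {
            if (found != -1)
                return -1;      /* more than one CPU in the mask */
            found = cpu;
        }
    }
    return found;
}
```

Because the caller passes the size of the mask alongside the pointer, the
number of words can presumably grow later without changing the syscall
signature.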
Thanks, Jeff > > Regards, > David Xu > From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 06:47:30 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 31F6B16A41A; Sun, 23 Dec 2007 06:47:30 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 0746313C461; Sun, 23 Dec 2007 06:47:29 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.1/8.14.1) with ESMTP id lBN6fptc065424; Sat, 22 Dec 2007 23:41:51 -0700 (MST) (envelope-from imp@bsdimp.com) Date: Sat, 22 Dec 2007 23:45:26 -0700 (MST) Message-Id: <20071222.234526.246317277.imp@bsdimp.com> To: hselasky@c2i.net From: "M. Warner Losh" In-Reply-To: <200712202005.33263.hselasky@c2i.net> References: <200712202005.33263.hselasky@c2i.net> X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: phk@phk.freebsd.dk, alfred@freebsd.org, freebsd-arch@freebsd.org Subject: Re: More leaves on the device tree ? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Dec 2007 06:47:30 -0000 In message: <200712202005.33263.hselasky@c2i.net> Hans Petter Selasky writes: : I'm currently working on USB and I have been thinking about a simple way to : find what devices an USB device creates, and how to easily present that : information to the user. : : I know there is "devinfo" and I would like to extend this utility to also show : which devices under /dev belongs to the device. 
: : Implementation: : : "make_dev" takes an additional "device_t parent_device" argument and creates a : child device with some magic flags set. : : Any comments ? What do you do for all the devices in /dev/ for which there is no device_t parent? In general, we've tried to keep dev_t and device_t separate inside of the kernel. They are orthogonal, but related, things. This gets especially messy when you add to the mix NIC drivers, which create no devices, but have network interfaces. Do you also track that? What about the relationship to cloned or otherwise faked devices such as the floppy driver and many tty drivers produce. While it sounds simple and straight forward, I don't think that a good implementation that takes into account the complexities of actual hardware would be worth the complexity. Warner From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 07:44:50 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2E6CF16A417; Sun, 23 Dec 2007 07:44:50 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id B9E9013C44B; Sun, 23 Dec 2007 07:44:49 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 009DD17104; Sun, 23 Dec 2007 07:44:47 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id lBN7ikow005387; Sun, 23 Dec 2007 07:44:46 GMT (envelope-from phk@critter.freebsd.dk) To: "M. Warner Losh" From: "Poul-Henning Kamp" In-Reply-To: Your message of "Sat, 22 Dec 2007 23:45:26 MST." 
<20071222.234526.246317277.imp@bsdimp.com> Date: Sun, 23 Dec 2007 07:44:46 +0000 Message-ID: <5386.1198395886@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: freebsd-arch@freebsd.org, alfred@freebsd.org, hselasky@c2i.net Subject: Re: More leaves on the device tree ? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Dec 2007 07:44:50 -0000 In message <20071222.234526.246317277.imp@bsdimp.com>, "M. Warner Losh" writes: >In message: <200712202005.33263.hselasky@c2i.net> > Hans Petter Selasky writes: >: "make_dev" takes an additional "device_t parent_device" argument and creates a >: child device with some magic flags set. > >What do you do for all the devices in /dev/ for which there is no >device_t parent? I second Warners comments here. device_t is a handle for a hardware, dev_t is for a device in /dev, they are very different thing and have no reasonable mapping between them ([0..N]:[0..M]) -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 08:43:48 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1718F16A55F; Sun, 23 Dec 2007 08:43:48 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe05.swip.net [212.247.154.129]) by mx1.freebsd.org (Postfix) with ESMTP id 564F813C45A; Sun, 23 Dec 2007 08:43:47 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [193.217.102.3] (account mc467741@c2i.net HELO [10.0.0.105]) by mailfe05.swip.net (CommuniGate Pro SMTP 5.1.13) with ESMTPA id 641958604; Sun, 23 Dec 2007 09:43:45 +0100 From: Hans Petter Selasky To: "Poul-Henning Kamp" Date: Sun, 23 Dec 2007 09:44:27 +0100 User-Agent: KMail/1.9.7 References: <5386.1198395886@critter.freebsd.dk> In-Reply-To: <5386.1198395886@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712230944.29301.hselasky@c2i.net> Cc: alfred@freebsd.org, freebsd-arch@freebsd.org Subject: Re: More leaves on the device tree ? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Dec 2007 08:43:48 -0000 On Sunday 23 December 2007, Poul-Henning Kamp wrote: > In message <20071222.234526.246317277.imp@bsdimp.com>, "M. Warner Losh" writes: > >In message: <200712202005.33263.hselasky@c2i.net> > > > > Hans Petter Selasky writes: > >: "make_dev" takes an additional "device_t parent_device" argument and > >: creates a child device with some magic flags set. > > > >What do you do for all the devices in /dev/ for which there is no > >device_t parent? Hi, If the parent is NULL, then no dev_t node is created. 

Regarding cloned devices, my opinion is that they should always create a
visible entry. What I have done for a while now is to create a dummy dev_t
node. Invisible devices are annoying, because you never know what you
have got. For example "/dev/usbXXX".

What I do is simply create "/dev/usb0 " with a space at the end. This file
is not openable. Really there should be a flag for that. Then you
open "/dev/usb0" instead, but this device is never created. That's the clone
device. Then clones appear like "/dev/usb0.XX":

/dev/usb0 %    /dev/usb1 %    /dev/usb2 %    /dev/usb3 %
/dev/usb0.00%  /dev/usb1.00%  /dev/usb2.00%  /dev/usb3.00%

>
> I second Warners comments here.
>
> device_t is a handle for a hardware, dev_t is for a device in /dev,
> they are very different thing and have no reasonable mapping between
> them ([0..N]:[0..M])

I'm not saying that every make_dev() should take a device_t parent. If there
is no "device_t" parent then no node will be created.

Another approach is to add something like:

void device_enlist_subdev(device_t parent, dev_t sub);
void device_delist_subdev(device_t parent, dev_t sub);

struct device {
    ...
    LIST_HEAD(struct cdev) dv_cdev_children;
    ...
};

struct cdev {
    LIST_ENTRY( .... ) dv_list;
};

For example, if you have 8 USB serial port adapters, then you just get 8 TTY
devices like /dev/cuaUXXX . Finding out where the USB devices are
actually connected could be very simple if we could put some hints, perhaps,
in the device_t tree ?
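
The enlist/delist idea above can be tried out in userland with the
<sys/queue.h> list macros. This is only a sketch under stated assumptions:
the demo_device and demo_cdev types and all demo_* functions are hypothetical
stand-ins, not the kernel's struct device or struct cdev, and no locking is
shown.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a /dev node (not the kernel's struct cdev). */
struct demo_cdev {
    const char *name;
    LIST_ENTRY(demo_cdev) dv_list;      /* linkage on the parent's list */
};

/* Hypothetical stand-in for a hardware device (not struct device). */
struct demo_device {
    const char *nameunit;
    LIST_HEAD(, demo_cdev) dv_cdev_children;
};

static void
demo_device_init(struct demo_device *dev, const char *nameunit)
{
    dev->nameunit = nameunit;
    LIST_INIT(&dev->dv_cdev_children);
}

/* Record that "sub" was created on behalf of "parent". */
static void
demo_enlist_subdev(struct demo_device *parent, struct demo_cdev *sub)
{
    LIST_INSERT_HEAD(&parent->dv_cdev_children, sub, dv_list);
}

/* Forget the association again, e.g. before the node is destroyed. */
static void
demo_delist_subdev(struct demo_device *parent, struct demo_cdev *sub)
{
    (void)parent;               /* LIST_REMOVE only needs the element */
    LIST_REMOVE(sub, dv_list);
}

/* Walk the children, as a devinfo-like tool might do to print them. */
static int
demo_count_subdevs(const struct demo_device *parent)
{
    const struct demo_cdev *sub;
    int n = 0;

    LIST_FOREACH(sub, &parent->dv_cdev_children, dv_list)
        n++;
    return n;
}
```

Note that this records a one-to-many relation from a device to its nodes,
and leaves nodes without a hardware parent simply unlisted.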
--HPS From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 09:31:16 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 23F2616A46C for ; Sun, 23 Dec 2007 09:31:16 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 20BAB13C47E for ; Sun, 23 Dec 2007 09:31:16 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id 72A1C1A4D82; Sun, 23 Dec 2007 01:29:41 -0800 (PST) Date: Sun, 23 Dec 2007 01:29:41 -0800 From: Alfred Perlstein To: Hans Petter Selasky Message-ID: <20071223092941.GV16982@elvis.mu.org> References: <5386.1198395886@critter.freebsd.dk> <200712230944.29301.hselasky@c2i.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200712230944.29301.hselasky@c2i.net> User-Agent: Mutt/1.4.2.3i Cc: Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: More leaves on the device tree ? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Dec 2007 09:31:16 -0000 I think we're getting a little off track of the TODO items for the inclusion of the usb code. I think the original post mentions that devinfo(?) works for the time being, although perhaps some work later on presentation could be done. Let's stick with devinfo and hit up the next tasks on the inclusion list. I think the next thing is the SMP locking? Or do you have anything you have in mind from the list you would like to tackle next? thank you, -Alfred * Hans Petter Selasky [071223 00:42] wrote: > On Sunday 23 December 2007, Poul-Henning Kamp wrote: > > In message <20071222.234526.246317277.imp@bsdimp.com>, "M. 
Warner Losh" > writes: > > >In message: <200712202005.33263.hselasky@c2i.net> > > > > > > Hans Petter Selasky writes: > > >: "make_dev" takes an additional "device_t parent_device" argument and > > >: creates a child device with some magic flags set. > > > > > >What do you do for all the devices in /dev/ for which there is no > > >device_t parent? > > Hi, > > If the parent is NULL, then no dev_t node is created. > > Regarding cloned devices my opinion is that they should always create a > visible entry. What I have done for a while now is to create a dummy dev_t > node. It is so annoying with invisible devices. Then you never know what you > have got. For example "/dev/usbXXX". > > What I do is simply to create "/dev/usb0 " with a space in the end. This file > is not openable. Really there sould be a flag for that. Then you > open "/dev/usb0" instead, but this device is never created. That's the clone > device. Then clones appear like "/dev/usb0.XX": > > /dev/usb0 % /dev/usb1 % /dev/usb2 % /dev/usb3 % > /dev/usb0.00% /dev/usb1.00% /dev/usb2.00% /dev/usb3.00% > > > > > I second Warners comments here. > > > > device_t is a handle for a hardware, dev_t is for a device in /dev, > > they are very different thing and have no reasonable mapping between > > them ([0..N]:[0..M]) > > I'm not saying that every make_dev() should take a device_t parent. If there > is no "device_t" parent then there will be no node created. > > Another approach is to add something like: > > void device_enlist_subdev(device_t parent, dev_t sub); > void device_delist_subdev(device_t parent, dev_t sub); > > struct device { > ... > LIST_HEAD(struct cdev) dv_cdev_children; > ... > }; > > struct cdev { > LIST_ENTRY( .... ) dv_list; > }; > > For example if you have 8 USB serial port adapters, then you just get 8 TTY > devices like /dev/cuaUXXX . And finding out where the USB devices are > actually connected could be very simple if we could put some hints perhaps in > the device_t tree ? 
> > --HPS -- - Alfred Perlstein From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 10:31:41 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8CD7216A419; Sun, 23 Dec 2007 10:31:41 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe08.swip.net [212.247.154.225]) by mx1.freebsd.org (Postfix) with ESMTP id E9B1113C469; Sun, 23 Dec 2007 10:31:40 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [193.217.102.3] (account mc467741@c2i.net HELO [10.0.0.105]) by mailfe08.swip.net (CommuniGate Pro SMTP 5.1.13) with ESMTPA id 739633548; Sun, 23 Dec 2007 11:31:39 +0100 From: Hans Petter Selasky To: Alfred Perlstein Date: Sun, 23 Dec 2007 11:32:20 +0100 User-Agent: KMail/1.9.7 References: <5386.1198395886@critter.freebsd.dk> <200712230944.29301.hselasky@c2i.net> <20071223092941.GV16982@elvis.mu.org> In-Reply-To: <20071223092941.GV16982@elvis.mu.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712231132.22576.hselasky@c2i.net> Cc: Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: More leaves on the device tree ? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Dec 2007 10:31:41 -0000 On Sunday 23 December 2007, Alfred Perlstein wrote: > I think we're getting a little off track of the TODO items > for the inclusion of the usb code. > > I think the original post mentions that devinfo(?) works for > the time being, although perhaps some work later on presentation > could be done. > > Let's stick with devinfo and hit up the next tasks on the > inclusion list. > > I think the next thing is the SMP locking? 
Or do you have > anything you have in mind from the list you would like to > tackle next? No, just go ahead. What needs to be done about SMP locking ? --HPS From owner-freebsd-arch@FreeBSD.ORG Mon Dec 24 01:43:03 2007 Return-Path: Delivered-To: arch@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2BAB416A41B; Mon, 24 Dec 2007 01:43:03 +0000 (UTC) (envelope-from davidxu@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 3500313C45A; Mon, 24 Dec 2007 01:43:03 +0000 (UTC) (envelope-from davidxu@FreeBSD.org) Received: from apple.my.domain (root@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id lBO1gxND030035; Mon, 24 Dec 2007 01:43:01 GMT (envelope-from davidxu@freebsd.org) Message-ID: <476F0EE5.1040404@freebsd.org> Date: Mon, 24 Dec 2007 09:44:05 +0800 From: David Xu User-Agent: Thunderbird 2.0.0.9 (X11/20071211) MIME-Version: 1.0 To: Robert Watson References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org> <20071222183700.L5866@fledge.watson.org> In-Reply-To: <20071222183700.L5866@fledge.watson.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org Subject: Re: Linux compatible setaffinity. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 24 Dec 2007 01:43:03 -0000 Robert Watson wrote: > On Fri, 21 Dec 2007, David Xu wrote: > >> I don't say no to these interfaces, but there is a need to tell user >> which cpus are sharing cache, or memory distance is closest enough, >> and which cpus are servicing interrupts, e.g, network interrupt and >> disks etc, etc, otherwise, blindly setting cpu affinity mask only can >> shoot itself in the foot. > > While the Mac OS X API is pretty Mach-specific, it's worth taking a look > at their recently-announced affinity API: > > http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html > > > Robert N M Watson > Computer Laboratory > University of Cambridge > I like the interfaces, it is more flexible. Thanks David Xu From owner-freebsd-arch@FreeBSD.ORG Mon Dec 24 10:43:29 2007 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id ECFF816A417 for ; Mon, 24 Dec 2007 10:43:28 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id 9B0F813C46A for ; Mon, 24 Dec 2007 10:43:28 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 40AEC46BB2 for ; Mon, 24 Dec 2007 05:43:28 -0500 (EST) Date: Mon, 24 Dec 2007 10:43:28 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: arch@FreeBSD.org Message-ID: <20071224103322.C40176@fledge.watson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: 8.0 network stack MPsafety goals X-BeenThere: freebsd-arch@freebsd.org 
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
X-List-Received-Date: Mon, 24 Dec 2007 10:43:29 -0000

Dear all:

With the 7.0 release around the corner, many developers are starting to think
about (and in quite a few cases, work on) their goals for 8.0.  One of our
on-going kernel projects has been the elimination of the Giant lock, and that
project has transformed into one of optimizing behavior on increasing numbers
of processors.

In 7.0, despite the noteworthy accomplishment of eliminating debug.mpsafenet
and conditional network stack Giant acquisition, we were unable to fully
eliminate the IFF_NEEDSGIANT flag, which controls the conditional acquisition
of the Giant lock around non-MPSAFE network device drivers.  Primarily these
drivers are aging ISA network device drivers, although there are some
exceptions, such as the USB stack.

This e-mail proposes the elimination of the IFF_NEEDSGIANT flag and
associated infrastructure in FreeBSD 8.0, meaning that all network device
drivers must be able to operate without the Giant lock (largely the case
already).  Remaining drivers using the IFF_NEEDSGIANT flag must either be
updated, or less ideally, removed.  I propose the following schedule:

Date          Goals
----          -----
26 Dec 2007   Post proposed schedule for flag and infrastructure removal
              Post affected driver list

26 Jan 2008   Repost proposed schedule for flag and infrastructure removal
              Post updated affected driver list

26 Feb 2008   Adjust boot-time printf for affected drivers to generate a loud
              warning.
              Post updated affected driver list

26 May 2008   Post HEADS UP of impending driver disabling
              Post updated affected driver list

26 Jun 2008   Disable build of all drivers requiring IFF_NEEDSGIANT
              Post updated affected driver list

26 Sep 2008   Post HEADS UP of impending driver removal
              Post updated affected driver list

26 Oct 2008   Delete source of all drivers requiring IFF_NEEDSGIANT
              Remove flag and infrastructure

Here is a list of potentially affected drivers:

Name    Bus               Man page description
----    ---               --------------------
ar      ISA/PCI           synchronous Digi/Arnet device driver
arl     ISA               Aironet Arlan 655 wireless network adapter driver
awi     PCCARD            AMD PCnetMobile IEEE 802.11 PCMCIA wireless network
                          driver
axe     USB               ASIX Electronics AX88172 USB Ethernet driver
cdce    USB               USB Communication Device Class Ethernet driver
cnw     PCCARD            Netwave AirSurfer wireless network driver
cs      ISA/PCCARD        Ethernet device driver
cue     USB               CATC USB-EL1210A USB Ethernet driver
ex      ISA/PCCARD        Ethernet device driver for the Intel EtherExpress
                          Pro/10 and Pro/10+
fe      CBUS/ISA/PCCARD   Fujitsu MB86960A/MB86965A based Ethernet adapters
ic      I2C               I2C bus system
ie      ISA               Ethernet device driver
kue     USB               Kawasaki LSI KL5KUSB101B USB Ethernet driver
oltr    ISA/PCI           Olicom Token Ring device driver
plip    PPBUS             printer port Internet Protocol driver
ppp     TTY               point to point protocol network interface
ray     PCCARD            Raytheon Raylink/Webgear Aviator PCCard driver
rue     USB               RealTek RTL8150 USB to Fast Ethernet controller
                          driver
rum     USB               Ralink Technology USB IEEE 802.11a/b/g wireless
                          network device
sbni    ISA/PCI           Granch SBNI12 leased line modem driver
sbsh    PCI               Granch SBNI16 SHDSL modem device driver
sl      TTY               slip network interface
snc     ISA/PCCARD        National Semiconductor DP8393X SONIC Ethernet
                          adapter driver
sr      ISA/PCI           synchronous RISCom/N2 / WANic 400/405 device driver
udav    USB               Davicom DM9601 USB Ethernet driver
ural    USB               Ralink Technology RT2500USB IEEE 802.11 driver
xe      PCCARD            Xircom PCMCIA Ethernet device driver
zyd     USB               ZyDAS ZD1211/ZD1211B USB IEEE 802.11b/g wireless
                          network device

In some cases, the requirement for Giant is a property of a subsystem the
driver depends on rather than the driver itself; for example, the tty
subsystem for SLIP and PPP, and the USB subsystem for a number of USB
ethernet and wireless drivers.  With most of a year to go on the proposed
schedule, my hope is that we will have lots of time to address these issues,
but I wanted to get a roadmap out from a network protocol stack architecture
perspective so that device driver and subsystem authors could have a schedule
in mind.

FYI, the following drivers also reference IFF_NEEDSGIANT, but only in order
to provide their own conditional MPSAFEty, which can be removed without
affecting device driver functionality (I believe):

Name    Bus               Man page description
----    ---               --------------------
ce      PCI               driver for synchronous Cronyx Tau-PCI/32 WAN
                          adapters
cp      PCI               driver for synchronous Cronyx Tau-PCI WAN adapters
ctau    ISA               driver for synchronous Cronyx Tau WAN adapters
cx      ISA               driver for synchronous/asynchronous Cronyx Sigma
                          WAN adapters

Developers and users of the above drivers are strongly encouraged to update
the drivers to remove dependence on Giant, and/or make other contingency
plans.
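
To make "conditional acquisition of the Giant lock" concrete, here is a
rough userland sketch of the pattern, not the kernel code itself.  All
names (demo_ifnet, demo_giant, DEMO_IFF_NEEDSGIANT and its value) are
hypothetical, and the "lock" is a plain counter so that the example stays
single-threaded and self-contained.

```c
#include <stdbool.h>

/* Hypothetical flag value; the real IFF_NEEDSGIANT lives in <net/if.h>. */
#define DEMO_IFF_NEEDSGIANT 0x1u

/* Toy lock: records state only, it does not actually exclude anyone. */
struct demo_lock {
    bool     held;
    unsigned acquisitions;
};

/* Hypothetical, heavily reduced stand-in for struct ifnet. */
struct demo_ifnet {
    unsigned if_flags;
    unsigned packets_sent;
};

static struct demo_lock demo_giant;     /* stand-in for the Giant lock */

static void
demo_lock_acquire(struct demo_lock *l)
{
    l->held = true;
    l->acquisitions++;
}

static void
demo_lock_release(struct demo_lock *l)
{
    l->held = false;
}

/* Stand-in for a driver's transmit start routine. */
static void
demo_if_start(struct demo_ifnet *ifp)
{
    ifp->packets_sent++;
}

/*
 * The stack calls into the driver.  Only drivers that set the flag pay
 * for the global lock; MPSAFE drivers are entered without it.
 */
static void
demo_call_start(struct demo_ifnet *ifp)
{
    bool need_giant = (ifp->if_flags & DEMO_IFF_NEEDSGIANT) != 0;

    if (need_giant)
        demo_lock_acquire(&demo_giant);
    demo_if_start(ifp);
    if (need_giant)
        demo_lock_release(&demo_giant);
}
```

Removing the flag amounts to deleting both branches, which is only safe
once no remaining driver relies on the global lock for its own
serialization.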
Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Mon Dec 24 11:54:10 2007 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3460616A419; Mon, 24 Dec 2007 11:54:10 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id EF05B13C4DB; Mon, 24 Dec 2007 11:54:09 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 9BE6447911; Mon, 24 Dec 2007 06:54:09 -0500 (EST) Date: Mon, 24 Dec 2007 11:54:09 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: dima <_pppp@mail.ru> In-Reply-To: Message-ID: <20071224114504.E40176@fledge.watson.org> References: <20071220135342.O67327@fledge.watson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Re: TCP Projects for 8.0 - first cut wiki page X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 24 Dec 2007 11:54:10 -0000 On Thu, 20 Dec 2007, dima wrote: >> Per earlier e-mail, I've created a page to track the various on-going >> projects: >> >> http://wiki.freebsd.org/TCPProjects8 >> >> Rui has already kindly added the TCP ECN work to the page. > > As I know, we have a single swi:net thread in the kernel yet. Are there any > plans to make several such threads? If yes, this activity isn't mentioned in > wiki. > > There are 2 ideas: 1. per-core thread 2. per-interface thread I like the > second more. This is a kind of tricky point, and one we will definitely be looking at. 
In FreeBSD 6, we did link layer processing in the ithread, and deferred network layer and socket layer processing to the netisr and user thread. In FreeBSD 7, we process up through the network layer and socket delivery in the ithread, and only the socket read/copyout is deferred to the user thread. This means that in FreeBSD 7, we get true parallelism between different input sources. We still have the netisr, which is used for certain types of deferred processing, such as loopback network traffic (in order to avoid entering the receive path from the transmit path), IPSEC tunnel processing, etc., but for general ethernet traffic, it is not used.

This appears to work really well for a small number of interfaces because we eliminate a large number of context switches and push the "drop point" from software into hardware, meaning that we don't burn cycles doing link layer processing for packets that will never make it to the network layer (netisr queue overflow). The two real downsides are that this promotes network layer processing to interrupt priority rather than soft interrupt priority (and this may propagate to other threads), and that the opportunity for parallelism between the link layer and the network layer is reduced. The reason we went ahead and changed the default (it's configurable at runtime) is that in most cases we saw a significant performance improvement.

However, the current ithread/direct dispatch model has scaling issues as we approach larger numbers of interfaces, as the ithread approach does generally, because when the number of active threads exceeds the number of cores and the system is really busy, context switches are re-introduced, along with an increased chance of ithreads bouncing around, etc. What to do at that point is an interesting question: would we be better off reducing the number of active threads so that we have a small ithread worker pool serving many devices, for example?
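
The "drop point" argument above can be illustrated with a small toy model. This is only a sketch in Python, not kernel code: the function names, the `queue_limit` and `drain_every` parameters, and the counters are all made up for illustration, and the hardware-side drops that direct dispatch relies on under overload are not modelled at all.

```python
from collections import deque


def deferred_dispatch(packets, queue_limit, drain_every):
    """FreeBSD 6 style (toy model): link layer now, network layer later via a bounded queue."""
    queue = deque()
    stats = {"link_layer": 0, "network_layer": 0, "queue_drops": 0}
    for count, packet in enumerate(packets, start=1):
        stats["link_layer"] += 1              # cycles are spent before the packet's fate is known
        if len(queue) >= queue_limit:
            stats["queue_drops"] += 1         # overflow: the link layer work above was wasted
        else:
            queue.append(packet)
        if count % drain_every == 0 and queue:
            queue.popleft()                   # the deferred thread only runs now and then
            stats["network_layer"] += 1
    stats["network_layer"] += len(queue)      # whatever is still queued is processed eventually
    return stats


def direct_dispatch(packets):
    """FreeBSD 7 style (toy model): each packet runs to completion in the receiving thread."""
    stats = {"link_layer": 0, "network_layer": 0, "queue_drops": 0}
    for _ in packets:
        stats["link_layer"] += 1
        stats["network_layer"] += 1           # no software queue, so no software drop point
    return stats


if __name__ == "__main__":
    burst = range(100)
    print("deferred:", deferred_dispatch(burst, queue_limit=4, drain_every=2))
    print("direct:  ", direct_dispatch(burst))
```

In the deferred case every dropped packet has already paid for link layer processing, which is the wasted work described above. The model says nothing about the priority and thread-count downsides, which are the part that still needs work.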
So, in answer to your original question: we already do a per-interface thread for all in-bound processing in FreeBSD 7, but we'll need to continue to work on the underlying model and its behavior under high load. Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Tue Dec 25 03:35:13 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1962F16A420 for ; Tue, 25 Dec 2007 03:35:13 +0000 (UTC) (envelope-from brian.mcginty@gmail.com) Received: from wx-out-0506.google.com (wx-out-0506.google.com [66.249.82.239]) by mx1.freebsd.org (Postfix) with ESMTP id D287213C465 for ; Tue, 25 Dec 2007 03:35:12 +0000 (UTC) (envelope-from brian.mcginty@gmail.com) Received: by wx-out-0506.google.com with SMTP id i29so570603wxd.7 for ; Mon, 24 Dec 2007 19:35:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=TyTAIqfRtGCfHBSavDY+wQvZFqsa4x36vZGfFvbI6c8=; b=G/Vez16mmWHBJVA9OjJ+VFZKP2b0+k8nanRsF1s1L0VIqlXetO0sqcAFAC+3N+XCr5ZDSDMqKjNDMTfznbHMY7mutt0RXq4T/FVlyBnfQSXBU8qqH7KbY5I/CXL0NKFZpSzhuRhYXoATvnnOwbaOVm21iJcA4CjbXKT/h98InPg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=UxE1wnbaOJvHwlSCBLV9y7UkvVGMPr/gDtML8wnaDxGJc1RrO+FK7MZf332nWktqwQH31NPwlQljEPwDQtlc5cCnCrFEFRmIc/UqORdyG6nGfm3PNpahiau7Dp+QKi1jSU6Lw0yG2kPnxCCzurbkpNSkvdfZphLyuHpHbBzZe7M= Received: by 10.70.22.16 with SMTP id 16mr3574395wxv.45.1198552179856; Mon, 24 Dec 2007 19:09:39 -0800 (PST) Received: by 10.70.17.20 with HTTP; Mon, 24 Dec 2007 19:09:39 -0800 (PST) Message-ID: 
<601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com> Date: Mon, 24 Dec 2007 19:09:39 -0800 From: "Brian McGinty" To: "David Xu" In-Reply-To: <476F0EE5.1040404@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org> <20071222183700.L5866@fledge.watson.org> <476F0EE5.1040404@freebsd.org> Cc: arch@freebsd.org, Robert Watson Subject: Re: Linux compatible setaffinity. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Dec 2007 03:35:13 -0000 On Dec 23, 2007 5:44 PM, David Xu wrote: > > Robert Watson wrote: > > On Fri, 21 Dec 2007, David Xu wrote: > > > >> I don't say no to these interfaces, but there is a need to tell user > >> which cpus are sharing cache, or memory distance is closest enough, > >> and which cpus are servicing interrupts, e.g, network interrupt and > >> disks etc, etc, otherwise, blindly setting cpu affinity mask only can > >> shoot itself in the foot. > > > > While the Mac OS X API is pretty Mach-specific, it's worth taking a look > > at their recently-announced affinity API: > > > > http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html > > > > > > Robert N M Watson > > Computer Laboratory > > University of Cambridge > > > > > I like the interfaces, it is more flexible. I agree. May I as k what's being planned? It's Jeffs' call finally I think. Brian. 
From owner-freebsd-arch@FreeBSD.ORG Tue Dec 25 03:52:04 2007 Return-Path: Delivered-To: arch@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5A60D16A41A; Tue, 25 Dec 2007 03:52:04 +0000 (UTC) (envelope-from davidxu@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 4FC1D13C455; Tue, 25 Dec 2007 03:52:04 +0000 (UTC) (envelope-from davidxu@FreeBSD.org) Received: from apple.my.domain (root@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id lBP3q0Tb054785; Tue, 25 Dec 2007 03:52:02 GMT (envelope-from davidxu@freebsd.org) Message-ID: <47707EA2.8010002@freebsd.org> Date: Tue, 25 Dec 2007 11:53:06 +0800 From: David Xu User-Agent: Thunderbird 2.0.0.9 (X11/20071211) MIME-Version: 1.0 To: Brian McGinty References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org> <20071222183700.L5866@fledge.watson.org> <476F0EE5.1040404@freebsd.org> <601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com> In-Reply-To: <601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org, Robert Watson Subject: Re: Linux compatible setaffinity. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Dec 2007 03:52:04 -0000 Brian McGinty wrote: > On Dec 23, 2007 5:44 PM, David Xu wrote: >> Robert Watson wrote: >>> On Fri, 21 Dec 2007, David Xu wrote: >>> >>>> I don't say no to these interfaces, but there is a need to tell user >>>> which cpus are sharing cache, or memory distance is closest enough, >>>> and which cpus are servicing interrupts, e.g, network interrupt and >>>> disks etc, etc, otherwise, blindly setting cpu affinity mask only can >>>> shoot itself in the foot. >>> While the Mac OS X API is pretty Mach-specific, it's worth taking a look >>> at their recently-announced affinity API: >>> >>> http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html >>> >>> >>> Robert N M Watson >>> Computer Laboratory >>> University of Cambridge >>> >> >> I like the interfaces, it is more flexible. > > I agree. May I as k what's being planned? It's Jeffs' call finally I think. > > Brian. I don't have plan. ;-) If I understand it correctly, it is a hint to scheduler, it is better describing thread relationship, while Jeff's interface is a hard cpu binding interface, it is still needed in some circumstance. 
Regards, From owner-freebsd-arch@FreeBSD.ORG Tue Dec 25 05:19:44 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0536416A417; Tue, 25 Dec 2007 05:19:44 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id CB1C813C455; Tue, 25 Dec 2007 05:19:43 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id lBP5JeGG048514; Tue, 25 Dec 2007 00:19:42 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Mon, 24 Dec 2007 19:21:10 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: David Xu In-Reply-To: <47707EA2.8010002@freebsd.org> Message-ID: <20071224191954.Q73903@desktop> References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org> <20071222183700.L5866@fledge.watson.org> <476F0EE5.1040404@freebsd.org> <601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com> <47707EA2.8010002@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Brian McGinty , Robert Watson , arch@freebsd.org Subject: Re: Linux compatible setaffinity. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Dec 2007 05:19:44 -0000 On Tue, 25 Dec 2007, David Xu wrote: > Brian McGinty wrote: >> On Dec 23, 2007 5:44 PM, David Xu wrote: >>> Robert Watson wrote: >>>> On Fri, 21 Dec 2007, David Xu wrote: >>>> >>>>> I don't say no to these interfaces, but there is a need to tell user >>>>> which cpus are sharing cache, or memory distance is closest enough, >>>>> and which cpus are servicing interrupts, e.g, network interrupt and >>>>> disks etc, etc, otherwise, blindly setting cpu affinity mask only can >>>>> shoot itself in the foot. >>>> While the Mac OS X API is pretty Mach-specific, it's worth taking a look >>>> at their recently-announced affinity API: >>>> >>>> http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html >>>> >>>> >>>> Robert N M Watson >>>> Computer Laboratory >>>> University of Cambridge >>>> >>> >>> I like the interfaces, it is more flexible. >> >> I agree. May I as k what's being planned? It's Jeffs' call finally I think. >> >> Brian. > > I don't have plan. ;-) If I understand it correctly, it is a hint to > scheduler, it is better describing thread relationship, while Jeff's > interface is a hard cpu binding interface, it is still needed in some > circumstance. Yes, I don't think they're exclusive. However, the system scheduler makes some observations about what threads might be best placed near each other. I have plans to make ULE even smarter in this regard so that the application developers would almost never need to hint it. I think these kinds of hints are not often correct or very useful anyway. 
Thanks, Jeff > > Regards, > > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Tue Dec 25 20:10:54 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8811216A419 for ; Tue, 25 Dec 2007 20:10:54 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id 454A613C4D9 for ; Tue, 25 Dec 2007 20:10:54 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id E976146B94; Tue, 25 Dec 2007 15:10:53 -0500 (EST) Date: Tue, 25 Dec 2007 20:10:53 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Jeff Roberson In-Reply-To: <20071219211025.T899@desktop> Message-ID: <20071225201012.S85517@fledge.watson.org> References: <20071219211025.T899@desktop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org Subject: Re: Linux compatible setaffinity. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Dec 2007 20:10:54 -0000 On Wed, 19 Dec 2007, Jeff Roberson wrote: > I have implemented a linux compatible sched_setaffinity() call which is > somewhat crippled. This allows a userspace process to supply a bitmask of > processors which it will run on. I have copied the linux interface such > that it should be api compatible because I believe it is a sensible > interface and they beat us to it by 3 years. 
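
For readers who have not used the Linux call being copied, here is a rough sketch of what "supply a bitmask of processors" looks like from userspace. It uses Python's `os.sched_getaffinity()` and `os.sched_setaffinity()`, which wrap the Linux system calls and are only present on platforms that provide them; the helper names `cpus_to_mask`, `mask_to_cpus` and `pin_to_lowest_allowed_cpu` are invented for this example and are not part of the FreeBSD patch under discussion.

```python
import os


def cpus_to_mask(cpus):
    """Encode a set of CPU numbers as an integer bitmask (bit N set means CPU N is allowed)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def mask_to_cpus(mask):
    """Decode an integer bitmask back into the set of CPU numbers it allows."""
    return {bit for bit in range(mask.bit_length()) if (mask >> bit) & 1}


def pin_to_lowest_allowed_cpu(pid=0):
    """Hard-bind a process (0 means the caller) to the lowest CPU it may currently use.

    Returns the chosen CPU number, or None where the affinity calls are unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(pid)       # set of CPU numbers, e.g. {0, 1, 2, 3}
    target = min(allowed)
    os.sched_setaffinity(pid, {target})       # Python takes a set; the C call takes the bitmask
    return target


if __name__ == "__main__":
    print("mask for CPUs 0 and 2:", bin(cpus_to_mask({0, 2})))
    print("pinned to CPU:", pin_to_lowest_allowed_cpu())
```

Note that nothing in the call tells the caller which CPUs share a cache or service interrupts, which is the concern raised elsewhere in this thread about setting a mask blindly.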
BTW, I notice that you declare sched_getaffinity() in the user include file, but don't reserve a system call in syscalls.master or implement it. Is this intentional? Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 07:51:42 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6CA9E16A417; Wed, 26 Dec 2007 07:51:42 +0000 (UTC) (envelope-from deischen@freebsd.org) Received: from mail.netplex.net (mail.netplex.net [204.213.176.10]) by mx1.freebsd.org (Postfix) with ESMTP id 34AAA13C46A; Wed, 26 Dec 2007 07:51:41 +0000 (UTC) (envelope-from deischen@freebsd.org) Received: from sea.ntplx.net (sea.ntplx.net [204.213.176.11]) by mail.netplex.net (8.14.2/8.14.2/NETPLEX) with ESMTP id lBQ7pdfg003814; Wed, 26 Dec 2007 02:51:40 -0500 (EST) X-Virus-Scanned: by AMaViS and Clam AntiVirus (mail.netplex.net) X-Greylist: Message whitelisted by DRAC access database, not delayed by milter-greylist-4.0 (mail.netplex.net [204.213.176.10]); Wed, 26 Dec 2007 02:51:40 -0500 (EST) Date: Wed, 26 Dec 2007 02:51:39 -0500 (EST) From: Daniel Eischen X-X-Sender: eischen@sea.ntplx.net To: David Xu In-Reply-To: <47707EA2.8010002@freebsd.org> Message-ID: References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org> <20071222183700.L5866@fledge.watson.org> <476F0EE5.1040404@freebsd.org> <601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com> <47707EA2.8010002@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Brian McGinty , Robert Watson , arch@freebsd.org Subject: Re: Linux compatible setaffinity. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Daniel Eischen List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2007 07:51:42 -0000 On Tue, 25 Dec 2007, David Xu wrote: > Brian McGinty wrote: >> On Dec 23, 2007 5:44 PM, David Xu wrote: >>> Robert Watson wrote: >>>> On Fri, 21 Dec 2007, David Xu wrote: >>>> >>>>> I don't say no to these interfaces, but there is a need to tell user >>>>> which cpus are sharing cache, or memory distance is closest enough, >>>>> and which cpus are servicing interrupts, e.g, network interrupt and >>>>> disks etc, etc, otherwise, blindly setting cpu affinity mask only can >>>>> shoot itself in the foot. >>>> While the Mac OS X API is pretty Mach-specific, it's worth taking a look >>>> at their recently-announced affinity API: >>>> >>>> http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html >>>> >>>> >>>> Robert N M Watson >>>> Computer Laboratory >>>> University of Cambridge >>>> >>> >>> I like the interfaces, it is more flexible. >> >> I agree. May I as k what's being planned? It's Jeffs' call finally I think. >> >> Brian. > > I don't have plan. ;-) If I understand it correctly, it is a hint to > scheduler, it is better describing thread relationship, while Jeff's > interface is a hard cpu binding interface, it is still needed in some > circumstance. Please take a look at Solaris' API for processor set binding: http://docs.sun.com/app/docs/doc/816-5167/6mbb2jae6?a=expand See processor_bind, processor_info, and pset_*. 
-- DE From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 07:56:55 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 613D416A420 for ; Wed, 26 Dec 2007 07:56:55 +0000 (UTC) (envelope-from edwin@mavetju.org) Received: from mail5out.barnet.com.au (mail5.barnet.com.au [202.83.178.78]) by mx1.freebsd.org (Postfix) with ESMTP id 163F013C4D5 for ; Wed, 26 Dec 2007 07:56:54 +0000 (UTC) (envelope-from edwin@mavetju.org) Received: by mail5out.barnet.com.au (Postfix, from userid 1001) id AAA8F2218A90; Wed, 26 Dec 2007 18:56:53 +1100 (EST) X-Viruscan-Id: <4772094500007C2CD87F14@BarNet> Received: from mail5auth.barnet.com.au (mail5.barnet.com.au [202.83.178.78]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "mail5auth.barnet.com.au", Issuer "*.barnet.com.au" (verified OK)) by mail5.barnet.com.au (Postfix) with ESMTP id 74E4B21B18F7; Wed, 26 Dec 2007 18:56:53 +1100 (EST) Received: from k7.mavetju (k7.mavetju.org [10.251.1.18]) by mail5auth.barnet.com.au (Postfix) with ESMTP id 0E0202218A87; Wed, 26 Dec 2007 18:56:53 +1100 (EST) Received: by k7.mavetju (Postfix, from userid 1001) id 87F4D286; Wed, 26 Dec 2007 18:56:52 +1100 (EST) Date: Wed, 26 Dec 2007 18:56:52 +1100 From: Edwin Groothuis To: arch@freebsd.org, gnn@freebsd.org Message-ID: <20071226075652.GC40967@k7.mavetju> References: <20071209223042.GA40965@k7.mavetju> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i Cc: Subject: Re: bin/118292: Add support to remove all msg/shm/sem ids with ipcrm X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2007 07:56:55 -0000 On Mon, Dec 10, 2007 at 09:26:34AM -0500, gnn@freebsd.org wrote: 
> At Mon, 10 Dec 2007 09:30:42 +1100, > Edwin Groothuis wrote: > > > > Hello, > > > > A friend of me has submitted this PR and I promised him that I would > > see if I could get it implemented. I couldn't find anybody directly > > responsible for the ips/iprcm tools, so I throw it in here for > > discussion. [...] > > > > I will do it in two parts (according to the wishes of my mentor): > > First style(9)ify ipcrm.c, then the patch. > > > > If anybody has a good observation on this change, please speak up now. > > I have not read the patch in detail but I like the idea, we should be > able to easily clean such things up. It has been commited to HEAD, it will be MFCd when the src freezes are over. Edwin -- Edwin Groothuis | Personal website: http://www.mavetju.org edwin@mavetju.org | Weblog: http://www.mavetju.org/weblog/ From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 09:19:32 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BCF4F16A41A for ; Wed, 26 Dec 2007 09:19:32 +0000 (UTC) (envelope-from mindactive@ecastnews.com) Received: from attivonet.net (attivonet.net [72.3.236.78]) by mx1.freebsd.org (Postfix) with ESMTP id 861E013C46E for ; Wed, 26 Dec 2007 09:19:32 +0000 (UTC) (envelope-from mindactive@ecastnews.com) Received: (qmail 21826 invoked by uid 48); 26 Dec 2007 00:16:28 -0600 To: arch@freebsd.org Received: from mailer by www.ecastnews.com with HTTP (Mail); Wed, 26 Dec 2007 00:16:28 -0600 Date: Wed, 26 Dec 2007 00:16:28 -0600 From: "FullMotionMail.com" Message-ID: <3e8bbcb199281fbe09574cd1dd29cfe4@www.ecastnews.com> X-Priority: 3 X-Mailer: AC Mailer X-mid: YXJjaEBmcmVlYnNkLm9yZyAsIG0yOQ== MIME-Version: 1.0 Content-Type: text/plain; charset = "utf-8" Content-Transfer-Encoding: 8bit X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: Subject: FullMotionMail - Your Free Video eMail Source X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 
Precedence: list Reply-To: info@fullmotionmail.com List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2007 09:19:32 -0000
From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 19:41:11 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B07B016A418 for ; Wed, 26 Dec 2007 19:41:11 +0000 (UTC) (envelope-from aryeh.friedman@gmail.com) Received: from mta4.srv.hcvlny.cv.net (mta4.srv.hcvlny.cv.net [167.206.4.199]) by mx1.freebsd.org (Postfix) with ESMTP id 89A8B13C47E for ; Wed, 26 Dec 2007 19:41:11 +0000 (UTC) (envelope-from aryeh.friedman@gmail.com) Received: from flosoft.no-ip.biz (ool-435559b8.dyn.optonline.net [67.85.89.184]) by mta4.srv.hcvlny.cv.net (Sun Java System Messaging Server 6.2-8.04 (built Feb 28 2007)) with ESMTP id <0JTO002047YAP090@mta4.srv.hcvlny.cv.net> for freebsd-arch@freebsd.org; Wed, 26 Dec 2007 14:10:59 -0500 (EST) Received: from flosoft.no-ip.biz (localhost [IPv6:::1]) by flosoft.no-ip.biz (8.14.2/8.14.2) with ESMTP id lBQJAw7r019014 for ; Wed, 26 Dec 2007 14:10:58 -0500 Date: Wed, 26 Dec 2007 14:10:58 -0500 From: "Aryeh M.
Friedman" To: freebsd-arch@freebsd.org Message-id: <4772A742.4050106@gmail.com> MIME-version: 1.0 Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7BIT X-Enigmail-Version: 0.95.5 User-Agent: Thunderbird 2.0.0.9 (X11/20071217) Subject: Adding better database support to the base system X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2007 19:41:11 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Currently the only available DB support in the base system is Berkeley DB (1.x). Several items would benefit from migrating something like minisql into the base system. The most immediate application that comes to mind is enabling some interesting features for the ports system. Therefore I propose migrating some minimal RDBMS features into the base system. - -- Aryeh M. Friedman FloSoft Systems http://www.flosoft-systems.com Developer, not business, friendly -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.4 (FreeBSD) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHcqdCzIOMjAek4JIRAqw2AJ0Z1/xKy/fEafbQVP18oUDq2HPz9QCfbQBU 1cjpr9Wy/6zdXUT79tMJvoI= =sPik -----END PGP SIGNATURE----- From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 22:08:34 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 52CB016A418 for ; Wed, 26 Dec 2007 22:08:34 +0000 (UTC) (envelope-from dougb@FreeBSD.org) Received: from mail2.fluidhosting.com (mx21.fluidhosting.com [204.14.89.4]) by mx1.freebsd.org (Postfix) with SMTP id ECF5F13C457 for ; Wed, 26 Dec 2007 22:08:33 +0000 (UTC) (envelope-from dougb@FreeBSD.org) Received: (qmail 16908 invoked by uid 399); 26 Dec 2007 22:08:33 -0000 Received: from localhost (HELO ?192.168.0.4?)
(dougb@dougbarton.us@127.0.0.1) by localhost with ESMTP; 26 Dec 2007 22:08:33 -0000 X-Originating-IP: 127.0.0.1 Message-ID: <4772D0DF.2030505@FreeBSD.org> Date: Wed, 26 Dec 2007 14:08:31 -0800 From: Doug Barton Organization: http://www.FreeBSD.org/ User-Agent: Thunderbird 2.0.0.9 (Windows/20071031) MIME-Version: 1.0 To: "Aryeh M. Friedman" References: <4772A742.4050106@gmail.com> In-Reply-To: <4772A742.4050106@gmail.com> X-Enigmail-Version: 0.95.5 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: Adding better database support to the base system X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2007 22:08:34 -0000 Aryeh M. Friedman wrote: > Currently the only available DB support in the base system is Berkeley > DB (1.x) there are several items that would benefit from migrating > something like minisql into the base system. The most immediate > application that comes to mind is enabling some interesting features > for the ports system. Therefor I purpose migrating some minimal > RDBM's features into the base system. To get any sort of useful feedback your recommendation has to be much more specific. You should also focus on candidates that are BSD-licensed, or equivalent. 
Doug -- This .signature sanitized for your protection From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 23:31:38 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C8B8A16A419 for ; Wed, 26 Dec 2007 23:31:37 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.freebsd.org (Postfix) with ESMTP id 826D913C447 for ; Wed, 26 Dec 2007 23:31:37 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from root by ciao.gmane.org with local (Exim 4.43) id 1J7eq8-0006eW-1u for freebsd-arch@freebsd.org; Wed, 26 Dec 2007 22:35:04 +0000 Received: from 78-0-77-181.adsl.net.t-com.hr ([78.0.77.181]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 26 Dec 2007 22:35:04 +0000 Received: from ivoras by 78-0-77-181.adsl.net.t-com.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 26 Dec 2007 22:35:04 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Ivan Voras Date: Wed, 26 Dec 2007 23:33:09 +0100 Lines: 32 Message-ID: References: <4772A742.4050106@gmail.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enigD25FCDF5142033084E5624BD" X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: 78-0-77-181.adsl.net.t-com.hr User-Agent: Thunderbird 2.0.0.9 (Windows/20071031) In-Reply-To: <4772A742.4050106@gmail.com> X-Enigmail-Version: 0.95.5 Sender: news Subject: Re: Adding better database support to the base system X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2007 23:31:38 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) 
--------------enigD25FCDF5142033084E5624BD Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Aryeh M. Friedman wrote: > Currently the only available DB support in the base system is Berkeley > DB (1.x) there are several items that would benefit from migrating > something like minisql into the base system. The most immediate > application that comes to mind is enabling some interesting features > for the ports system. Therefor I purpose migrating some minimal > RDBM's features into the base system. Been there, tried that (SQLite), but unsuccessfully - people here REALLY like text files :) --------------enigD25FCDF5142033084E5624BD Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHctamldnAQVacBcgRAhNmAKD8Fd8CeMkj+TUhNoCuiFvggqI52ACfSnxw GUUfD0NkMsN1GA9k19zGDJg= =dyoy -----END PGP SIGNATURE----- --------------enigD25FCDF5142033084E5624BD-- From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 23:39:08 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8914C16A41A for ; Wed, 26 Dec 2007 23:39:08 +0000 (UTC) (envelope-from aryeh.friedman@gmail.com) Received: from mta4.srv.hcvlny.cv.net (mta4.srv.hcvlny.cv.net [167.206.4.199]) by mx1.freebsd.org (Postfix) with ESMTP id 63CD813C43E for ; Wed, 26 Dec 2007 23:39:08 +0000 (UTC) (envelope-from aryeh.friedman@gmail.com) Received: from flosoft.no-ip.biz (ool-435559b8.dyn.optonline.net [67.85.89.184]) by mta4.srv.hcvlny.cv.net (Sun Java System Messaging Server 6.2-8.04 (built Feb 28 2007)) with ESMTP id <0JTO002CYKCCGZD0@mta4.srv.hcvlny.cv.net>; Wed, 26 Dec 2007 18:38:39 -0500 (EST) Received: from flosoft.no-ip.biz (localhost 
[IPv6:::1]) by flosoft.no-ip.biz (8.14.2/8.14.2) with ESMTP id lBQNcRGY023924; Wed, 26 Dec 2007 18:38:29 -0500 Date: Wed, 26 Dec 2007 18:38:27 -0500 From: "Aryeh M. Friedman" In-reply-to: To: Ivan Voras Message-id: <4772E5F3.4010907@gmail.com> MIME-version: 1.0 Content-type: text/plain; charset=UTF-8 Content-transfer-encoding: 7BIT X-Enigmail-Version: 0.95.5 References: <4772A742.4050106@gmail.com> User-Agent: Thunderbird 2.0.0.9 (X11/20071217) Cc: freebsd-arch@freebsd.org Subject: Re: Adding better database support to the base system X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2007 23:39:08 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Ivan Voras wrote: > Aryeh M. Friedman wrote: >> Currently the only available DB support in the base system is Berkeley >> DB (1.x) there are several items that would benefit from migrating >> something like minisql into the base system. The most immediate >> application that comes to mind is enabling some interesting features >> for the ports system. Therefor I purpose migrating some minimal >> RDBM's features into the base system. > > Been there, tried that (SQLite), but unsuccessfully - people here REALLY > like text files :) > Thats funny because Berkeley DB and some other tools in the base system write binary files. Now that being said if worst comes to worst it is not that hard (in some ways at least) to layer a non-command language RDBMS on top of Berkeley db (i.e. it has a relational API but no user level commands). Basically all that one needs to do is group keyed values into structured records vs. free form data. - -- Aryeh M. 
Friedman FloSoft Systems http://www.flosoft-systems.com Developer, not business, friendly -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.4 (FreeBSD) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD4DBQFHcuXzzIOMjAek4JIRApeuAJ9OI2tjERLJ45kLUtbaNepydnlOOwCYluh3 3E2dyo6hEOjcS+pllXmRuA== =4/Wm -----END PGP SIGNATURE----- From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 23:48:35 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C758316A420 for ; Wed, 26 Dec 2007 23:48:35 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outQ.internet-mail-service.net (outQ.internet-mail-service.net [216.240.47.240]) by mx1.freebsd.org (Postfix) with ESMTP id AFA7113C457 for ; Wed, 26 Dec 2007 23:48:35 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP; Wed, 26 Dec 2007 15:48:34 -0800 Received: from julian-mac.elischer.org (localhost [127.0.0.1]) by idiom.com (Postfix) with ESMTP id CC2C9126D82; Wed, 26 Dec 2007 15:48:33 -0800 (PST) Message-ID: <4772E859.3090005@elischer.org> Date: Wed, 26 Dec 2007 15:48:41 -0800 From: Julian Elischer User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031) MIME-Version: 1.0 To: FreeBSD Net , arch@freebsd.org Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "Li, Qing" , Robert Watson Subject: multiple routing tables roadmap X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2007 23:48:35 -0000 On thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed 
by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree (and by extension 7.x), but FreeBSD in general needs it, so I might as well do it in -current and back-port the portions I need.

One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing".

One of the constraints I have if I try to back-port this work to 6.x is that it must be implemented as an EXTENSION to the existing ABIs in 6.x, so that third-party applications do not need to be recompiled within the timespan of the branch.

Implementation method, (part 1)
-------------------------------

For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x (also in Perforce, though not yet caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (sufficient for my purposes in 6.x) and implements the changes needed to allow IPv4 to use them. I have not done the changes for IPv6 simply because I do not need it, and I do not have enough knowledge of IPv6 (e.g. neighbor discovery) to do it.

Other protocol families are left untouched, and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB code starts everything off with a single-dimensional array of pointers to FIB head structures (one per protocol family), each of which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to extend that array to be a 2-dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X], where Y is always 0 for all protocol families except IPv4. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one-dimensional array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added for the exclusive use of IPv4 code, called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which take an extra argument that directs the code to the correct row.

In addition, there are some new entry points (currently called dom_rtalloc() and friends) that check the address family being looked up and either call rtalloc() (and friends) if the protocol is not IPv4, forcing the action to row 0, or use the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non-ABI-preserving code to be added later.

One feature of the first version of the code is that for IPv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic directly attached hosts available to you (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to, but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it.

This brings us to how the correct FIB is selected for an outgoing IPv4 packet.
Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility called setfib that acts a bit like nice:

   setfib -n 3 ping target.example.com # will use fib 3 for ping.

2/ packets received on an interface for forwarding. By default these packets would use table 0 (or possibly a number settable in a sysctl (not yet)), but prior to routing the firewall can inspect them (see below).

3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would override a fib associated by a more default source (such as cases 1 or 2).

Routing messages would be associated with their process, and thus select one FIB or another.

In addition, netstat has been edited to be able to cope with the fact that the array is now 2-dimensional. (It looks in system memory using libkvm (!)).

In addition, two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already using ipfw fwd, but that method has some drawbacks. For example, it can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly.

I have not yet added the changes to ipfw. pf has some similar changes already, but they seem to rely on the various FIBs having symbolic names, which I do not plan to support in the first version of these changes.
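
The selection rules for classes 1/ to 3/ earlier in this message can be sketched as a small toy model. This is only an illustration in Python, not kernel code and not taken from the patch: the names Process, Socket, Packet, select_fib and forwarding_default are all hypothetical, and the only behaviour modelled is the inheritance chain (process, then socket) and the rule that a classifier-assigned FIB overrides the more default sources.

```python
# Toy model of FIB selection for an outgoing IPv4 packet (NOT kernel code).
# Every class, field and function name here is a hypothetical stand-in.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Process:
    fib: int = 0

    def fork(self) -> "Process":
        # A child process inherits its parent's FIB number.
        return Process(fib=self.fib)


@dataclass
class Socket:
    fib: int

    @classmethod
    def create(cls, proc: Process) -> "Socket":
        # A new socket starts with the FIB of the creating process.
        return cls(fib=proc.fib)

    def set_fib(self, fib: int) -> None:
        # Stands in for the socket option mentioned in the message.
        self.fib = fib


@dataclass
class Packet:
    socket: Optional[Socket] = None        # None for forwarded packets
    classifier_fib: Optional[int] = None   # set by a classifier such as ipfw


def select_fib(pkt: Packet, forwarding_default: int = 0) -> int:
    if pkt.classifier_fib is not None:     # class 3/: overrides everything
        return pkt.classifier_fib
    if pkt.socket is not None:             # class 1/: locally generated
        return pkt.socket.fib
    return forwarding_default              # class 2/: forwarded, table 0


# Roughly what "setfib -n 3 ping ..." would amount to in this model.
parent = Process(fib=3)
child = parent.fork()
sock = Socket.create(child)
print(select_fib(Packet(socket=sock)))                    # socket's FIB
print(select_fib(Packet()))                               # forwarded packet
print(select_fib(Packet(socket=sock, classifier_fib=1)))  # classifier wins
```

Checking the classifier result first keeps the override rule in one place; the real code may order or structure these checks differently.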
SCTP has, interestingly enough, built-in support for this, called VRFs in Cisco parlance. It will be interesting to see how that handles it when it suddenly actually does something.

I have not redone my testing since my last edits, but will be retesting with the current code asap.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. This will result in some rototilling in the routing code.

Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1-dimensional array is a bit silly, especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure (for example, the whole idea of a netmask is foreign to AppleTalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. Instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods reached through it would be called. The methods would have an argument that gives the FIB number, but the protocol would be free to ignore it.

Interaction with the ARP layer/LL layer would need to be revisited as well. Qing Li has been working on this already.

diffs
for those with p4 access:
p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121 //depot/user/julian/routing/src/sys/...

for those with the makediff perl script:
perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 //depot/user/julian/routing/src/sys/...

for those with neither:
http://people.freebsd.org/~julian/mrt2.diff

I just put the userland utility in usr.sbin/setfib/ in p4.
and changes to netstat in usr.bin/netstat/ see: http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO I'd like to get comments on this (compat) version, so that I can commit it, get general testing under way to start the clock for MFC, and then get moving on the fuller implementation (that breaks ABIs) and other routing issues. Julian From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 00:26:05 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B339F16A418 for ; Thu, 27 Dec 2007 00:26:05 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outV.internet-mail-service.net (outV.internet-mail-service.net [216.240.47.245]) by mx1.freebsd.org (Postfix) with ESMTP id 9926D13C468 for ; Thu, 27 Dec 2007 00:26:05 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP; Wed, 26 Dec 2007 16:26:04 -0800 Received: from julian-mac.elischer.org (localhost [127.0.0.1]) by idiom.com (Postfix) with ESMTP id 235D4126D8C; Wed, 26 Dec 2007 16:26:04 -0800 (PST) Message-ID: <4772F123.5030303@elischer.org> Date: Wed, 26 Dec 2007 16:26:11 -0800 From: Julian Elischer User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031) MIME-Version: 1.0 To: FreeBSD Net , arch@freebsd.org, Robert Watson , Qing Li Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Subject: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2007 00:26:05 -0000 Resending as my mailer made a dog's breakfast of the first one with all sorts of wierd line breaks... 
hopefully this will be better. (I haven't sent it yet so I'm hoping)..

-------------------------------------------

One thing where FreeBSD has been falling behind, and which by chance I have some time to work on, is "policy based routing", which allows different packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree (and by extension 7.x), but FreeBSD in general needs it, so I might as well do it in -current and back-port the portions I need.

One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing".

One of the constraints I have if I try to back-port this work to 6.x is that it must be implemented as an EXTENSION to the existing ABIs in 6.x, so that third-party applications do not need to be recompiled within the timespan of the branch.

Implementation method, (part 1)
-------------------------------

For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x (also in Perforce, though not yet caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (sufficient for my purposes in 6.x) and implements the changes needed to allow IPv4 to use them. I have not done the changes for IPv6 simply because I do not need it, and I do not have enough knowledge of IPv6 (e.g. neighbor discovery) to do it.

Other protocol families are left untouched, and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB code starts everything off with a single-dimensional array of pointers to FIB head structures (one per protocol family), each of which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to extend that array to be a 2-dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X], where Y is always 0 for all protocol families except IPv4. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one-dimensional array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added for the exclusive use of IPv4 code, called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which take an extra argument that directs the code to the correct row.

In addition, there are some new entry points (currently called dom_rtalloc() and friends) that check the address family being looked up and either call rtalloc() (and friends) if the protocol is not IPv4, forcing the action to row 0, or use the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non-ABI-preserving code to be added later.

One feature of the first version of the code is that for IPv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic directly attached hosts available to you (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to, but by default it's there.
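
The table layout and entry points described so far can be pictured with a toy model. This is a Python illustration only, not the kernel code: rt_tables is modelled as a list of rows of dictionaries rather than radix tries, lookups are exact-match instead of longest-prefix, and the names NUM_FIBS, table_for, legacy_add_route, in_add_route, add_interface_route and lookup, as well as the address-family constants, are assumptions made for the sketch.

```python
# Toy model of the 2-dimensional routing-table array (NOT kernel code).
# Constants and helper names are illustrative assumptions.
AF_INET, AF_APPLETALK, AF_INET6 = 2, 16, 28
AF_MAX = 32
NUM_FIBS = 4   # stands in for the number of FIBs fixed at compile time

# rt_tables[fib][af]: one table (a dict here) per FIB row and address family.
# Row 0 alone looks like the old one-dimensional rt_tables[af] array.
rt_tables = [[{} for _ in range(AF_MAX + 1)] for _ in range(NUM_FIBS)]


def table_for(af, fib=0):
    """Pick the table; only IPv4 may use a row other than 0."""
    row = fib if af == AF_INET else 0
    return rt_tables[row][af]


def legacy_add_route(af, dest, gateway):
    """Old-style entry point: knows nothing about FIBs, always row 0."""
    rt_tables[0][af][dest] = gateway


def in_add_route(dest, gateway, fib):
    """IPv4-only entry point that takes an explicit FIB number."""
    rt_tables[fib][AF_INET][dest] = gateway


def add_interface_route(dest, ifname):
    """IPv4 interface routes are installed in every FIB by default."""
    for fib in range(NUM_FIBS):
        in_add_route(dest, ifname, fib)


def lookup(af, dest, fib=0):
    """Protocol-independent lookup, in the spirit of dom_rtalloc()."""
    return table_for(af, fib).get(dest)


add_interface_route("192.0.2.0/24", "em0")
in_add_route("default", "198.51.100.1", fib=3)
legacy_add_route(AF_INET6, "default", "2001:db8::1")

print(lookup(AF_INET, "192.0.2.0/24", fib=2))  # interface route, any FIB
print(lookup(AF_INET, "default", fib=3))       # only present in FIB 3
print(lookup(AF_INET, "default", fib=0))       # absent from FIB 0
print(lookup(AF_INET6, "default", fib=3))      # non-IPv4 is forced to row 0
```

Deleting the interface route from a single row of this model (for example with `del rt_tables[1][AF_INET]["192.0.2.0/24"]`) leaves the other FIBs untouched, which mirrors the per-FIB deletion mentioned above.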
ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it.

This brings us to how the correct FIB is selected for an outgoing IPv4 packet.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility called setfib that acts a bit like nice:

   setfib -n 3 ping target.example.com # will use fib 3 for ping.

2/ packets received on an interface for forwarding. By default these packets would use table 0 (or possibly a number settable in a sysctl (not yet)), but prior to routing the firewall can inspect them (see below).

3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would override a fib associated by a more default source (such as cases 1 or 2).

Routing messages would be associated with their process, and thus select one FIB or another.

In addition, netstat has been edited to be able to cope with the fact that the array is now 2-dimensional. (It looks in system memory using libkvm (!)).

In addition, two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already using ipfw fwd, but that method has some drawbacks. For example, it can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been remarkably smooth so far.
Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly.

I have not yet added the changes to ipfw. pf has some similar changes already, but they seem to rely on the various FIBs having symbolic names, which I do not plan to support in the first version of these changes.

SCTP has, interestingly enough, built-in support for this, called VRFs in Cisco parlance. It will be interesting to see how that handles it when it suddenly actually does something.

I have not redone my testing since my last edits, but will be retesting with the current code asap.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. This will result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1-dimensional array is a bit silly, especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure (for example, the whole idea of a netmask is foreign to AppleTalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. Instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods reached through it would be called. The methods would have an argument that gives the FIB number, but the protocol would be free to ignore it.

Interaction with the ARP layer/LL layer would need to be revisited as well. Qing Li has been working on this already.

diffs
for those with p4 access:
p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121 //depot/user/julian/routing/src/sys/...
for those with the makediff perl script: perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 //depot/user/julian/routing/src/sys/... for those with neither: http://people.freebsd.org/~julian/mrt2.diff I just put the userland utility in usr.sbin/setfib/ in p4. and changes to netstat in usr.bin/netstat/ see: http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO I'd like to get comments on this (compat) version, so that I can commit it, get general testing under way to start the clock for MFC, and then get moving on the fuller implementation (that breaks ABIs) and other routing issues. Julian From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 01:53:26 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0937616A41B for ; Thu, 27 Dec 2007 01:53:26 +0000 (UTC) (envelope-from ivo.vachkov@gmail.com) Received: from hs-out-2122.google.com (hs-out-0708.google.com [64.233.178.240]) by mx1.freebsd.org (Postfix) with ESMTP id A689F13C4DD for ; Thu, 27 Dec 2007 01:53:25 +0000 (UTC) (envelope-from ivo.vachkov@gmail.com) Received: by hs-out-2122.google.com with SMTP id j58so2225139hsj.11 for ; Wed, 26 Dec 2007 17:53:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=7P9byjzIODk7OjOMT/p8QTPLAQm1t3t9xirlo8LCWbQ=; b=PgbD/339ykW1ubzBfVxjCIypmFOJpiEuqCKRGhdWgsw4+US0lzwwBV0DhQVivKkvIZtVJ6eFnugyyE4NuAJzoWI4YwWTKinWwYWo50sCifDyb0/ifRIp0XD1kxj6XbigSIwz0OwRp4YLbDkx2j2jKyU0m6I+Rs/v1zIAQc3/pZI= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; 
b=jG1MNPOqYXnEan6K0C/XRO7sZWyrhrffthB7BWg91ZXxxcCOUwrQ92wpaUVByN3NI9ROYbPLyfbsz5yu80Soq4JzBAwu/9++82q3m09H0xmcdPj7tHFPm8aU3KrWD5Z7ELHXAACt8viEkrRAkrgmfhyS1ZfbAZkkBaNvbHlq2gE= Received: by 10.150.229.16 with SMTP id b16mr1961399ybh.115.1198718881884; Wed, 26 Dec 2007 17:28:01 -0800 (PST) Received: by 10.150.204.13 with HTTP; Wed, 26 Dec 2007 17:28:01 -0800 (PST) Message-ID: Date: Thu, 27 Dec 2007 03:28:01 +0200 From: "Ivo Vachkov" To: "Julian Elischer" In-Reply-To: <4772F123.5030303@elischer.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <4772F123.5030303@elischer.org> Cc: FreeBSD Net , Robert Watson , Qing Li , arch@freebsd.org Subject: Re: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2007 01:53:26 -0000 On Dec 27, 2007 2:26 AM, Julian Elischer wrote: > Resending as my mailer made a dog's breakfast of the first one > with all sorts of wierd line breaks... hopefully this will be better. > (I haven't sent it yet so I'm hoping).. > > > ------------------------------------------- > > > > On thing where FreeBSD has been falling behind, and which by chance I > have some time to work on is "policy based routing", which allows > different > packet streams to be routed by more than just the destination address. > > Constraints: > ------------ > > I want to make some form of this available in the 6.x tree > (and by extension 7.x) , but FreeBSD in general needs it so I might as > well > do it in -current and back port the portions I need. 
> > One of the ways that this can be done is to have the ability to > instantiate multiple kernel routing tables (which I will now > refer to as "Forwarding Information Bases" or "FIBs" for political > correctness reasons. Which FIB a particular packet uses to make > the next hop decision can be decided by a number of mechanisms. > The policies these mechanisms implement are the "Policies" referred > to in "Policy based routing". > > One of the constraints I have if I try to back port this work to > 6.x is that it must be implemented as a EXTENSION to the existing > ABIs in 6.x so that third party applications do not need to be > recompiled in timespan of the branch. > > Implementation method, (part 1) > ------------------------------- > For this reason I have implemented a "sufficient subset" of a > multiple routing table solution in Perforce, and back-ported it > to 6.x. (also in Perforce though not yet caught up with what I > have done in -current/P4). The subset allows a number of FIBs > to be defined at compile time (sufficient for my purposes in 6.x) and > implements the changes needed to allow IPV4 to use them. I have not done > the changes for ipv6 simply because I do not need it, and I do not > have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. > > Other protocol families are left untouched and should there be > users with proprietary protocol families, they should continue to work > and be oblivious to the existence of the extra FIBs. > > To understand how this is done, one must know that the current FIB > code starts everything off with a single dimensional array of > pointers to FIB head structures (One per protocol family), each of > which in turn points to the trie of routes available to that family. 
> > The basic change in the ABI compatible version of the change is to > extent that array to be a 2 dimensional array, so that > instead of protocol family X looking at rt_tables[X] for the > table it needs, it looks at rt_tables[Y][X] when for all > protocol families except ipv4 Y is always 0. > Code that is unaware of the change always just sees the first row > of the table, which of course looks just like the one dimensional > array that existed before. Pretty much like the OpenBSD approach :) > The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() > are all maintained, but refer only to the first row of the array, > so that existing callers in proprietary protocols can continue to > do the "right thing". > Some new entry points are added, for the exclusive use of ipv4 code > called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), > which have an extra argument which refers the code to the correct row. > > In addition, there are some new entry points (currently called > dom_rtalloc() and friends) that check the Address family being > looked up and call either rtalloc() (and friends) if the protocol > is not IPv4 forcing the action to row 0 or to the appropriate row > if it IS IPv4 (and that info is available). These are for calling > from code that is not specific to any particular protocol. The way > these are implemented would change in the non ABI preserving code > to be added later. > > One feature of the first version of the code is that for ipv4, > the interface routes show up automatically on all the FIBs, so > that no matter what FIB you select you always have the basic > direct attached hosts available to you. (rtinit() does this > automatically). > You CAN delete an interface route from one FIB should you want > to but by default it's there. ARP information is also available > in each FIB. It's assumed that the same machine would have the > same MAC address, regardless of which FIB you are using to get > to it. 
>
>
> This brings us as to how the correct FIB is selected for an outgoing
> IPV4 packet.
>
> Packets fall into one of a number of classes.
> 1/ locally generated packets, coming from a socket/PCB.
> Such packets select a FIB from a number associated with the
> socket/PCB. This in turn is inherited from the process,
> but can be changed by a socket option. The process in turn
> inherits it on fork. I have written a utility call setfib
> that acts a bit like nice..
>
> setfib -n 3 ping target.example.com # will use fib 3 for ping.
>
> 2/ packets received on an interface for forwarding.
> By default these packets would use table 0,
> (or possibly a number settable in a sysctl(not yet)).
> but prior to routing the firewall can inspect them (see below).
>
> 3/ packets inspected by a packet classifier, which can arbitrarily
> associate a fib with it on a packet by packet basis.
> A fib assigned to a packet by a packet classifier
> (such as ipfw) would over-ride a fib associated by
> a more default source. (such as cases 1 or 2).

For the 2/ and 3/ cases I added (in some personal work I've been doing lately) an additional field in struct mbuf, which a packet filter or other application can set on receipt and which points to the right table to use for the lookup. This way a simple "marking" can be used to separate different flows and create policy based routing.

> Routing messages would be associated with their
> process, and thus select one FIB or another.
>
> In addition Netstat has been edited to be able to cope with the
> fact that the array is now 2 dimensional. (It looks in system
> memory using libkvm (!)).
>
> In addition two sysctls are added to give:
> a) the number of FIBs compiled in (active)
> b) the default FIB of the calling process.
>
> Early testing experience:
> -------------------------
>
> Basically our (IronPort's) appliance does this functionality already
> using ipfw fwd but that method has some drawbacks.
>
> For example,
> It can't fully simulate a routing table because it can't influence the
> socket's choice of local address when a connect() is done.
>
>
> Testing during the generating of these changes has been
> remarkably smooth so far. Multiple tables have co-existed
> with no notable side effects, and packets have been routed
> accordingly.
>
> I have not yet added the changes to ipfw.
> pf has some similar changes already but they seem to rely on
> the various FIBs having symbolic names, which I do not plan to support
> in the first version of these changes.
>
> SCTP has interestingly enough built in support for this, called VRFs
> in Cisco parlance. It will be interesting to see how that handles it
> when it suddenly actually does something.
>
> I have not redone my testing since my last edits, but will be
> retesting with the current code asap.
>
>
> Where to next:
> --------------------
>
> After committing the ABI compatible version and MFCing it, I'd
> like to proceed in a forward direction in -current. This will
> result in some roto-tilling in the routing code.
>
> Firstly: the current code's idea of having a separate tree per
> protocol family, all of the same format, and pointed to by the
> 1 dimensional array is a bit silly. Especially when one considers that there
> is code that makes assumptions about every protocol having the same
> internal structures there. Some protocols don't WANT that
> sort of structure. (for example the whole idea of a netmask is foreign
> to appletalk). This needs to be made opaque to the external code.
>
> My suggested first change is to add routing method pointers to the
> 'domain' structure, along with information pointing to the data.
> Instead of having an array of pointers to uniform structures,
> there would be an array pointing to the 'domain' structures
> for each protocol address domain (protocol family),
> and the methods thus reached would be called. The methods would have
> an argument that gives FIB number, but the protocol would be free
> to ignore it.
>
> Interaction with the ARP layer/ LL layer would need to be
> revisited as well. Qing Li has been working on this already.
>
>
> diffs
> for those with p4 access:
> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
>    //depot/user/julian/routing/src/sys/...
>
> for those with the makediff perl script:
> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121
>    //depot/user/julian/routing/src/sys/...
>
> for those with neither:
>
> http://people.freebsd.org/~julian/mrt2.diff
>
> I just put the userland utility in usr.sbin/setfib/ in p4.
> and changes to netstat in usr.bin/netstat/
>
> see:
> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
>
>
> I'd like to get comments on this (compat) version, so that I can commit it,
> get general testing under way to start the clock for MFC, and then get
> moving on the fuller implementation (that breaks ABIs) and other
> routing issues.
> > > Julian > > > > > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 15:55:10 2007 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5F7DE16A418 for ; Thu, 27 Dec 2007 15:55:10 +0000 (UTC) (envelope-from rdivacky@vlk.vlakno.cz) Received: from vlakno.cz (vlk.vlakno.cz [62.168.28.247]) by mx1.freebsd.org (Postfix) with ESMTP id 0D96F13C45B for ; Thu, 27 Dec 2007 15:55:09 +0000 (UTC) (envelope-from rdivacky@vlk.vlakno.cz) Received: from localhost (localhost [127.0.0.1]) by vlakno.cz (Postfix) with ESMTP id E849F66B003; Thu, 27 Dec 2007 16:55:07 +0100 (CET) X-Virus-Scanned: amavisd-new at vlakno.cz Received: from vlakno.cz ([127.0.0.1]) by localhost (vlk.vlakno.cz [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 566F08Db+cHw; Thu, 27 Dec 2007 16:54:55 +0100 (CET) Received: from vlk.vlakno.cz (localhost [127.0.0.1]) by vlakno.cz (Postfix) with ESMTP id C8F6D66AFFF; Thu, 27 Dec 2007 16:54:55 +0100 (CET) Received: (from rdivacky@localhost) by vlk.vlakno.cz (8.13.8/8.13.8/Submit) id lBRFstlR023615; Thu, 27 Dec 2007 16:54:55 +0100 (CET) (envelope-from rdivacky) Date: Thu, 27 Dec 2007 16:54:55 +0100 From: Roman Divacky To: John Baldwin Message-ID: <20071227155455.GA23604@freebsd.org> References: <20071218092222.GA9695@freebsd.org> <200712201138.56423.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200712201138.56423.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i Cc: arch@FreeBSD.org, freebsd-arch@FreeBSD.org Subject: Re: final decision about *at syscalls X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD 
architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2007 15:55:10 -0000 > Considering Robert's paper on security race problems in things like systrace > stemming from when you copy parameters out of userland and into the kernel > multiple times, I think #2 is definitely the better choice. Also, namei() is > already thread aware AFAICT since 'struct componentname' already contains a > 'cnp_thread' member (was 'cnp_proc' in 4.x). two strong voices for #2, I am going that way... thnx From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 21:19:03 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1ECA716A421 for ; Thu, 27 Dec 2007 21:19:03 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outK.internet-mail-service.net (outK.internet-mail-service.net [216.240.47.234]) by mx1.freebsd.org (Postfix) with ESMTP id 930D013C468 for ; Thu, 27 Dec 2007 21:19:02 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP; Thu, 27 Dec 2007 13:19:01 -0800 Received: from julian-mac.elischer.org (localhost [127.0.0.1]) by idiom.com (Postfix) with ESMTP id 75B50126D9D; Thu, 27 Dec 2007 13:19:00 -0800 (PST) Message-ID: <477416CC.4090906@elischer.org> Date: Thu, 27 Dec 2007 13:19:08 -0800 From: Julian Elischer User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031) MIME-Version: 1.0 To: Ivo Vachkov References: <4772F123.5030303@elischer.org> In-Reply-To: Content-Type: text/plain;
charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: FreeBSD Net , Robert Watson , Qing Li , arch@freebsd.org Subject: Re: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2007 21:19:03 -0000

Ivo Vachkov wrote:
> On Dec 27, 2007 2:26 AM, Julian Elischer wrote:
>> Resending as my mailer made a dog's breakfast of the first one
>> with all sorts of weird line breaks... hopefully this will be better.
>> (I haven't sent it yet so I'm hoping)..
>>
>>
>> -------------------------------------------
>>
>>
>> One thing where FreeBSD has been falling behind, and which by chance I
>> have some time to work on, is "policy based routing", which allows different
>> packet streams to be routed by more than just the destination address.
>>
>> Constraints:
>> ------------
>>
>> I want to make some form of this available in the 6.x tree
>> (and by extension 7.x), but FreeBSD in general needs it so I might as well
>> do it in -current and back port the portions I need.
>>
>> One of the ways that this can be done is to have the ability to
>> instantiate multiple kernel routing tables (which I will now
>> refer to as "Forwarding Information Bases" or "FIBs" for political
>> correctness reasons). Which FIB a particular packet uses to make
>> the next hop decision can be decided by a number of mechanisms.
>> The policies these mechanisms implement are the "Policies" referred
>> to in "Policy based routing".
>>
>> One of the constraints I have if I try to back port this work to
>> 6.x is that it must be implemented as an EXTENSION to the existing
>> ABIs in 6.x so that third party applications do not need to be
>> recompiled in the timespan of the branch.
>>
>> Implementation method, (part 1)
>> -------------------------------
>> For this reason I have implemented a "sufficient subset" of a
>> multiple routing table solution in Perforce, and back-ported it
>> to 6.x. (also in Perforce though not yet caught up with what I
>> have done in -current/P4). The subset allows a number of FIBs
>> to be defined at compile time (sufficient for my purposes in 6.x) and
>> implements the changes needed to allow IPV4 to use them. I have not done
>> the changes for ipv6 simply because I do not need it, and I do not
>> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

By the way, I might add that in the 6.x compat. version I may end up
limiting the feature to 8 tables. This is because I need to store the
FIB number in an efficient way in the mbuf, and in a compatible manner
this is easiest done by stealing the top 4 bits in the mbuf flags word
(one "have FIB" bit plus a 3-bit FIB number) and defining them as:

#define M_HAVEFIB   0x10000000
#define M_FIBMASK   0x07
#define M_FIBNUM    0xe0000000
#define M_FIBSHIFT  29
#define m_getfib(_m, _default) (((_m)->m_flags & M_HAVEFIB) ? \
        (((_m)->m_flags >> M_FIBSHIFT) & M_FIBMASK) : (_default))
#define M_SETFIB(_m, _fib) do { \
        (_m)->m_flags &= ~M_FIBNUM; \
        (_m)->m_flags |= (M_HAVEFIB | (((_fib) & M_FIBMASK) << M_FIBSHIFT)); \
} while (0)

This then becomes very easy to change to use a tag or whatever is needed
in later versions, and the number can be expanded past 8 predefined FIBs
at that time.

>>
>> Other protocol families are left untouched and should there be
>> users with proprietary protocol families, they should continue to work
>> and be oblivious to the existence of the extra FIBs.
>>
>> To understand how this is done, one must know that the current FIB
>> code starts everything off with a single dimensional array of
>> pointers to FIB head structures (One per protocol family), each of
>> which in turn points to the trie of routes available to that family.
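
The flag-word packing described above can be tried out in userland. What follows is only a minimal sketch under stated assumptions: `struct fake_mbuf` is a hypothetical stand-in for the kernel's struct mbuf (only the flags word is modelled), the inline functions mirror the macros but are not the kernel's definitions, and the bit layout (bit 28 as the "have FIB" marker, bits 29-31 as the number) is taken from the constants in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mbuf; only the flags word is modelled. */
struct fake_mbuf {
	uint32_t m_flags;
};

#define M_HAVEFIB   0x10000000u  /* bit 28: a FIB number has been stored */
#define M_FIBNUM    0xe0000000u  /* bits 29-31: the stored FIB number */
#define M_FIBMASK   0x07u        /* three bits, so FIBs 0..7 */
#define M_FIBSHIFT  29

/* Return the stored FIB number, or dflt if none has been stored. */
static inline unsigned
m_getfib(const struct fake_mbuf *m, unsigned dflt)
{
	if ((m->m_flags & M_HAVEFIB) == 0)
		return dflt;
	return (m->m_flags >> M_FIBSHIFT) & M_FIBMASK;
}

/* Store a FIB number (0..7) without disturbing the lower 28 flag bits. */
static inline void
m_setfib(struct fake_mbuf *m, unsigned fib)
{
	assert(fib <= M_FIBMASK);
	m->m_flags &= ~M_FIBNUM;
	m->m_flags |= M_HAVEFIB | ((uint32_t)fib << M_FIBSHIFT);
}
```

The separate "have FIB" bit is what lets FIB 0 be distinguished from "no FIB assigned", which is why four bits buy only eight tables.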
>>
>> The basic change in the ABI compatible version of the change is to
>> extend that array to be a 2 dimensional array, so that
>> instead of protocol family X looking at rt_tables[X] for the
>> table it needs, it looks at rt_tables[Y][X] where, for all
>> protocol families except ipv4, Y is always 0.
>> Code that is unaware of the change always just sees the first row
>> of the table, which of course looks just like the one dimensional
>> array that existed before.
>
> Pretty much like the OpenBSD approach :)

Well, I did look at the code briefly, but I didn't base mine on it.

>
>> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
>> are all maintained, but refer only to the first row of the array,
>> so that existing callers in proprietary protocols can continue to
>> do the "right thing".
>> Some new entry points are added, for the exclusive use of ipv4 code
>> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
>> which have an extra argument which refers the code to the correct row.
>>
>> In addition, there are some new entry points (currently called
>> dom_rtalloc() and friends) that check the Address family being
>> looked up and call either rtalloc() (and friends) if the protocol
>> is not IPv4 forcing the action to row 0 or to the appropriate row
>> if it IS IPv4 (and that info is available). These are for calling
>> from code that is not specific to any particular protocol. The way
>> these are implemented would change in the non ABI preserving code
>> to be added later.
>>
>> One feature of the first version of the code is that for ipv4,
>> the interface routes show up automatically on all the FIBs, so
>> that no matter what FIB you select you always have the basic
>> direct attached hosts available to you. (rtinit() does this
>> automatically).
>> You CAN delete an interface route from one FIB should you want
>> to but by default it's there. ARP information is also available
>> in each FIB. It's assumed that the same machine would have the
>> same MAC address, regardless of which FIB you are using to get
>> to it.
>>
>>
>> This brings us to how the correct FIB is selected for an outgoing
>> IPV4 packet.
>>
>> Packets fall into one of a number of classes.
>> 1/ locally generated packets, coming from a socket/PCB.
>>    Such packets select a FIB from a number associated with the
>>    socket/PCB. This in turn is inherited from the process,
>>    but can be changed by a socket option. The process in turn
>>    inherits it on fork. I have written a utility called setfib
>>    that acts a bit like nice..
>>
>>    setfib -n 3 ping target.example.com # will use fib 3 for ping.
>>
>> 2/ packets received on an interface for forwarding.
>>    By default these packets would use table 0,
>>    (or possibly a number settable in a sysctl (not yet)),
>>    but prior to routing the firewall can inspect them (see below).
>>
>> 3/ packets inspected by a packet classifier, which can arbitrarily
>>    associate a fib with it on a packet by packet basis.
>>    A fib assigned to a packet by a packet classifier
>>    (such as ipfw) would over-ride a fib associated by
>>    a more default source. (such as cases 1 or 2).
>
> For the 2/ and 3/ cases I added (in some personal work I've been doing
> lately) an additional field in struct mbuf, which can be set by a packet
> filter or other application on receipt and which points to the right
> table to use for the lookup. This way a simple "marking" can be used
> to divide different flows and create policy based routing.

This would be the final way, but I want to really minimise problems in
the compat versions, so I'll avoid doing that for now.
Do you have this work available?
And have you looked at my diffs below?

>
>> Routing messages would be associated with their
>> process, and thus select one FIB or another.
>>
>> In addition Netstat has been edited to be able to cope with the
>> fact that the array is now 2 dimensional. (It looks in system
>> memory using libkvm (!)).
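
The selection order in cases 1/ to 3/ above, together with the rule that only IPv4 ever uses a row other than 0, can be sketched as follows. This is an illustration under stated assumptions, not code from the patch: `struct fib_hints`, `select_fib()` and `table_row()` are hypothetical names, and the struct only summarises what the message says is known about a packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

/* Hypothetical summary of what is known about a packet; not a kernel type. */
struct fib_hints {
	bool     classifier_set;  /* case 3: a classifier (e.g. ipfw) tagged it */
	unsigned classifier_fib;
	bool     from_socket;     /* case 1: locally generated via a socket/PCB */
	unsigned socket_fib;
};

/* Classifier beats socket; socket beats the forwarding default (case 2). */
static unsigned
select_fib(const struct fib_hints *h, unsigned forward_default)
{
	assert(h != NULL);
	if (h->classifier_set)
		return h->classifier_fib;
	if (h->from_socket)
		return h->socket_fib;
	return forward_default;
}

/* Row of the 2 dimensional table array: non-IPv4 families always use row 0. */
static unsigned
table_row(int family, unsigned fib)
{
	return (family == AF_INET) ? fib : 0;
}
```

A protocol-independent caller would then consult `rt_tables[table_row(family, fib)][family]`, which is the behaviour attributed to dom_rtalloc() and friends above.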
>>
>> In addition two sysctls are added to give:
>> a) the number of FIBs compiled in (active)
>> b) the default FIB of the calling process.
>>
>> Early testing experience:
>> -------------------------
>>
>> Basically our (IronPort's) appliance does this functionality already
>> using ipfw fwd but that method has some drawbacks.
>>
>> For example,
>> It can't fully simulate a routing table because it can't influence the
>> socket's choice of local address when a connect() is done.
>>
>>
>> Testing during the generating of these changes has been
>> remarkably smooth so far. Multiple tables have co-existed
>> with no notable side effects, and packets have been routed
>> accordingly.
>>
>> I have not yet added the changes to ipfw.
>> pf has some similar changes already but they seem to rely on
>> the various FIBs having symbolic names, which I do not plan to support
>> in the first version of these changes.
>>
>> SCTP has interestingly enough built in support for this, called VRFs
>> in Cisco parlance. It will be interesting to see how that handles it
>> when it suddenly actually does something.
>>
>> I have not redone my testing since my last edits, but will be
>> retesting with the current code asap.
>>
>>
>> Where to next:
>> --------------------
>>
>> After committing the ABI compatible version and MFCing it, I'd
>> like to proceed in a forward direction in -current. This will
>> result in some roto-tilling in the routing code.
>>
>> Firstly: the current code's idea of having a separate tree per
>> protocol family, all of the same format, and pointed to by the
>> 1 dimensional array is a bit silly. Especially when one considers that there
>> is code that makes assumptions about every protocol having the same
>> internal structures there. Some protocols don't WANT that
>> sort of structure. (for example the whole idea of a netmask is foreign
>> to appletalk). This needs to be made opaque to the external code.
>>
>> My suggested first change is to add routing method pointers to the
>> 'domain' structure, along with information pointing to the data.
>> Instead of having an array of pointers to uniform structures,
>> there would be an array pointing to the 'domain' structures
>> for each protocol address domain (protocol family),
>> and the methods thus reached would be called. The methods would have
>> an argument that gives FIB number, but the protocol would be free
>> to ignore it.
>>
>> Interaction with the ARP layer/ LL layer would need to be
>> revisited as well. Qing Li has been working on this already.
>>
>>
>> diffs
>> for those with p4 access:
>> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
>>    //depot/user/julian/routing/src/sys/...
>>
>> for those with the makediff perl script:
>> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121
>>    //depot/user/julian/routing/src/sys/...
>>
>> for those with neither:
>>
>> http://people.freebsd.org/~julian/mrt2.diff
>>
>> I just put the userland utility in usr.sbin/setfib/ in p4.
>> and changes to netstat in usr.bin/netstat/
>>
>> see:
>> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
>>
>>
>> I'd like to get comments on this (compat) version, so that I can commit it,
>> get general testing under way to start the clock for MFC, and then get
>> moving on the fuller implementation (that breaks ABIs) and other
>> routing issues.
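
The "routing method pointers in the domain structure" idea from the "Where to next" part above might look roughly like this. It is purely a sketch of the shape of such an interface: every name here (`struct rt_methods`, `struct domain_sketch`, `dom_lookup()`, `toy_lookup()`) is hypothetical, and nothing in the message fixes the real signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-domain routing methods; the layout is illustrative only. */
struct rt_methods {
	/* Returns 0 on success and stores an opaque next-hop handle in *nh. */
	int (*rtm_lookup)(void *rtdata, unsigned fib, const void *dst,
	    const void **nh);
};

/* Hypothetical slice of a 'domain' structure carrying methods plus data. */
struct domain_sketch {
	int                      dom_family;  /* protocol family number */
	const struct rt_methods *dom_rt;      /* routing methods, or NULL */
	void                    *dom_rtdata;  /* opaque to external code */
};

/* Protocol-independent entry point: dispatch through the domain's methods. */
static int
dom_lookup(const struct domain_sketch *dom, unsigned fib, const void *dst,
    const void **nh)
{
	assert(dom != NULL);
	if (dom->dom_rt == NULL || dom->dom_rt->rtm_lookup == NULL)
		return -1;
	return dom->dom_rt->rtm_lookup(dom->dom_rtdata, fib, dst, nh);
}

/* A toy protocol with a single table: it is free to ignore the FIB number. */
static int
toy_lookup(void *rtdata, unsigned fib, const void *dst, const void **nh)
{
	(void)rtdata;
	(void)fib;
	*nh = dst;   /* pretend every destination is directly reachable */
	return 0;
}
```

Because callers only ever see the method table and an opaque data pointer, a protocol with no notion of netmasks would not have to pretend to keep the same tree structure as IPv4.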
>> >> >> Julian >> >> >> >> >> _______________________________________________ >> freebsd-arch@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-arch >> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" >> From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 23:21:50 2007 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1149F16A419 for ; Thu, 27 Dec 2007 23:21:50 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id C183113C43E for ; Thu, 27 Dec 2007 23:21:49 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226289009-1834499 for ; Thu, 27 Dec 2007 18:24:01 -0500 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBRNLf8G054109 for ; Thu, 27 Dec 2007 18:21:43 -0500 (EST) (envelope-from jhb@FreeBSD.org) From: John Baldwin To: arch@FreeBSD.org Date: Thu, 27 Dec 2007 17:04:44 -0500 User-Agent: KMail/1.9.6 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712271704.44796.jhb@FreeBSD.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Thu, 27 Dec 2007 18:21:43 -0500 (EST) X-Virus-Scanned: ClamAV 0.91.2/5270/Thu Dec 27 12:48:18 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Subject: kernel features MIB X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 
Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2007 23:21:50 -0000

One of the things we have at work is a kern.features sysctl MIB that
contains nodes to indicate if a named feature is present. For example,
on i386 we have kern.features.pae and we auto enable -DPAE for kernel
modules if the currently running kernel is using PAE using that sysctl.
One of the patches I want to commit soon is support for handling
shm_open/shm_unlink directly in the kernel via swap-backed VM objects
(the long-heralded memfd stuff). I would like to have the sysctl MIB so
that libc's for older releases (e.g. libc.so.6) could use the syscalls
if they are available so that shm segments are shared between compat
apps (e.g. 4.x or 6.x) and up-to-date apps.

At work we don't have a pretty API for this at all, but I'm thinking for
FreeBSD we can do this:

    FEATURE(foo, "description of foo")

which is a macro to create the 'kern.features.foo' node and set it to 1.
Then we could have a routine in libc:

    int feature_present(const char *name);

That returns a boolean to indicate if a given feature is present or not
by invoking sysctlbyname(3), etc.

Any objections to the idea?
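
A minimal sketch of how such a libc routine could be built on sysctlbyname(3) is given below. It makes several assumptions beyond the message: the `mib_lookup_fn` hook, `feature_present_with()`, `sysctl_lookup()` and `demo_lookup()` are hypothetical names, the node is assumed to hold an int, and a missing node is treated as "feature absent". The hook exists only so the logic can be exercised on systems without a kern.features tree; the real lookup is compiled on FreeBSD only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical hook: fills *value for the named MIB, returns 0 on success. */
typedef int (*mib_lookup_fn)(const char *mib, int *value);

/* Returns 1 if kern.features.<name> exists and is non-zero, otherwise 0. */
static int
feature_present_with(mib_lookup_fn lookup, const char *name)
{
	char mib[128];
	int value = 0;
	int n;

	assert(lookup != NULL && name != NULL);
	n = snprintf(mib, sizeof(mib), "kern.features.%s", name);
	if (n < 0 || (size_t)n >= sizeof(mib))
		return 0;               /* name too long to be a real node */
	if (lookup(mib, &value) != 0)
		return 0;               /* missing node: feature absent */
	return value != 0;
}

#ifdef __FreeBSD__
/* Real lookup through sysctlbyname(3); built on FreeBSD only. */
static int
sysctl_lookup(const char *mib, int *value)
{
	size_t len = sizeof(*value);

	if (sysctlbyname(mib, value, &len, NULL, 0) != 0 ||
	    len != sizeof(*value))
		return -1;
	return 0;
}

int
feature_present(const char *name)
{
	return feature_present_with(sysctl_lookup, name);
}
#endif

/* Stand-in lookup for demonstration: pretends only "pae" is present. */
static int
demo_lookup(const char *mib, int *value)
{
	if (strcmp(mib, "kern.features.pae") != 0)
		return -1;
	*value = 1;
	return 0;
}
```

With this shape, a compat libc could call `feature_present()` once and fall back to its old userland implementation whenever the answer is 0.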
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 23:21:54 2007 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B585D16A420; Thu, 27 Dec 2007 23:21:54 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 2C29D13C45B; Thu, 27 Dec 2007 23:21:54 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226289019-1834499 for multiple; Thu, 27 Dec 2007 18:24:05 -0500 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBRNLf8H054109; Thu, 27 Dec 2007 18:21:47 -0500 (EST) (envelope-from jhb@FreeBSD.org) From: John Baldwin To: freebsd-arch@FreeBSD.org Date: Thu, 27 Dec 2007 18:05:40 -0500 User-Agent: KMail/1.9.6 References: <18378.1196596684@critter.freebsd.dk> <4752AABE.6090006@freebsd.org> In-Reply-To: <4752AABE.6090006@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712271805.40972.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Thu, 27 Dec 2007 18:21:48 -0500 (EST) X-Virus-Scanned: ClamAV 0.91.2/5270/Thu Dec 27 12:48:18 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Attilio Rao , arch@FreeBSD.org, Poul-Henning Kamp , Robert Watson , Andre Oppermann Subject: Re: New "timeout" api, to replace callout X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list 
List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2007 23:21:54 -0000 On Sunday 02 December 2007 07:53:18 am Andre Oppermann wrote: > Poul-Henning Kamp wrote: > > In message <4752998A.9030007@freebsd.org>, Andre Oppermann writes: > >> o TCP puts the timer into an allocated structure and upon close of the > >> session it has to be deallocated including stopping of all currently > >> running timers. > >> [...] > >> -> The timer facility should provide an atomic stop/remove call > >> that prevent any further callbacks upon return. It should not > >> do a 'drain' where the callback may be run anyway. > >> Note: We hold the lock the callback would have to obtain. > > > > It is my intent, that the implementation behind the new API will > > only ever grab the specified lock when it calls the timeout function. > > This is the same for the current one and pretty much a given. > > > When you do a timeout_disable() or timeout_cleanup() you will be > > sleeping on a mutex internal to the implementation, if the timeout > > is currently executing. > > This is the problematic part. We can't sleep in TCP when cleaning up > the timer. We're not always called from userland but from interrupt > context. And when calling the cleanup we currently hold the lock the > callout wants to obtain. We can't drop it either as the race would > be back again. What you describe here is the equivalent of callout_ > drain(). This is unfortunately unworkable in TCP's context. The > callout has to go away even if it is already pending and waiting on > the lock. Maybe that can only be solved by a flag in the lock saying > "give up and go away". The reason you need to do a drain is to allow for safe destroying of the lock. Specifically, drivers tend to do this: FOO_LOCK(sc); ... callout_stop(...); FOO_UNLOCK(sc); ... callout_drain(...); ... 
mtx_destroy(&sc->foo_mtx); If you don't have the drain and softclock is trying to acquire the backing mutex while you have it held (before the callout_stop) then Bad Things can happen if you don't do the drain. Having the lock just "give up" doesn't work either because if the memory containing the lock is free'd and reinitialized such that it looks enough like a valid lock then softclock (or its equivalent) will still try to obtain it. Also, you need to do a drain so it is safe to free the callout structure to prevent it from being recycled and having weird races where it gets recycled and rescheduled but the timer code thinks it has a pending stop for that pointer and so it aborts the wrong instance of the timer, etc. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 23:21:56 2007 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B7D4916A47B for ; Thu, 27 Dec 2007 23:21:56 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 7056E13C468 for ; Thu, 27 Dec 2007 23:21:56 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226289028-1834499 for multiple; Thu, 27 Dec 2007 18:24:08 -0500 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBRNLf8I054109; Thu, 27 Dec 2007 18:21:53 -0500 (EST) (envelope-from jhb@FreeBSD.org) From: John Baldwin To: freebsd-arch@FreeBSD.org Date: Thu, 27 Dec 2007 18:17:28 -0500 User-Agent: KMail/1.9.6 References: <15391.1196547545@critter.freebsd.dk> In-Reply-To: <15391.1196547545@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712271817.28789.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Thu, 27 Dec 2007 18:21:53 -0500 (EST) X-Virus-Scanned: ClamAV 0.91.2/5270/Thu Dec 27 12:48:18 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Poul-Henning Kamp Subject: Re: New "timeout" api, to replace callout X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe:
, List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2007 23:21:56 -0000

On Saturday 01 December 2007 05:19:05 pm Poul-Henning Kamp wrote:
>
> Here is my proposed new timeout API for 8.x.
>
> The primary objective is to make it possible to have multiple timeout
> "providers" of possibly different kind, so that we can have per-cpu
> or per-net-stack timeout handling.
>
> A secondary goal, is to shove the anti-race handling in destruction of
> timeouts back into the implementation, rather than force users to spend
> 20+ lines doing that.

I don't see this anymore. Perhaps you haven't looked at updated drivers
recently? Right now it looks like this:

foo_attach/create()
{
	mtx_init(&foo->lock, ...);
	callout_init_mtx(&foo->callout, &foo->lock);
}

foo_something()
{
	callout_reset(&foo->callout, foo_timer, ...);
}

/* Called with lock held */
foo_timer()
{
	/*
	 * Doesn't have to check 'is detaching' or any other such crap
	 * anymore.
	 */
}

foo_stop()
{
	FOO_LOCK();
	callout_stop(&foo->callout);
	/* foo_timer() will no longer run after this point. */
	FOO_UNLOCK();
}

foo_detach/destroy()
{
	foo_stop();
	/*
	 * This drain ensures softclock() is done frobbing with our mutex
	 * so we can safely destroy it. Also makes sure it has no references
	 * to our callout structure either.
	 */
	callout_drain(&foo->callout);
	mtx_destroy(&foo->lock);
}

That's not 20 lines. You have to do the reset/stop anyway and those now work
intuitively. The only "extra" code is an init routine (which you will need
anyway) and a teardown routine (callout_drain()).

From what I can tell, you've basically mandated a lock, and when you use
callout_init_mtx() (or now callout_init_rw()), callout_stop() ==
timeout_safe() and callout_drain() == timeout_cleanup(). Thus, as far as the
MPSAFEty stuff goes, I think the timeout changes are just reshuffling deck
chairs. The other goals (axeing hz) I agree with, but I don't think you've
changed anything as far as MPSAFEty is concerned.
Also, I'd probably find timeout_stop() more intuitive than timeout_safe() to be honest. Maybe timeout_disarm()? -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 02:26:05 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C27FA16A46E for ; Fri, 28 Dec 2007 02:26:05 +0000 (UTC) (envelope-from peterjeremy@optushome.com.au) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by mx1.freebsd.org (Postfix) with ESMTP id 405F913C469 for ; Fri, 28 Dec 2007 02:26:05 +0000 (UTC) (envelope-from peterjeremy@optushome.com.au) Received: from server.vk2pj.dyndns.org (c220-239-20-82.belrs4.nsw.optusnet.com.au [220.239.20.82]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id lBS2Pw0n032333 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 28 Dec 2007 13:25:58 +1100 Received: from server.vk2pj.dyndns.org (localhost.vk2pj.dyndns.org [127.0.0.1]) by server.vk2pj.dyndns.org (8.14.2/8.14.1) with ESMTP id lBS2PvwK048587; Fri, 28 Dec 2007 13:25:57 +1100 (EST) (envelope-from peter@server.vk2pj.dyndns.org) Received: (from peter@localhost) by server.vk2pj.dyndns.org (8.14.2/8.14.2/Submit) id lBS2Pv0k048586; Fri, 28 Dec 2007 13:25:57 +1100 (EST) (envelope-from peter) Date: Fri, 28 Dec 2007 13:25:57 +1100 From: Peter Jeremy To: "Aryeh M. 
Friedman" Message-ID: <20071228022557.GT40785@server.vk2pj.dyndns.org> References: <4772A742.4050106@gmail.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Pk6IbRAofICFmK5e" Content-Disposition: inline In-Reply-To: <4772A742.4050106@gmail.com> X-PGP-Key: http://members.optusnet.com.au/peterjeremy/pubkey.asc User-Agent: Mutt/1.5.17 (2007-11-01) Cc: freebsd-arch@freebsd.org Subject: Re: Adding better database support to the base system X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 02:26:05 -0000

On Wed, Dec 26, 2007 at 02:10:58PM -0500, Aryeh M. Friedman wrote:
>Currently the only available DB support in the base system is Berkeley
>DB (1.x); there are several items that would benefit from migrating
>something like minisql into the base system. The most immediate
>application that comes to mind is enabling some interesting features
>for the ports system. Therefore I propose migrating some minimal
>RDBMS features into the base system.

Firstly, minisql (mSQL) is a non-starter because of its license.

Secondly, you haven't provided any justification for the inclusion of
an RDBMS in the base system. In general, the base system only contains
tools necessary to build and manage the base system. In order to add
this to the base system, you need to justify both why an RDBMS is
needed and why it can't be a port.

Thirdly, this topic has been thrashed out recently and I suggest you
review that thread before continuing.
-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement an
MTA that is either RFC2821-compliant or matches their claimed behaviour.

From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 05:31:02 2007 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 733A416A418; Fri, 28 Dec 2007 05:31:02 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 5F04D13C457; Fri, 28 Dec 2007 05:31:02 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id 89C8E1A4D7C; Thu, 27 Dec 2007 21:29:06 -0800 (PST) Date: Thu, 27 Dec 2007 21:29:06 -0800 From: Alfred Perlstein To: John Baldwin Message-ID: <20071228052906.GP16982@elvis.mu.org> References: <200712271704.44796.jhb@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200712271704.44796.jhb@FreeBSD.org> User-Agent: Mutt/1.4.2.3i Cc: arch@FreeBSD.org Subject: Re: kernel features MIB X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 05:31:02 -0000

Sounds pretty rad.

* John Baldwin [071227 15:20] wrote:
> One of the things we have at work is a kern.features sysctl MIB that contains
> nodes to indicate if a named feature is present.
For example, on i386 we > have kern.features.pae and we auto enable -DPAE for kernel modules if the > currently running kernel is using PAE using that sysctl. > > One of the patches I want to commit soon is support for handling > shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the > long-heralded memfd stuff). I would like to have the sysctl MIB so that > libc's for older releases (e.g. libc.so.6) could use the syscalls if they are > available so that shm segments are shared between compat apps (e.g. 4.x or > 6.x) and up-to-date apps. > > At work we don't have a pretty API for this at all, but I'm thinking for > FreeBSD we can do this: > > FEATURE(foo, "description of foo") > > which is a macro to create the 'kern.features.foo' node and set it to 1. Then > we could have a routine in libc: > > int feature_present(const char *name); > > That returns a boolean to indicate if a given feature is present or not by > invoking sysctlbyname(3), etc. > > Any objections to the idea? 
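
As a rough illustration of the feature_present() routine proposed in the quoted text, the sketch below builds the 'kern.features.<name>' node name and treats a missing node as "feature absent". It is only a sketch under stated assumptions: the lookup callback type, feature_present_via() and fake_lookup() are hypothetical names introduced here so the logic can be exercised outside a FreeBSD kernel; on FreeBSD the callback would presumably wrap sysctlbyname(3), as the proposal suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical lookup hook (not part of the proposal): stores the node's
 * value and returns 0 on success, or returns -1 if the node does not exist.
 * On FreeBSD this would be a thin wrapper around sysctlbyname(3).
 */
typedef int (*mib_lookup_fn)(const char *mib, int *value);

/* Returns 1 if kern.features.<name> exists and is non-zero, else 0. */
int
feature_present_via(mib_lookup_fn lookup, const char *name)
{
	char mib[128];
	int value = 0;
	int n;

	n = snprintf(mib, sizeof(mib), "kern.features.%s", name);
	if (n < 0 || (size_t)n >= sizeof(mib))
		return (0);	/* name too long: treat as absent */
	if (lookup(mib, &value) != 0)
		return (0);	/* node missing: feature absent */
	return (value != 0);
}

/* Stand-in for the kernel: pretends only kern.features.pae exists. */
int
fake_lookup(const char *mib, int *value)
{
	if (strcmp(mib, "kern.features.pae") == 0) {
		*value = 1;
		return (0);
	}
	return (-1);
}
```

A caller such as an older libc could then test feature_present_via(lookup, "foo") once and fall back to its previous behaviour when the result is 0.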
> > -- > John Baldwin > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" -- - Alfred Perlstein From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 09:03:09 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A598816A418 for ; Fri, 28 Dec 2007 09:03:09 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe14.swipnet.se [212.247.155.161]) by mx1.freebsd.org (Postfix) with ESMTP id E3D9013C448 for ; Fri, 28 Dec 2007 09:03:07 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [193.217.102.3] (account mc467741@c2i.net HELO [10.0.0.249]) by mailfe14.swip.net (CommuniGate Pro SMTP 5.1.13) with ESMTPA id 11653783; Fri, 28 Dec 2007 10:03:06 +0100 From: Hans Petter Selasky To: freebsd-arch@freebsd.org Date: Fri, 28 Dec 2007 10:03:50 +0100 User-Agent: KMail/1.9.7 References: <18378.1196596684@critter.freebsd.dk> <4752AABE.6090006@freebsd.org> <200712271805.40972.jhb@freebsd.org> In-Reply-To: <200712271805.40972.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712281003.52062.hselasky@c2i.net> Cc: Andre Oppermann , Attilio Rao , arch@freebsd.org, Poul-Henning Kamp , Robert Watson Subject: Re: New "timeout" api, to replace callout X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 09:03:09 -0000 On Friday 28 December 2007, John Baldwin wrote: > On Sunday 02 December 2007 07:53:18 am Andre Oppermann wrote: > > Poul-Henning Kamp wrote: > > 
> In message <4752998A.9030007@freebsd.org>, Andre Oppermann writes: > > >> o TCP puts the timer into an allocated structure and upon close of > > >> the session it has to be deallocated including stopping of all > > >> currently running timers. > > >> [...] > > >> -> The timer facility should provide an atomic stop/remove call > > >> that prevent any further callbacks upon return. It should not > > >> do a 'drain' where the callback may be run anyway. > > >> Note: We hold the lock the callback would have to obtain. > > > > > > It is my intent, that the implementation behind the new API will > > > only ever grab the specified lock when it calls the timeout function. > > > > This is the same for the current one and pretty much a given. > > > > > When you do a timeout_disable() or timeout_cleanup() you will be > > > sleeping on a mutex internal to the implementation, if the timeout > > > is currently executing. > > > > This is the problematic part. We can't sleep in TCP when cleaning up > > the timer. We're not always called from userland but from interrupt > > context. And when calling the cleanup we currently hold the lock the > > callout wants to obtain. We can't drop it either as the race would > > be back again. What you describe here is the equivalent of callout_ > > drain(). This is unfortunately unworkable in TCP's context. The > > callout has to go away even if it is already pending and waiting on > > the lock. Maybe that can only be solved by a flag in the lock saying > > "give up and go away". > > The reason you need to do a drain is to allow for safe destroying of the > lock. Specifically, drivers tend to do this: > > FOO_LOCK(sc); > ... > callout_stop(...); > FOO_UNLOCK(sc); > ... > callout_drain(...); > ... > mtx_destroy(&sc->foo_mtx); > > If you don't have the drain and softclock is trying to acquire the backing > mutex while you have it held (before the callout_stop) then Bad Things can > happen if you don't do the drain. 
Having the lock just "give up" doesn't
> work either because if the memory containing the lock is free'd and
> reinitialized such that it looks enough like a valid lock then softclock
> (or its equivalent) will still try to obtain it. Also, you need to do a
> drain so it is safe to free the callout structure to prevent it from being
> recycled and having weird races where it gets recycled and rescheduled but
> the timer code thinks it has a pending stop for that pointer and so it
> aborts the wrong instance of the timer, etc.

Hi,

I completely agree with what John Baldwin is writing. You need two
stop-functions:

xxx_stop, which is non-blocking, and xxx_drain, which can block, i.e. sleep.

BTW: The USB code in P4 uses the same semantics, for the same reasons:

usbd_transfer_stop() and usbd_transfer_drain()

The only difference is that I pass an error code to the callback, which might
still be called after usbd_transfer_stop() has been called.

I think that xxx_stop() and xxx_drain() is a generic approach that should be
applied to all callback systems. Whenever you have a callback you need to be
able to stop it and drain it.
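
To make the stop/drain distinction concrete, here is a minimal userspace model using POSIX threads primitives. It is a sketch only and not the kernel callout code or the USB code mentioned above: struct sk_timer and every sk_* function are hypothetical names invented for this illustration. The stop call never sleeps and only cancels a timer that has not started; the drain call additionally waits until a callback that is already running has returned.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical timer object; not a kernel structure. */
struct sk_timer {
	pthread_mutex_t	mtx;	/* internal state lock */
	pthread_cond_t	idle;	/* signalled when the callback returns */
	int		pending;	/* armed but not yet started */
	int		running;	/* callback currently executing */
	void		(*fn)(void *);
	void		*arg;
};

void
sk_timer_init(struct sk_timer *t, void (*fn)(void *), void *arg)
{
	pthread_mutex_init(&t->mtx, NULL);
	pthread_cond_init(&t->idle, NULL);
	t->pending = 0;
	t->running = 0;
	t->fn = fn;
	t->arg = arg;
}

void
sk_timer_arm(struct sk_timer *t)
{
	pthread_mutex_lock(&t->mtx);
	t->pending = 1;
	pthread_mutex_unlock(&t->mtx);
}

/* Non-blocking: returns 1 if a pending timer was cancelled, else 0. */
int
sk_timer_stop(struct sk_timer *t)
{
	int was_pending;

	pthread_mutex_lock(&t->mtx);
	was_pending = t->pending;
	t->pending = 0;
	pthread_mutex_unlock(&t->mtx);
	return (was_pending);
}

/* May block: cancels, then waits for a running callback to finish. */
void
sk_timer_drain(struct sk_timer *t)
{
	pthread_mutex_lock(&t->mtx);
	t->pending = 0;
	while (t->running)
		pthread_cond_wait(&t->idle, &t->mtx);
	pthread_mutex_unlock(&t->mtx);
}

/* What a dispatcher thread would do when the timer expires. */
void
sk_timer_fire(struct sk_timer *t)
{
	pthread_mutex_lock(&t->mtx);
	if (!t->pending) {
		pthread_mutex_unlock(&t->mtx);
		return;		/* stopped before it could start */
	}
	t->pending = 0;
	t->running = 1;
	pthread_mutex_unlock(&t->mtx);

	t->fn(t->arg);		/* callback runs without the state lock */

	pthread_mutex_lock(&t->mtx);
	t->running = 0;
	pthread_cond_broadcast(&t->idle);
	pthread_mutex_unlock(&t->mtx);
}

/* Example callback: counts how often it ran. */
void
sk_count(void *arg)
{
	++*(int *)arg;
}
```

Only after sk_timer_drain() returns is it safe, in this model, to free the timer or the memory its callback touches; sk_timer_stop() alone gives no such guarantee when the callback has already started.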
--HPS

From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 10:30:15 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8DBCD16A41A; Fri, 28 Dec 2007 10:30:15 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id 4783813C4E3; Fri, 28 Dec 2007 10:30:15 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id E705746EB6; Fri, 28 Dec 2007 05:30:14 -0500 (EST) Date: Fri, 28 Dec 2007 10:30:14 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Hans Petter Selasky In-Reply-To: <200712281003.52062.hselasky@c2i.net> Message-ID: <20071228102544.J45653@fledge.watson.org> References: <18378.1196596684@critter.freebsd.dk> <4752AABE.6090006@freebsd.org> <200712271805.40972.jhb@freebsd.org> <200712281003.52062.hselasky@c2i.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN;
charset=US-ASCII; format=flowed Cc: Andre Oppermann , Attilio Rao , arch@freebsd.org, Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: New "timeout" api, to replace callout X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 10:30:15 -0000 On Fri, 28 Dec 2007, Hans Petter Selasky wrote: >> The reason you need to do a drain is to allow for safe destroying of the >> lock. Specifically, drivers tend to do this: >> >> FOO_LOCK(sc); >> ... >> callout_stop(...); >> FOO_UNLOCK(sc); >> ... >> callout_drain(...); >> ... >> mtx_destroy(&sc->foo_mtx); >> >> If you don't have the drain and softclock is trying to acquire the backing >> mutex while you have it held (before the callout_stop) then Bad Things can >> happen if you don't do the drain. Having the lock just "give up" doesn't >> work either because if the memory containing the lock is free'd and >> reinitialized such that it looks enough like a valid lock then softclock >> (or its equivalent) will still try to obtain it. Also, you need to do a >> drain so it is safe to free the callout structure to prevent it from being >> recycled and having weird races where it gets recycled and rescheduled but >> the timer code thinks it has a pending stop for that pointer and so it >> aborts the wrong instance of the timer, etc. > > I completely agree to what John Baldwin is writing. You need two > stop-functions: > > xxx_stop which is non-blocking and xxx_drain which can block i.e. sleep > > BTW: The USB code in P4 uses the same semantics, due to the same reasons: > > usbd_transfer_stop() and usbd_transfer_drain() > > The only difference is that I pass an error code to the callback which might > happen after that usbd_transfer_stop is called. 
>
> I think that xxx_stop() and xxx_drain() is a generic approach that should be
> applied to all callback systems. Whenever you have a callback you need to be
> able to stop it and drain it.

I think the argument that Poul-Henning is making is not that you don't need
something that behaves "like drain", but rather that we'd like the wait for
drain to be a short, mutex-length wait rather than a long, msleep-length wait.
Remember that the bodies of callouts are expected to run in a very short
period of time in order to not stall the timer system; in fact, in such a way
that a mutex could be held over the entire timeout call. Given that this is
the case, one might reasonably expect callout_stop() to perform the drain
rather than having a separate call.

Such a model would be very advantageous in TCP, where rather than having to
defer GC'ing the inpcb/tcpcb to a GC worker thread, which we don't do but
should, we could use the stop call safely and eliminate a whole class of races
from the stack.
Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 11:19:16 2007 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D8C8416A469; Fri, 28 Dec 2007 11:19:16 +0000 (UTC) (envelope-from kris@FreeBSD.org) Received: from weak.local (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 240AE13C509; Fri, 28 Dec 2007 11:19:15 +0000 (UTC) (envelope-from kris@FreeBSD.org) Message-ID: <4774DBB2.5060707@FreeBSD.org> Date: Fri, 28 Dec 2007 12:19:14 +0100 From: Kris Kennaway User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031) MIME-Version: 1.0 To: John Baldwin References: <200712271704.44796.jhb@FreeBSD.org> In-Reply-To: <200712271704.44796.jhb@FreeBSD.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org Subject: Re: kernel features MIB X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 11:19:16 -0000

John Baldwin wrote:
> One of the things we have at work is a kern.features sysctl MIB that contains
> nodes to indicate if a named feature is present. For example, on i386 we
> have kern.features.pae and we auto enable -DPAE for kernel modules if the
> currently running kernel is using PAE using that sysctl.
>
> One of the patches I want to commit soon is support for handling
> shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the
> long-heralded memfd stuff).
I would like to have the sysctl MIB so that > libc's for older releases (e.g. libc.so.6) could use the syscalls if they are > available so that shm segments are shared between compat apps (e.g. 4.x or > 6.x) and up-to-date apps. > > At work we don't have a pretty API for this at all, but I'm thinking for > FreeBSD we can do this: > > FEATURE(foo, "description of foo") > > which is a macro to create the 'kern.features.foo' node and set it to 1. Then > we could have a routine in libc: > > int feature_present(const char *name); > > That returns a boolean to indicate if a given feature is present or not by > invoking sysctlbyname(3), etc. > > Any objections to the idea? > I have wanted something like this for a long time. In ports land they often need to know this kind of thing, e.g. is compat4x support enabled in the kernel, etc. Kris From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 14:56:35 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5E3B516A46E; Fri, 28 Dec 2007 14:56:35 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from outbound0.mx.meer.net (outbound0.mx.meer.net [209.157.153.23]) by mx1.freebsd.org (Postfix) with ESMTP id 21F6E13C4DD; Fri, 28 Dec 2007 14:56:34 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from mail.meer.net (mail.meer.net [209.157.152.14]) by outbound0.sv.meer.net (8.12.10/8.12.6) with ESMTP id lBSDnHih047757; Fri, 28 Dec 2007 05:49:17 -0800 (PST) (envelope-from gnn@neville-neil.com) Received: from minion.local.neville-neil.com (61.204.211.246.customerlink.pwd.ne.jp [61.204.211.246]) by mail.meer.net (8.13.3/8.13.3/meer) with ESMTP id lBSDnGBQ048390; Fri, 28 Dec 2007 05:49:16 -0800 (PST) (envelope-from gnn@neville-neil.com) Date: Fri, 28 Dec 2007 22:49:15 +0900 Message-ID: From: gnn@freebsd.org To: Julian Elischer In-Reply-To: <4772F123.5030303@elischer.org> References: <4772F123.5030303@elischer.org> 
User-Agent: Wanderlust/2.15.5 (Almost Unreal) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.7 Emacs/22.1.50 (i386-apple-darwin8.10.1) MULE/5.0 (SAKAKI) MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka") Content-Type: text/plain; charset=US-ASCII Cc: FreeBSD Net , Robert Watson , Qing Li , arch@freebsd.org Subject: Re: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 14:56:35 -0000

At Wed, 26 Dec 2007 16:26:11 -0800, julian wrote:
>
> Resending as my mailer made a dog's breakfast of the first one
> with all sorts of weird line breaks... hopefully this will be better.
> (I haven't sent it yet so I'm hoping)..
>
>
> -------------------------------------------
>
>
>
> One thing where FreeBSD has been falling behind, and which by chance
> I have some time to work on is "policy based routing", which allows
> different packet streams to be routed by more than just the
> destination address.
>
> Constraints:
> ------------
>
> I want to make some form of this available in the 6.x tree
> (and by extension 7.x), but FreeBSD in general needs it so I might as
> well do it in -current and back port the portions I need.
>
> One of the ways that this can be done is to have the ability to
> instantiate multiple kernel routing tables (which I will now
> refer to as "Forwarding Information Bases" or "FIBs" for political
> correctness reasons). Which FIB a particular packet uses to make
> the next hop decision can be decided by a number of mechanisms.
> The policies these mechanisms implement are the "Policies" referred
> to in "Policy based routing".
>
> One of the constraints I have if I try to back port this work to
> 6.x is that it must be implemented as an EXTENSION to the existing
> ABIs in 6.x so that third party applications do not need to be
> recompiled in the timespan of the branch.
>
> Implementation method, (part 1)
> -------------------------------
> For this reason I have implemented a "sufficient subset" of a
> multiple routing table solution in Perforce, and back-ported it
> to 6.x (also in Perforce, though not yet caught up with what I
> have done in -current/P4). The subset allows a number of FIBs
> to be defined at compile time (sufficient for my purposes in 6.x) and
> implements the changes needed to allow IPV4 to use them. I have not done
> the changes for ipv6 simply because I do not need it, and I do not
> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
>
> Other protocol families are left untouched and should there be
> users with proprietary protocol families, they should continue to work
> and be oblivious to the existence of the extra FIBs.
>
> To understand how this is done, one must know that the current FIB
> code starts everything off with a single dimensional array of
> pointers to FIB head structures (one per protocol family), each of
> which in turn points to the trie of routes available to that family.
>
> The basic change in the ABI compatible version of the change is to
> extend that array to be a 2 dimensional array, so that
> instead of protocol family X looking at rt_tables[X] for the
> table it needs, it looks at rt_tables[Y][X], where for all
> protocol families except ipv4 Y is always 0.
> Code that is unaware of the change always just sees the first row
> of the table, which of course looks just like the one dimensional
> array that existed before.
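
The two-dimensional table described in the quoted paragraph can be pictured with the following sketch. It is an illustration under stated assumptions rather than the actual patch: the sk_ prefix, the table dimensions and the family index standing in for IPv4 are all hypothetical, chosen so the example is self-contained. Rows are FIB numbers, columns are protocol families, and only IPv4 may select a row other than 0.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NUM_FIBS	4	/* hypothetical compile-time FIB count */
#define SK_AF_MAX	8	/* hypothetical number of protocol families */
#define SK_AF_INET	2	/* stand-in index for the IPv4 family */

/* Stand-in for a per-family FIB head structure. */
struct sk_rib_head {
	int	fib;
	int	family;
};

/* Row = FIB number (Y), column = protocol family (X). */
struct sk_rib_head *sk_rt_tables[SK_NUM_FIBS][SK_AF_MAX];

/* FIB-aware lookup: non-IPv4 families are always forced to row 0. */
struct sk_rib_head *
sk_table_for(int fibnum, int family)
{
	if (family < 0 || family >= SK_AF_MAX)
		return (NULL);
	if (family != SK_AF_INET)
		fibnum = 0;
	if (fibnum < 0 || fibnum >= SK_NUM_FIBS)
		return (NULL);
	return (sk_rt_tables[fibnum][family]);
}

/*
 * Legacy view: code unaware of the change indexes row 0 by family,
 * which looks exactly like the old one-dimensional array.
 */
struct sk_rib_head **
sk_legacy_tables(void)
{
	return (sk_rt_tables[0]);
}
```

Because row 0 is laid out first, sk_legacy_tables()[X] and the old rt_tables[X] style of access see the same entries, which is the property the ABI-compatible approach relies on.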
> > > The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() > are all maintained, but refer only to the first row of the array, > so that existing callers in proprietary protocols can continue to > do the "right thing". > Some new entry points are added, for the exclusive use of ipv4 code > called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), > which have an extra argument which refers the code to the correct row. > > In addition, there are some new entry points (currently called > dom_rtalloc() and friends) that check the Address family being > looked up and call either rtalloc() (and friends) if the protocol > is not IPv4 forcing the action to row 0 or to the appropriate row > if it IS IPv4 (and that info is available). These are for calling > from code that is not specific to any particular protocol. The way > these are implemented would change in the non ABI preserving code > to be added later. > > One feature of the first version of the code is that for ipv4, > the interface routes show up automatically on all the FIBs, so > that no matter what FIB you select you always have the basic > direct attached hosts available to you. (rtinit() does this > automatically). > You CAN delete an interface route from one FIB should you want > to but by default it's there. ARP information is also available > in each FIB. It's assumed that the same machine would have the > same MAC address, regardless of which FIB you are using to get > to it. > > > This brings us as to how the correct FIB is selected for an outgoing > IPV4 packet. > > Packets fall into one of a number of classes. > 1/ locally generated packets, coming from a socket/PCB. > Such packets select a FIB from a number associated with the > socket/PCB. This in turn is inherited from the process, > but can be changed by a socket option. The process in turn > inherits it on fork. I have written a utility call setfib > that acts a bit like nice.. 
> > setfib -n 3 ping target.example.com # will use fib 3 for ping. > > 2/ packets received on an interface for forwarding. > By default these packets would use table 0, > (or possibly a number settable in a sysctl(not yet)). > but prior to routing the firewall can inspect them (see below). > > 3/ packets inspected by a packet classifier, which can arbitrarily > associate a fib with it on a packet by packet basis. > A fib assigned to a packet by a packet classifier > (such as ipfw) would over-ride a fib associated by > a more default source. (such as cases 1 or 2). > > Routing messages would be associated with their > process, and thus select one FIB or another. > > In addition Netstat has been edited to be able to cope with the > fact that the array is now 2 dimensional. (It looks in system > memory using libkvm (!)). > > In addition two sysctls are added to give: > a) the number of FIBs compiled in (active) > b) the default FIB of the calling process. > > Early testing experience: > ------------------------- > > Basically our (IronPort's) appliance does this functionality already > using ipfw fwd but that method has some drawbacks. > > For example, > It can't fully simulate a routing table because it can't influence the > socket's choice of local address when a connect() is done. > > > Testing during the generating of these changes has been > remarkably smooth so far. Multiple tables have co-existed > with no notable side effects, and packets have been routes > accordingly. > > I have not yet added the changes to ipfw. > pf has some similar changes already but they seem to rely on > the various FIBs having symbolic names. Which I do not plan to support > in the first version of these changes. > > SCTP has interestingly enough built in support for this, called VRFs > in Cisco parlance. it will be interesting to see how that handles it > when it suddenly actually does something. 
> > I have not redone my testing since my last edits, but will be > retesting with the current code asap. > > > Where to next: > -------------------- > > After committing the ABI compatible version and MFCing it, I'd > like to proceed in a forward direction in -current. this will > result in some roto-tilling in the routing code. > > Firstly: the current code's idea of having a separate tree per > protocol family, all of the same format, and pointed to by the > 1 dimensional array is a bit silly. Especially when one considers that > there > is code that makes assumptions about every protocol having the same > internal structures there. Some protocols don't WANT that > sort of structure. (for example the whole idea of a netmask is foreign > to appletalk). This needs to be made opaque to the external code. > > My suggested first change is to add routing method pointers to the > 'domain' structure, along with information pointing the data. > instead of having an array of pointers to uniform structures, > there would be an array pointing to the 'domain' structures > for each protocol address domain (protocol family), > and the methods this reached would be called. The methods would have > an argument that gives FIB number, but the protocol would be free > to ignore it. > > Interaction with the ARP layer/ LL layer would need to be > revisited as well. Qing Li has been working on this already. > > > diffs > for those with p4 access: > p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121 > //depot/user/julian/routing/src/sys/... > > for those with the makediff perl script: > perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 > //depot/user/julian/routing/src/sys/... > > for those with neither: > > http://people.freebsd.org/~julian/mrt2.diff > > I just put the userland utility in usr.sbin/setfib/ in p4. 
> and changes to netstat in usr.bin/netstat/ > > see: > http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO > > > > > I'd like to get comments on this (compat) version, so that I can > commit it, get general testing under way to start the clock for MFC, > and then get moving on the fuller implementation (that breaks ABIs) > and other routing issues. > How does this work with Marko Zec's virtual stack system? Best, George From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 15:15:05 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 50FC216A419 for ; Fri, 28 Dec 2007 15:15:05 +0000 (UTC) (envelope-from ivo.vachkov@gmail.com) Received: from wx-out-0506.google.com (wx-out-0506.google.com [66.249.82.236]) by mx1.freebsd.org (Postfix) with ESMTP id BAD5913C44B for ; Fri, 28 Dec 2007 15:15:04 +0000 (UTC) (envelope-from ivo.vachkov@gmail.com) Received: by wx-out-0506.google.com with SMTP id i29so990613wxd.7 for ; Fri, 28 Dec 2007 07:15:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=Xgce59oOh55fxeTG2J6YoIG8nu/AgUUjXBJ/AWMokiQ=; b=fzgYaAoedMzuUszJkE+GNfe/ea4DqL7/0IEoBi5AP0yY3yaII2mMzfLKQU5nYMg36cPsoN4mmbrrPfFAL24SH2o0PfDd4ZRdjLR9Ro1YHMlLm4ooQQKe8sBU9shqfzVISxuZ7QkSZYosPB5cWncR+GKj1iONsVSiFTLHjvKB7Zw= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=h+hJojK1RXsDr/WZwnFxT9GJjnvQLycMaqF82K6zCN/ip4cM35z2JSdhkaVJnnZe1cB+7Xs5Ea5MwZVvyiB1RWwedWe8TTQbY8+wRUCNK9hqVY+i0poM+MKuM5KszSeKGeBgDOaUow4eVcuAiCPfiepyUaNhmAnFp+5+1jB5mu0= Received: by 10.150.197.8 with SMTP id 
u8mr2605151ybf.131.1198854903741; Fri, 28 Dec 2007 07:15:03 -0800 (PST) Received: by 10.150.219.5 with HTTP; Fri, 28 Dec 2007 07:15:03 -0800 (PST) Message-ID: Date: Fri, 28 Dec 2007 17:15:03 +0200 From: "Ivo Vachkov" To: "Julian Elischer" In-Reply-To: <477416CC.4090906@elischer.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <4772F123.5030303@elischer.org> <477416CC.4090906@elischer.org> Cc: FreeBSD Net , Robert Watson , Qing Li , arch@freebsd.org Subject: Re: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 15:15:05 -0000 On Dec 27, 2007 11:19 PM, Julian Elischer wrote: > > Ivo Vachkov wrote: > > On Dec 27, 2007 2:26 AM, Julian Elischer wrote: > >> Resending as my mailer made a dog's breakfast of the first one > >> with all sorts of wierd line breaks... hopefully this will be better. > >> (I haven't sent it yet so I'm hoping).. > >> > >> > >> ------------------------------------------- > >> > >> > >> > >> On thing where FreeBSD has been falling behind, and which by chance I > >> have some time to work on is "policy based routing", which allows > >> different > >> packet streams to be routed by more than just the destination address. > >> > >> Constraints: > >> ------------ > >> > >> I want to make some form of this available in the 6.x tree > >> (and by extension 7.x) , but FreeBSD in general needs it so I might as > >> well > >> do it in -current and back port the portions I need. > >> > >> One of the ways that this can be done is to have the ability to > >> instantiate multiple kernel routing tables (which I will now > >> refer to as "Forwarding Information Bases" or "FIBs" for political > >> correctness reasons. 
Which FIB a particular packet uses to make
> >> the next hop decision can be decided by a number of mechanisms.
> >> The policies these mechanisms implement are the "Policies" referred
> >> to in "Policy based routing".
> >>
> >> One of the constraints I have if I try to back port this work to
> >> 6.x is that it must be implemented as an EXTENSION to the existing
> >> ABIs in 6.x so that third party applications do not need to be
> >> recompiled in the timespan of the branch.
> >>
> >> Implementation method, (part 1)
> >> -------------------------------
> >> For this reason I have implemented a "sufficient subset" of a
> >> multiple routing table solution in Perforce, and back-ported it
> >> to 6.x. (also in Perforce though not yet caught up with what I
> >> have done in -current/P4). The subset allows a number of FIBs
> >> to be defined at compile time (sufficient for my purposes in 6.x) and
> >> implements the changes needed to allow IPV4 to use them. I have not done
> >> the changes for ipv6 simply because I do not need it, and I do not
> >> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
>
> By the way, I might add that in the 6.x compat. version I may end up
> limiting the feature to 8 tables. This is because I need to store some
> stuff in an efficient way in the mbuf, and in a compatible manner this
> is easiest done by stealing the top 4 bits in the mbuf flags word
> and defining them as:
>
> #define M_HAVEFIB 0x10000000
> #define M_FIBMASK 0x07
> #define M_FIBNUM 0xe0000000
> #define M_FIBSHIFT 29
> #define m_getfib(_m, _default) (((_m)->m_flags & M_HAVEFIB) ? \
>     (((_m)->m_flags >> M_FIBSHIFT) & M_FIBMASK) : (_default))
> #define M_SETFIB(_m, _fib) do { \
>     (_m)->m_flags &= ~M_FIBNUM; \
>     (_m)->m_flags |= (M_HAVEFIB|(((_fib) & M_FIBMASK) << M_FIBSHIFT)); \
> } while (0)
>
> This then becomes very easy to change to use a tag or
> whatever is needed in later versions, and the number can
> be expanded past 8 predefined FIBs at that time.
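The flag layout proposed above (one "have FIB" bit plus three bits of FIB number in the top nibble of the flags word) can be tried out in userland. What follows is only a minimal sketch under stated assumptions, not the code from the Perforce branch: struct fake_mbuf, fib_get() and fib_set() are hypothetical stand-ins for the real mbuf and for the m_getfib()/M_SETFIB() macros, and the flags word is made unsigned here so that the shifts are well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mbuf; only the flags word matters here. */
struct fake_mbuf {
	uint32_t m_flags;
};

#define M_HAVEFIB  0x10000000u	/* a FIB number has been stored */
#define M_FIBNUM   0xe0000000u	/* the three bits holding the FIB number */
#define M_FIBMASK  0x07u
#define M_FIBSHIFT 29

/* Return the stored FIB number, or def if none was ever stored. */
static unsigned
fib_get(const struct fake_mbuf *m, unsigned def)
{
	if ((m->m_flags & M_HAVEFIB) == 0)
		return (def);
	return ((m->m_flags >> M_FIBSHIFT) & M_FIBMASK);
}

/* Store a FIB number (0..7) without disturbing the low 28 flag bits. */
static void
fib_set(struct fake_mbuf *m, unsigned fib)
{
	assert(fib <= M_FIBMASK);
	m->m_flags &= ~M_FIBNUM;
	m->m_flags |= M_HAVEFIB | ((fib & M_FIBMASK) << M_FIBSHIFT);
}
```

The separate M_HAVEFIB bit is what lets a packet explicitly tagged with FIB 0 be told apart from a packet that was never tagged and should fall back to the caller's default.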
> > >> > >> Other protocol families are left untouched and should there be > >> users with proprietary protocol families, they should continue to work > >> and be oblivious to the existence of the extra FIBs. > >> > >> To understand how this is done, one must know that the current FIB > >> code starts everything off with a single dimensional array of > >> pointers to FIB head structures (One per protocol family), each of > >> which in turn points to the trie of routes available to that family. > >> > >> The basic change in the ABI compatible version of the change is to > >> extent that array to be a 2 dimensional array, so that > >> instead of protocol family X looking at rt_tables[X] for the > >> table it needs, it looks at rt_tables[Y][X] when for all > >> protocol families except ipv4 Y is always 0. > >> Code that is unaware of the change always just sees the first row > >> of the table, which of course looks just like the one dimensional > >> array that existed before. > > > > Pretty much like the OpenBSD approach :) > > well, I did look at the code briefly, but I didn't base it on it.. > > > > > >> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() > >> are all maintained, but refer only to the first row of the array, > >> so that existing callers in proprietary protocols can continue to > >> do the "right thing". > >> Some new entry points are added, for the exclusive use of ipv4 code > >> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), > >> which have an extra argument which refers the code to the correct row. > >> > >> In addition, there are some new entry points (currently called > >> dom_rtalloc() and friends) that check the Address family being > >> looked up and call either rtalloc() (and friends) if the protocol > >> is not IPv4 forcing the action to row 0 or to the appropriate row > >> if it IS IPv4 (and that info is available). 
These are for calling > >> from code that is not specific to any particular protocol. The way > >> these are implemented would change in the non ABI preserving code > >> to be added later. > >> > >> One feature of the first version of the code is that for ipv4, > >> the interface routes show up automatically on all the FIBs, so > >> that no matter what FIB you select you always have the basic > >> direct attached hosts available to you. (rtinit() does this > >> automatically). > >> You CAN delete an interface route from one FIB should you want > >> to but by default it's there. ARP information is also available > >> in each FIB. It's assumed that the same machine would have the > >> same MAC address, regardless of which FIB you are using to get > >> to it. > >> > >> > >> This brings us as to how the correct FIB is selected for an outgoing > >> IPV4 packet. > >> > >> Packets fall into one of a number of classes. > >> 1/ locally generated packets, coming from a socket/PCB. > >> Such packets select a FIB from a number associated with the > >> socket/PCB. This in turn is inherited from the process, > >> but can be changed by a socket option. The process in turn > >> inherits it on fork. I have written a utility call setfib > >> that acts a bit like nice.. > >> > >> setfib -n 3 ping target.example.com # will use fib 3 for ping. > >> > >> 2/ packets received on an interface for forwarding. > >> By default these packets would use table 0, > >> (or possibly a number settable in a sysctl(not yet)). > >> but prior to routing the firewall can inspect them (see below). > >> > >> 3/ packets inspected by a packet classifier, which can arbitrarily > >> associate a fib with it on a packet by packet basis. > >> A fib assigned to a packet by a packet classifier > >> (such as ipfw) would over-ride a fib associated by > >> a more default source. (such as cases 1 or 2). 
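The precedence among the three packet classes listed above could be sketched roughly as follows. This is an illustration only: struct fib_sources, NO_FIB and select_fib() are hypothetical names that do not appear in the posted diff, and the real code may well carry this information differently (for example in the mbuf and the PCB).

```c
#include <assert.h>

#define NO_FIB (-1)	/* hypothetical marker: this source did not pick a FIB */

/* Hypothetical summary of where a FIB number may come from for one packet. */
struct fib_sources {
	int classifier_fib;	/* case 3: set by a packet classifier, or NO_FIB */
	int socket_fib;		/* case 1: taken from the socket/PCB, or NO_FIB */
	int forward_default;	/* case 2: default for forwarded packets (0 today) */
};

/* Pick the FIB for one outgoing IPv4 packet; nfibs is the compiled-in count. */
static int
select_fib(const struct fib_sources *s, int nfibs)
{
	int fib;

	if (s->classifier_fib != NO_FIB)
		fib = s->classifier_fib;	/* classifier overrides the rest */
	else if (s->socket_fib != NO_FIB)
		fib = s->socket_fib;		/* locally generated packet */
	else
		fib = s->forward_default;	/* received for forwarding */
	assert(fib >= 0 && fib < nfibs);
	return (fib);
}
```

With this ordering, a process started as "setfib -n 3 ..." sends through FIB 3 unless a classifier rule has tagged the individual packet with something else.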
> > > > For the 2/ and 3/ cases I added (in a personal work i've been doing > > lately) additional field in struct mbuf which can be set by a packet > > filter or other application upon receiving which points the right > > table to use for the lookup. This way a simple "marking" can be used > > to divide different flows and create policy based routing. > > This would be the final way but I want to really minimise problems > in the compat versions, so I'll avoid doing that for now. > > Do you have this work available? I have it. However, I'll break a NDA if I 'open' it. > And have you looked at mi diffs below? I plan to look at your code asap. > > > > >> Routing messages would be associated with their > >> process, and thus select one FIB or another. > >> > >> In addition Netstat has been edited to be able to cope with the > >> fact that the array is now 2 dimensional. (It looks in system > >> memory using libkvm (!)). > >> > >> In addition two sysctls are added to give: > >> a) the number of FIBs compiled in (active) > >> b) the default FIB of the calling process. > >> > >> Early testing experience: > >> ------------------------- > >> > >> Basically our (IronPort's) appliance does this functionality already > >> using ipfw fwd but that method has some drawbacks. > >> > >> For example, > >> It can't fully simulate a routing table because it can't influence the > >> socket's choice of local address when a connect() is done. > >> > >> > >> Testing during the generating of these changes has been > >> remarkably smooth so far. Multiple tables have co-existed > >> with no notable side effects, and packets have been routes > >> accordingly. > >> > >> I have not yet added the changes to ipfw. > >> pf has some similar changes already but they seem to rely on > >> the various FIBs having symbolic names. Which I do not plan to support > >> in the first version of these changes. 
> >> > >> SCTP has interestingly enough built in support for this, called VRFs > >> in Cisco parlance. it will be interesting to see how that handles it > >> when it suddenly actually does something. > >> > >> I have not redone my testing since my last edits, but will be > >> retesting with the current code asap. > >> > >> > >> Where to next: > >> -------------------- > >> > >> After committing the ABI compatible version and MFCing it, I'd > >> like to proceed in a forward direction in -current. this will > >> result in some roto-tilling in the routing code. > >> > >> Firstly: the current code's idea of having a separate tree per > >> protocol family, all of the same format, and pointed to by the > >> 1 dimensional array is a bit silly. Especially when one considers that > >> there > >> is code that makes assumptions about every protocol having the same > >> internal structures there. Some protocols don't WANT that > >> sort of structure. (for example the whole idea of a netmask is foreign > >> to appletalk). This needs to be made opaque to the external code. > >> > >> My suggested first change is to add routing method pointers to the > >> 'domain' structure, along with information pointing the data. > >> instead of having an array of pointers to uniform structures, > >> there would be an array pointing to the 'domain' structures > >> for each protocol address domain (protocol family), > >> and the methods this reached would be called. The methods would have > >> an argument that gives FIB number, but the protocol would be free > >> to ignore it. > >> > >> Interaction with the ARP layer/ LL layer would need to be > >> revisited as well. Qing Li has been working on this already. > >> > >> > >> diffs > >> for those with p4 access: > >> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121 > >> //depot/user/julian/routing/src/sys/... 
> >> > >> for those with the makediff perl script: > >> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 > >> //depot/user/julian/routing/src/sys/... > >> > >> for those with neither: > >> > >> http://people.freebsd.org/~julian/mrt2.diff > >> > >> I just put the userland utility in usr.sbin/setfib/ in p4. > >> and changes to netstat in usr.bin/netstat/ > >> > >> see: > >> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO > >> > >> > >> > >> > >> I'd like to get comments on this (compat) version, so that I can > >> commit it, > >> get general testing under way to start the clock for MFC, and then get > >> moving on the fuller implementation (that breaks ABIs) and other > >> routing issues. > >> > >> > >> Julian > >> > >> > >> > >> > >> _______________________________________________ > >> freebsd-arch@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-arch > >> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > >> > > -- "UNIX is basically a simple operating system, but you have to be a genius to understand the simplicity." 
Dennis Ritchie From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 17:17:04 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6523E16A420 for ; Fri, 28 Dec 2007 17:17:04 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outD.internet-mail-service.net (outD.internet-mail-service.net [216.240.47.227]) by mx1.freebsd.org (Postfix) with ESMTP id 2F74C13C45A for ; Fri, 28 Dec 2007 17:17:04 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP; Fri, 28 Dec 2007 09:17:03 -0800 Received: from julian-mac.elischer.org (localhost [127.0.0.1]) by idiom.com (Postfix) with ESMTP id 01D3E126DA3; Fri, 28 Dec 2007 09:17:02 -0800 (PST) Message-ID: <47752F98.6050209@elischer.org> Date: Fri, 28 Dec 2007 09:17:12 -0800 From: Julian Elischer User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031) MIME-Version: 1.0 To: gnn@freebsd.org References: <4772F123.5030303@elischer.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: FreeBSD Net , Robert Watson , Qing Li , arch@freebsd.org Subject: Re: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 17:17:04 -0000 gnn@freebsd.org wrote: > At Wed, 26 Dec 2007 16:26:11 -0800, > julian wrote: [...] > > How does this work with Marko Zec's virtual stack system? 
> > Best, > George orthogonal From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 17:30:13 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CBAE816A46E for ; Fri, 28 Dec 2007 17:30:13 +0000 (UTC) (envelope-from SRS0=mkQ4=RT=tm.uka.de=max.laier@srs.kundenserver.de) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.174]) by mx1.freebsd.org (Postfix) with ESMTP id 44BF713C46E for ; Fri, 28 Dec 2007 17:30:13 +0000 (UTC) (envelope-from SRS0=mkQ4=RT=tm.uka.de=max.laier@srs.kundenserver.de) Received: from vampire.homelinux.org (dslb-088-066-001-237.pools.arcor-ip.net [88.66.1.237]) by mrelayeu.kundenserver.de (node=mrelayeu7) with ESMTP (Nemesis) id 0ML2xA-1J8Iq03Cp7-0005oJ; Fri, 28 Dec 2007 18:17:37 +0100 Received: (qmail 25901 invoked by uid 80); 28 Dec 2007 17:17:00 -0000 Received: from 2001:6f8:12c8:1:21d:60ff:fe0c:1771 (SquirrelMail authenticated user mlaier) by router.laiers.local with HTTP; Fri, 28 Dec 2007 18:17:00 +0100 (CET) Message-ID: <43684.2001:6f8:12c8:1:21d:60ff:fe0c:1771.1198862220.squirrel@router.laiers.local> In-Reply-To: <200712271704.44796.jhb@FreeBSD.org> References: <200712271704.44796.jhb@FreeBSD.org> Date: Fri, 28 Dec 2007 18:17:00 +0100 (CET) From: "Max Laier" To: "John Baldwin" User-Agent: SquirrelMail/1.4.13 MIME-Version: 1.0 Content-Type: text/plain;charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Priority: 3 (Normal) Importance: Normal X-Provags-ID: V01U2FsdGVkX1+8Z4peOOieo9iKYjZ2mV9zj0elWboszuGnoKX tmQ0f9HGEdFuGDTbwm7VR75LJHy5UNzbFR3jIpLyOqduCKuqKe 2iZQzY/TRSwxHB789Q1Sg== Cc: arch@freebsd.org Subject: Re: kernel features MIB X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 17:30:13 -0000 Am Do, 27.12.2007, 
23:04, schrieb John Baldwin: > One of the things we have at work is a kern.features sysctl MIB that > contains > nodes to indicate if a named feature is present. For example, on i386 we > have kern.features.pae and we auto enable -DPAE for kernel modules if the > currently running kernel is using PAE using that sysctl. > > One of the patches I want to commit soon is support for handling > shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the > long-heralded memfd stuff). I would like to have the sysctl MIB so that > libc's for older releases (e.g. libc.so.6) could use the syscalls if they > are > available so that shm segments are shared between compat apps (e.g. 4.x or > 6.x) and up-to-date apps. > > At work we don't have a pretty API for this at all, but I'm thinking for > FreeBSD we can do this: > > FEATURE(foo, "description of foo") > > which is a macro to create the 'kern.features.foo' node and set it to 1. > Then > we could have a routine in libc: > > int feature_present(const char *name); > > That returns a boolean to indicate if a given feature is present or not by > invoking sysctlbyname(3), etc. > > Any objections to the idea? Sounds like a good idea indeed. What about modules, though? Would it make sense to have something ident/strings parseable in the .kld to identify features provided by that module? feature_present (or _available) could search the default module paths and return which module needs to be loaded. This could depend on FEATURE(kld, ...) and maybe kern.securelevel. 
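For reference, the libc side of the proposal could look roughly like the sketch below. It assumes each kern.features.<name> node is a plain integer, as described above. The names feature_present_sketch() and read_int_node() are made up for illustration, and the non-FreeBSD branch is only a stub table (with a hypothetical "demo" feature) so that the example compiles and runs on other systems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/*
 * Read an integer node by name.  On FreeBSD this is sysctlbyname(3);
 * elsewhere a tiny hypothetical table stands in for the kernel.
 */
static int
read_int_node(const char *mib, int *value)
{
#ifdef __FreeBSD__
	size_t len = sizeof(*value);

	return (sysctlbyname(mib, value, &len, NULL, 0));
#else
	if (strcmp(mib, "kern.features.demo") == 0) {
		*value = 1;
		return (0);
	}
	return (-1);
#endif
}

/* Return 1 if kern.features.<name> exists and is non-zero, else 0. */
static int
feature_present_sketch(const char *name)
{
	char mib[128];
	int value = 0;
	int n;

	assert(name != NULL);
	n = snprintf(mib, sizeof(mib), "kern.features.%s", name);
	if (n < 0 || (size_t)n >= sizeof(mib))
		return (0);		/* name too long to be a real node */
	if (read_int_node(mib, &value) != 0)
		return (0);		/* node missing: feature absent */
	return (value != 0);
}
```

Treating a missing node as "not present" is what would let an older libc probe for the new syscalls on a kernel that predates the MIB.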
-- /"\ Best regards, | mlaier@freebsd.org \ / Max Laier | ICQ #67774661 X http://pf4freebsd.love2party.net/ | mlaier@EFnet / \ ASCII Ribbon Campaign | Against HTML Mail and News From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 19:39:51 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 90EEE16A418; Fri, 28 Dec 2007 19:39:51 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id D785613C458; Fri, 28 Dec 2007 19:39:50 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226422892-1834499 for multiple; Fri, 28 Dec 2007 14:41:57 -0500 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBSJdX4B064130; Fri, 28 Dec 2007 14:39:35 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: Robert Watson Date: Fri, 28 Dec 2007 12:25:13 -0500 User-Agent: KMail/1.9.6 References: <18378.1196596684@critter.freebsd.dk> <200712281003.52062.hselasky@c2i.net> <20071228102544.J45653@fledge.watson.org> In-Reply-To: <20071228102544.J45653@fledge.watson.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712281225.14954.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Fri, 28 Dec 2007 14:39:36 -0500 (EST) X-Virus-Scanned: ClamAV 0.91.2/5278/Fri Dec 28 11:55:36 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: 
Andre Oppermann , Hans Petter Selasky , Attilio Rao , arch@freebsd.org, Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: New "timeout" api, to replace callout X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 19:39:51 -0000 On Friday 28 December 2007 05:30:14 am Robert Watson wrote: > > On Fri, 28 Dec 2007, Hans Petter Selasky wrote: > > >> The reason you need to do a drain is to allow for safe destroying of the > >> lock. Specifically, drivers tend to do this: > >> > >> FOO_LOCK(sc); > >> ... > >> callout_stop(...); > >> FOO_UNLOCK(sc); > >> ... > >> callout_drain(...); > >> ... > >> mtx_destroy(&sc->foo_mtx); > >> > >> If you don't have the drain and softclock is trying to acquire the backing > >> mutex while you have it held (before the callout_stop) then Bad Things can > >> happen if you don't do the drain. Having the lock just "give up" doesn't > >> work either because if the memory containing the lock is free'd and > >> reinitialized such that it looks enough like a valid lock then softclock > >> (or its equivalent) will still try to obtain it. Also, you need to do a > >> drain so it is safe to free the callout structure to prevent it from being > >> recycled and having weird races where it gets recycled and rescheduled but > >> the timer code thinks it has a pending stop for that pointer and so it > >> aborts the wrong instance of the timer, etc. > > > > I completely agree to what John Baldwin is writing. You need two > > stop-functions: > > > > xxx_stop which is non-blocking and xxx_drain which can block i.e. 
sleep
> >
> > BTW: The USB code in P4 uses the same semantics, for the same reasons:
> >
> > usbd_transfer_stop() and usbd_transfer_drain()
> >
> > The only difference is that I pass an error code to the callback which might
> > happen after usbd_transfer_stop is called.
> >
> > I think that xxx_stop() and xxx_drain() is a generic approach that should be
> > applied to all callback systems. Whenever you have a callback you need to be
> > able to stop it and drain it.
>
> I think the argument that Poul-Henning is making is not that you don't need
> something that behaves "like drain", but rather, we'd like the wait for drain
> to be a short, mutex-length wait rather than a long, msleep-length wait.
> Remember that the bodies of callouts are expected to run in a very short
> period of time in order to not stall the timer system, in fact, in such a way
> that a mutex could be held over the entire timeout call. Given that this is
> the case, one might reasonably expect callout_stop() to perform the drain
> rather than having a separate call. Such a model would be very advantageous
> in TCP, where rather than having to defer GC'ing the inpcb/tcpcb to a GC
> worker thread, which we don't do but should, we could use the stop call safely
> and eliminate a whole class of races from the stack.

The problem is that if softclock() (or a similar replacement in the future)
was preempted and hasn't run yet but has already gotten to the point that
callout_drain() has to block, then sitting in a spin loop waiting for
softclock() to acknowledge the stop isn't very optimal. The amount of time
you are asleep in this case is actually very small, and you probably won't
even block the vast majority of the time if you follow the
'lock / stop / unlock / drain' model. You only sleep when you lose the race
and softclock() has chosen to run your callout and is waiting for the
driver/client/whatever's lock.
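To make the 'lock / stop / unlock / drain' ordering concrete, here is a small userland model using pthreads. It is a sketch only and does not reflect the kernel's callout implementation: every model_* name is hypothetical, the "softclock" is an ordinary thread that handles a single callout, and the drain sleeps on a condition variable.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userland model of a callout; not the kernel's struct callout. */
struct model_callout {
	pthread_mutex_t	 mtx;		/* the timer subsystem's own lock */
	pthread_cond_t	 idle;		/* signalled when the handler returns */
	int		 pending;	/* scheduled but not yet started */
	int		 running;	/* handler is executing right now */
	void		(*func)(void *);
	void		*arg;
};

/* Non-blocking: cancel if still pending.  Returns 1 if it was cancelled. */
static int
model_stop(struct model_callout *c)
{
	int was_pending;

	pthread_mutex_lock(&c->mtx);
	was_pending = c->pending;
	c->pending = 0;
	pthread_mutex_unlock(&c->mtx);
	return (was_pending);
}

/* Blocking: cancel, then sleep until any running handler has returned. */
static void
model_drain(struct model_callout *c)
{
	pthread_mutex_lock(&c->mtx);
	c->pending = 0;
	while (c->running)
		pthread_cond_wait(&c->idle, &c->mtx);
	pthread_mutex_unlock(&c->mtx);
}

/* What the model's softclock thread does for one callout. */
static void *
model_softclock(void *v)
{
	struct model_callout *c = v;

	pthread_mutex_lock(&c->mtx);
	if (c->pending) {
		c->pending = 0;
		c->running = 1;
		pthread_mutex_unlock(&c->mtx);
		c->func(c->arg);	/* handler takes the driver's lock */
		pthread_mutex_lock(&c->mtx);
		c->running = 0;
		pthread_cond_broadcast(&c->idle);
	}
	pthread_mutex_unlock(&c->mtx);
	return (NULL);
}

struct model_softc {
	pthread_mutex_t		foo_mtx;	/* the driver's lock */
	int			ticks;
	struct model_callout	co;
};

/* Timer handler: needs the driver's lock, like a real callout would. */
static void
model_handler(void *v)
{
	struct model_softc *sc = v;

	pthread_mutex_lock(&sc->foo_mtx);
	sc->ticks++;
	pthread_mutex_unlock(&sc->foo_mtx);
}

/* Teardown in the 'lock / stop / unlock / drain' order. */
static void
model_detach(struct model_softc *sc)
{
	pthread_mutex_lock(&sc->foo_mtx);
	(void)model_stop(&sc->co);
	pthread_mutex_unlock(&sc->foo_mtx);
	model_drain(&sc->co);		/* handler can no longer touch foo_mtx */
	assert(sc->co.running == 0 && sc->co.pending == 0);
	pthread_mutex_destroy(&sc->foo_mtx);
}
```

In this model the drain only sleeps in the losing case: the handler was already dispatched and is blocked on foo_mtx while detach holds it. Once detach unlocks, the handler runs to completion and the drain returns, so destroying the mutex afterwards is safe.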
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 19:40:06 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9F3A116A418 for ; Fri, 28 Dec 2007 19:40:06 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 357F613C474 for ; Fri, 28 Dec 2007 19:40:06 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226422916-1834499 for multiple; Fri, 28 Dec 2007 14:42:04 -0500 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBSJdg7G064138; Fri, 28 Dec 2007 14:39:43 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: "Max Laier" Date: Fri, 28 Dec 2007 13:00:28 -0500 User-Agent: KMail/1.9.6 References: <200712271704.44796.jhb@FreeBSD.org> <43684.2001:6f8:12c8:1:21d:60ff:fe0c:1771.1198862220.squirrel@router.laiers.local> In-Reply-To: <43684.2001:6f8:12c8:1:21d:60ff:fe0c:1771.1198862220.squirrel@router.laiers.local> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712281300.28899.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Fri, 28 Dec 2007 14:39:43 -0500 (EST) X-Virus-Scanned: ClamAV 0.91.2/5278/Fri Dec 28 11:55:36 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: arch@freebsd.org Subject: Re: kernel features MIB X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 
Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 19:40:06 -0000 On Friday 28 December 2007 12:17:00 pm Max Laier wrote: > > Am Do, 27.12.2007, 23:04, schrieb John Baldwin: > > One of the things we have at work is a kern.features sysctl MIB that > > contains > > nodes to indicate if a named feature is present. For example, on i386 we > > have kern.features.pae and we auto enable -DPAE for kernel modules if the > > currently running kernel is using PAE using that sysctl. > > > > One of the patches I want to commit soon is support for handling > > shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the > > long-heralded memfd stuff). I would like to have the sysctl MIB so that > > libc's for older releases (e.g. libc.so.6) could use the syscalls if they > > are > > available so that shm segments are shared between compat apps (e.g. 4.x or > > 6.x) and up-to-date apps. > > > > At work we don't have a pretty API for this at all, but I'm thinking for > > FreeBSD we can do this: > > > > FEATURE(foo, "description of foo") > > > > which is a macro to create the 'kern.features.foo' node and set it to 1. > > Then > > we could have a routine in libc: > > > > int feature_present(const char *name); > > > > That returns a boolean to indicate if a given feature is present or not by > > invoking sysctlbyname(3), etc. > > > > Any objections to the idea? > > Sounds like a good idea indeed. What about modules, though? Would it > make sense to have something ident/strings parseable in the .kld to > identify features provided by that module? feature_present (or > _available) could search the default module paths and return which module > needs to be loaded. This could depend on FEATURE(kld, ...) and maybe > kern.securelevel. 
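The proposed feature_present() lookup can be sketched as follows. This is a
hedged illustration, not the committed libc code: feature_present_with(),
feature_lookup_fn and fake_lookup() are hypothetical names, and the lookup is
passed in as a function pointer so the sketch can run without a FreeBSD
kernel. The only thing taken from the proposal is the 'kern.features.<name>'
naming and the boolean result.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup hook: returns 0 and stores the node's value on success. */
typedef int (*feature_lookup_fn)(const char *mib, int *value);

/*
 * Sketch of the proposed feature_present(): build "kern.features.<name>"
 * and report whether the node exists and is non-zero.
 */
int
feature_present_with(feature_lookup_fn lookup, const char *name)
{
	char mib[128];
	int value = 0;
	int n;

	assert(lookup != NULL && name != NULL);
	n = snprintf(mib, sizeof(mib), "kern.features.%s", name);
	if (n < 0 || (size_t)n >= sizeof(mib))
		return (0);	/* name too long: treat the feature as absent */
	if (lookup(mib, &value) != 0)
		return (0);	/* node missing: feature absent */
	return (value != 0);
}

/* Fake lookup standing in for the real sysctl query; knows only about "pae". */
int
fake_lookup(const char *mib, int *value)
{
	if (strcmp(mib, "kern.features.pae") == 0) {
		*value = 1;
		return (0);
	}
	return (-1);
}
```

On FreeBSD the lookup hook would presumably be a thin wrapper around
sysctlbyname(3), reading an int from the named node; a missing node and a
node set to 0 are both reported as "not present" in this sketch.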
You could have a userland tool that parses the linker set for sysctls and uses
the name of the symbol to figure this out if that was desired. Modules already
have the MODULE_DEPEND stuff available that could be used, but I'm thinking
about things that aren't in modules.

--
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 20:01:50 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AAD2516A469; Fri, 28 Dec 2007 20:01:50 +0000 (UTC) (envelope-from zec@tel.fer.hr) Received: from xaqua.tel.fer.hr (xaqua.tel.fer.hr [161.53.19.25]) by mx1.freebsd.org (Postfix) with ESMTP id E7FEF13C45D; Fri, 28 Dec 2007 20:01:49 +0000 (UTC) (envelope-from zec@tel.fer.hr) Received: by xaqua.tel.fer.hr (Postfix, from userid 20006) id 9E5D99B742; Fri, 28 Dec 2007 20:42:43 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.1.7 (2006-10-05) on xaqua.tel.fer.hr X-Spam-Level: X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.7 Received: from [192.168.200.112] (zec2.tel.fer.hr [161.53.19.79]) by xaqua.tel.fer.hr (Postfix) with ESMTP id ABBC89B6C9; Fri, 28 Dec 2007 20:42:40 +0100 (CET) From: Marko Zec To: freebsd-arch@freebsd.org, FreeBSD Net Date: Fri, 28 Dec 2007 20:40:30 +0100 User-Agent: KMail/1.9.7 References: <4772F123.5030303@elischer.org> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712282040.30745.zec@tel.fer.hr> Cc: gnn@freebsd.org, Robert Watson , Julian Elischer , Qing Li Subject: Re: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 20:01:50 -0000

On
Friday 28 December 2007 14:49:15 gnn@freebsd.org wrote:
> At Wed, 26 Dec 2007 16:26:11 -0800,
>
> julian wrote:
> >
> > One thing where FreeBSD has been falling behind, and which by chance
> > I have some time to work on is "policy based routing", which allows
> > different packet streams to be routed by more than just the
> > destination address.
> >
> > Constraints:
> > ------------
> >
> > I want to make some form of this available in the 6.x tree
> > (and by extension 7.x), but FreeBSD in general needs it so I might
> > as well do it in -current and back port the portions I need.
> >
> > One of the ways that this can be done is to have the ability to
> > instantiate multiple kernel routing tables (which I will now
> > refer to as "Forwarding Information Bases" or "FIBs" for political
> > correctness reasons). Which FIB a particular packet uses to make
> > the next hop decision can be decided by a number of mechanisms.
> > The policies these mechanisms implement are the "Policies" referred
> > to in "Policy based routing".
> >
> > One of the constraints I have if I try to back port this work to
> > 6.x is that it must be implemented as an EXTENSION to the existing
> > ABIs in 6.x so that third party applications do not need to be
> > recompiled in the timespan of the branch.
> >
> > Implementation method, (part 1)
> > -------------------------------
> > For this reason I have implemented a "sufficient subset" of a
> > multiple routing table solution in Perforce, and back-ported it
> > to 6.x. (also in Perforce though not yet caught up with what I
> > have done in -current/P4). The subset allows a number of FIBs
> > to be defined at compile time (sufficient for my purposes in 6.x)
> > and implements the changes needed to allow IPV4 to use them. I have
> > not done the changes for ipv6 simply because I do not need it, and
> > I do not have enough knowledge of ipv6 (e.g. neighbor discovery)
> > needed to do it.
> >
> > Other protocol families are left untouched and should there be
> > users with proprietary protocol families, they should continue to
> > work and be oblivious to the existence of the extra FIBs.
> >
> > To understand how this is done, one must know that the current FIB
> > code starts everything off with a single dimensional array of
> > pointers to FIB head structures (One per protocol family), each of
> > which in turn points to the trie of routes available to that
> > family.
> >
> > The basic change in the ABI compatible version of the change is to
> > extend that array to be a 2 dimensional array, so that
> > instead of protocol family X looking at rt_tables[X] for the
> > table it needs, it looks at rt_tables[Y][X], where for all
> > protocol families except ipv4 Y is always 0.
> > Code that is unaware of the change always just sees the first row
> > of the table, which of course looks just like the one dimensional
> > array that existed before.
> >
> >
> > The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
> > are all maintained, but refer only to the first row of the array,
> > so that existing callers in proprietary protocols can continue to
> > do the "right thing".
> > Some new entry points are added, for the exclusive use of ipv4 code
> > called in_rtrequest(), in_rtalloc(), in_rtalloc1() and
> > in_rtalloc_ign(), which have an extra argument which refers the
> > code to the correct row.
> >
> > In addition, there are some new entry points (currently called
> > dom_rtalloc() and friends) that check the Address family being
> > looked up and call either rtalloc() (and friends) if the protocol
> > is not IPv4 forcing the action to row 0 or to the appropriate row
> > if it IS IPv4 (and that info is available). These are for calling
> > from code that is not specific to any particular protocol. The way
> > these are implemented would change in the non ABI preserving code
> > to be added later.
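The two-dimensional table layout described above might be modelled roughly
like this. It is a sketch only: the model_ names, the array sizes and the
address family numbers are assumptions made for illustration and are not
taken from the patch. It shows a legacy-style entry point that always uses
row 0 and a FIB-aware one that honours the row number for IPv4 only.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sizes and names; none of these are taken from the patch. */
#define MODEL_NUM_FIBS	4
#define MODEL_AF_MAX	8
#define MODEL_AF_INET	2

struct model_table {
	int id;
};

/* Row = FIB number (Y), column = protocol family (X). */
static struct model_table *model_rt_tables[MODEL_NUM_FIBS][MODEL_AF_MAX];

/* Install a table for a given FIB and protocol family. */
void
model_table_set(int fibnum, int af, struct model_table *t)
{
	assert(fibnum >= 0 && fibnum < MODEL_NUM_FIBS);
	assert(af >= 0 && af < MODEL_AF_MAX);
	model_rt_tables[fibnum][af] = t;
}

/* FIB-aware lookup: only IPv4 honours the FIB number; others use row 0. */
struct model_table *
model_table_for(int fibnum, int af)
{
	if (af < 0 || af >= MODEL_AF_MAX)
		return (NULL);
	if (af != MODEL_AF_INET)
		fibnum = 0;
	if (fibnum < 0 || fibnum >= MODEL_NUM_FIBS)
		return (NULL);
	return (model_rt_tables[fibnum][af]);
}

/* Legacy-style lookup: callers unaware of FIBs always see row 0. */
struct model_table *
model_table_legacy(int af)
{
	return (model_table_for(0, af));
}
```

Because row 0 has the same shape as the old one-dimensional array, callers
that go through the legacy entry point behave as they did before.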
> >
> > One feature of the first version of the code is that for ipv4,
> > the interface routes show up automatically on all the FIBs, so
> > that no matter what FIB you select you always have the basic
> > direct attached hosts available to you. (rtinit() does this
> > automatically).
> > You CAN delete an interface route from one FIB should you want
> > to but by default it's there. ARP information is also available
> > in each FIB. It's assumed that the same machine would have the
> > same MAC address, regardless of which FIB you are using to get
> > to it.
> >
> >
> > This brings us to how the correct FIB is selected for an
> > outgoing IPV4 packet.
> >
> > Packets fall into one of a number of classes.
> > 1/ locally generated packets, coming from a socket/PCB.
> >    Such packets select a FIB from a number associated with the
> >    socket/PCB. This in turn is inherited from the process,
> >    but can be changed by a socket option. The process in turn
> >    inherits it on fork. I have written a utility called setfib
> >    that acts a bit like nice..
> >
> >         setfib -n 3 ping target.example.com # will use fib 3 for ping.
> >
> > 2/ packets received on an interface for forwarding.
> >    By default these packets would use table 0,
> >    (or possibly a number settable in a sysctl(not yet)).
> >    but prior to routing the firewall can inspect them (see below).
> >
> > 3/ packets inspected by a packet classifier, which can arbitrarily
> >    associate a fib with it on a packet by packet basis.
> >    A fib assigned to a packet by a packet classifier
> >    (such as ipfw) would over-ride a fib associated by
> >    a more default source. (such as cases 1 or 2).
> >
> > Routing messages would be associated with their
> > process, and thus select one FIB or another.
> >
> > In addition Netstat has been edited to be able to cope with the
> > fact that the array is now 2 dimensional. (It looks in system
> > memory using libkvm (!)).
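The precedence among the three packet classes above can be sketched as a
small selection function. This is an illustration under assumptions: struct
model_packet, MODEL_FIB_UNSET and model_select_fib() are hypothetical names,
and falling back to table 0 for an out-of-range number is a choice made for
the sketch, not something stated in the proposal.

```c
#include <assert.h>

#define MODEL_FIB_UNSET	(-1)	/* hypothetical "no FIB tagged" marker */

struct model_packet {
	int classifier_fib;	/* set by a classifier such as ipfw, else UNSET */
	int socket_fib;		/* inherited from the socket/PCB, else UNSET */
};

/* Pick the FIB for an outgoing packet, given nfibs compiled-in tables. */
int
model_select_fib(const struct model_packet *p, int nfibs)
{
	int fib = 0;	/* class 2: forwarded packets default to table 0 */

	assert(p != 0 && nfibs > 0);
	if (p->socket_fib != MODEL_FIB_UNSET)
		fib = p->socket_fib;		/* class 1: locally generated */
	if (p->classifier_fib != MODEL_FIB_UNSET)
		fib = p->classifier_fib;	/* class 3: classifier overrides */
	if (fib < 0 || fib >= nfibs)
		fib = 0;	/* assumption: fall back on an out-of-range number */
	return (fib);
}
```

A packet sent by a process started under "setfib -n 3" would correspond to
socket_fib == 3 here, unless a classifier rule tags it with a different
table.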
> >
> > In addition two sysctls are added to give:
> > a) the number of FIBs compiled in (active)
> > b) the default FIB of the calling process.
> >
> > Early testing experience:
> > -------------------------
> >
> > Basically our (IronPort's) appliance does this functionality
> > already using ipfw fwd but that method has some drawbacks.
> >
> > For example,
> > It can't fully simulate a routing table because it can't influence
> > the socket's choice of local address when a connect() is done.
> >
> >
> > Testing during the generating of these changes has been
> > remarkably smooth so far. Multiple tables have co-existed
> > with no notable side effects, and packets have been routed
> > accordingly.
> >
> > I have not yet added the changes to ipfw.
> > pf has some similar changes already but they seem to rely on
> > the various FIBs having symbolic names. Which I do not plan to
> > support in the first version of these changes.
> >
> > SCTP has interestingly enough built in support for this, called
> > VRFs in Cisco parlance. It will be interesting to see how that
> > handles it when it suddenly actually does something.
> >
> > I have not redone my testing since my last edits, but will be
> > retesting with the current code asap.
> >
> >
> > Where to next:
> > --------------------
> >
> > After committing the ABI compatible version and MFCing it, I'd
> > like to proceed in a forward direction in -current. This will
> > result in some roto-tilling in the routing code.
> >
> > Firstly: the current code's idea of having a separate tree per
> > protocol family, all of the same format, and pointed to by the
> > 1 dimensional array is a bit silly. Especially when one considers
> > that there is code that makes assumptions about every protocol
> > having the same internal structures there. Some protocols don't
> > WANT that sort of structure. (for example the whole idea of a
> > netmask is foreign to appletalk). This needs to be made opaque to
> > the external code.
> >
> > My suggested first change is to add routing method pointers to the
> > 'domain' structure, along with information pointing to the data.
> > Instead of having an array of pointers to uniform structures,
> > there would be an array pointing to the 'domain' structures
> > for each protocol address domain (protocol family),
> > and the methods thus reached would be called. The methods would
> > have an argument that gives FIB number, but the protocol would be
> > free to ignore it.
> >
> > Interaction with the ARP layer/ LL layer would need to be
> > revisited as well. Qing Li has been working on this already.
> >
> >
> > diffs
> > for those with p4 access:
> > p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
> > //depot/user/julian/routing/src/sys/...
> >
> > for those with the makediff perl script:
> > perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121
> > //depot/user/julian/routing/src/sys/...
> >
> > for those with neither:
> >
> > http://people.freebsd.org/~julian/mrt2.diff
> >
> > I just put the userland utility in usr.sbin/setfib/ in p4.
> > and changes to netstat in usr.bin/netstat/
> >
> > see:
> > http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
> >
> > I'd like to get comments on this (compat) version, so that I can
> > commit it, get general testing under way to start the clock for
> > MFC, and then get moving on the fuller implementation (that breaks
> > ABIs) and other routing issues.
>
> How does this work with Marko Zec's virtual stack system?

The thrust behind Julian's work seems to be providing multiple forwarding
tables for purposes of traffic engineering / policy based routing, with a
single firewall instance used as a classifier.
vimage-style network stack virtualization provides for more strict isolation
on both port and IP address space, independent firewall instances, IPSEC
config / state etc., and as such might be better suited for providing
enhanced jail-style virtual hosting environments, as well as for providing
virtual router "slices".

So once we get Julian's multi-FIB stuff in the base system, I see no reason
why we couldn't have this functionality replicated in each "vimage" instance,
i.e. have multiple independent virtual networking environments, each with
multiple FIBs.

Implementation-wise, my hacks currently rely on macros for conditional
virtualization of global variables / structs. As long as Julian's changes
continue to be unconditional, i.e. without playing a similar macroization
game, I think integrating this code (once it hits HEAD) into
p4/projects/vimage should be more or less a straightforward job.

Marko

From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 22:56:35 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5184016A419 for ; Fri, 28 Dec 2007 22:56:35 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 1BF2413C44B for ; Fri, 28 Dec 2007 22:56:34 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226448658-1834499 for ; Fri, 28 Dec 2007 17:58:50 -0500 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBSMuWMw065495 for ; Fri, 28 Dec 2007 17:56:32 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-arch@freebsd.org Date: Fri, 28 Dec 2007 17:45:07 -0500 User-Agent: KMail/1.9.6 References: <200712271704.44796.jhb@FreeBSD.org>
In-Reply-To: <200712271704.44796.jhb@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200712281745.08144.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Fri, 28 Dec 2007 17:56:32 -0500 (EST) X-Virus-Scanned: ClamAV 0.91.2/5278/Fri Dec 28 11:55:36 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Subject: Re: kernel features MIB X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 22:56:35 -0000 On Thursday 27 December 2007 05:04:44 pm John Baldwin wrote: > At work we don't have a pretty API for this at all, but I'm thinking for > FreeBSD we can do this: > > FEATURE(foo, "description of foo") > > which is a macro to create the 'kern.features.foo' node and set it to 1. Then > we could have a routine in libc: > > int feature_present(const char *name); > > That returns a boolean to indicate if a given feature is present or not by > invoking sysctlbyname(3), etc. > > Any objections to the idea? So here's a bikeshed question I have no idea for. Which header should feature_present()'s prototype go in? I anticipate this routine being used in libc itself, so I don't think it can go into libutil. 
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 22:57:25 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 68DEE16A46B; Fri, 28 Dec 2007 22:57:25 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 7646A13C45A; Fri, 28 Dec 2007 22:57:25 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id 1C8441A4D80; Fri, 28 Dec 2007 14:55:26 -0800 (PST) Date: Fri, 28 Dec 2007 14:55:26 -0800 From: Alfred Perlstein To: John Baldwin Message-ID: <20071228225526.GJ76698@elvis.mu.org> References: <200712271704.44796.jhb@FreeBSD.org> <200712281745.08144.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200712281745.08144.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i Cc: freebsd-arch@freebsd.org Subject: Re: kernel features MIB X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2007 22:57:25 -0000 * John Baldwin [071228 14:54] wrote: > On Thursday 27 December 2007 05:04:44 pm John Baldwin wrote: > > At work we don't have a pretty API for this at all, but I'm thinking for > > FreeBSD we can do this: > > > > FEATURE(foo, "description of foo") > > > > which is a macro to create the 'kern.features.foo' node and set it to 1. Then > > we could have a routine in libc: > > > > int feature_present(const char *name); > > > > That returns a boolean to indicate if a given feature is present or not by > > invoking sysctlbyname(3), etc. > > > > Any objections to the idea? > > So here's a bikeshed question I have no idea for. 
Which header should
> feature_present()'s prototype go in? I anticipate this routine being
> used in libc itself, so I don't think it can go into libutil.

Wherever sysconf/pathconf stuff is.

--
- Alfred Perlstein

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 29 00:31:05 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1A84616A417; Sat, 29 Dec 2007 00:31:05 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id EE59113C447; Sat, 29 Dec 2007 00:31:04 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 41C2F47CAC; Fri, 28 Dec 2007 19:31:04 -0500 (EST) Date: Sat, 29 Dec 2007 00:31:04 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: John Baldwin In-Reply-To: <200712281745.08144.jhb@freebsd.org> Message-ID: <20071229002903.M45653@fledge.watson.org> References: <200712271704.44796.jhb@FreeBSD.org> <200712281745.08144.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: kernel features MIB X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Dec 2007 00:31:05 -0000

On Fri, 28 Dec 2007, John Baldwin wrote:

> On Thursday 27 December 2007 05:04:44 pm John Baldwin wrote:
>
>> At work we don't have a pretty API for this at all, but I'm thinking for
>> FreeBSD we can do this:
>>
>> FEATURE(foo, "description of foo")
>>
>> which is a macro to create the 'kern.features.foo' node and set it to 1.
>> Then we could have a routine in libc: >> >> int feature_present(const char *name); >> >> That returns a boolean to indicate if a given feature is present or not by >> invoking sysctlbyname(3), etc. >> >> Any objections to the idea? > > So here's a bikeshed question I have no idea for. Which header should > feature_present()'s prototype go in? I anticipate this routine being used > in libc itself, so I don't think it can go into libutil. #include feature_check(2)? Does POSIX talk about the namespace for non-portable names being passed to sysconf(3)? Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Sat Dec 29 05:09:42 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 65C4716A418 for ; Sat, 29 Dec 2007 05:09:42 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from outbound0.mx.meer.net (outbound0.mx.meer.net [209.157.153.23]) by mx1.freebsd.org (Postfix) with ESMTP id 5385013C469 for ; Sat, 29 Dec 2007 05:09:42 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from mail.meer.net (mail.meer.net [209.157.152.14]) by outbound0.sv.meer.net (8.12.10/8.12.6) with ESMTP id lBT32Pih000379; Fri, 28 Dec 2007 19:02:25 -0800 (PST) (envelope-from gnn@neville-neil.com) Received: from minion.local.neville-neil.com (61.204.211.246.customerlink.pwd.ne.jp [61.204.211.246]) by mail.meer.net (8.13.3/8.13.3/meer) with ESMTP id lBT32OIN095874; Fri, 28 Dec 2007 19:02:24 -0800 (PST) (envelope-from gnn@neville-neil.com) Date: Sat, 29 Dec 2007 12:02:22 +0900 Message-ID: From: gnn@freebsd.org To: Marko Zec In-Reply-To: <200712282040.30745.zec@tel.fer.hr> References: <4772F123.5030303@elischer.org> <200712282040.30745.zec@tel.fer.hr> User-Agent: Wanderlust/2.15.5 (Almost Unreal) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.7 Emacs/22.1.50 (i386-apple-darwin8.10.1) MULE/5.0 
(SAKAKI) MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka") Content-Type: text/plain; charset=US-ASCII Cc: FreeBSD Net , Qing Li , Robert Watson , Julian Elischer , freebsd-arch@freebsd.org Subject: Re: resend: multiple routing table roadmap (format fix) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Dec 2007 05:09:42 -0000

At Fri, 28 Dec 2007 20:40:30 +0100,
Marko Zec wrote:
> The thrust behind Julian's work seems to be providing multiple
> forwarding tables for purposes of traffic engineering / policy
> based routing, with a single firewall instance used as a classifier.
> vimage-style network stack virtualization provides for more strict
> isolation on both port and IP address space, independent firewall
> instances, IPSEC config / state etc., and as such might be better
> suited for providing enhanced jail-style virtual hosting environments,
> as well as for providing virtual router "slices".
>
> So once we get Julian's multi-FIB stuff in the base system, I see no
> reason why we couldn't have this functionality replicated in
> each "vimage" instance, i.e. have multiple independent virtual
> networking environments, each with multiple FIBs.
>
> Implementation-wise, my hacks currently rely on macros for conditional
> virtualization of global variables / structs. As long as Julian's
> changes continue to be unconditional, i.e. without playing a similar
> macroization game, I think integrating this code (once it hits HEAD)
> into p4/projects/vimage should be more or less a straightforward job.

Cool, that's what I wanted to hear.
Best,
George

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 29 23:43:58 2007 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1527A16A417 for ; Sat, 29 Dec 2007 23:43:58 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id D3D6913C4D3 for ; Sat, 29 Dec 2007 23:43:57 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id lBTNhrx4067553 for ; Sat, 29 Dec 2007 18:43:54 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Sat, 29 Dec 2007 13:44:50 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: arch@freebsd.org Message-ID: <20071229133256.D957@desktop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: kvm_getfiles is badly broken X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Dec 2007 23:43:58 -0000

From kvm_getfiles(3):

     The number of files found is returned in the reference parameter cnt.
     The files are returned as a contiguous array of file structures,
     preceded by the address of the first file entry in the kernel.

sysctl kern.file is used if the kernel is live. This code assumes the kernel
copies out a struct filelist before any files. It does not. I cannot find any
consumers of this interface, however. I also don't understand why it supplies
the address of the first file and what this would be used for.

There are other users of sysctl kern.file which assume it does not prepend
this address so it would be wrong to change that.
Would it also be wrong to change kvm to supply null as the first address?

Other inconsistencies include live kernels returning struct xfile and dead
kernels returning struct file. The interface in kvm_getfiles() claims to
return struct files. I can't imagine any code actually relies on this routine.

Any opinions on what we should do with this? It has been broken since 2002 at
least. I'm committing changes for my lockless struct file work. As part of
that I'll commit a broken but compiling implementation that matches current
bugs but causes the code to fail whenever it is called.

Cheers,
Jeff