From owner-freebsd-arch@FreeBSD.ORG  Sun Dec 23 06:33:04 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2D7B516A417;
	Sun, 23 Dec 2007 06:33:04 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com
	[216.240.101.25])
	by mx1.freebsd.org (Postfix) with ESMTP id E43D513C447;
	Sun, 23 Dec 2007 06:33:03 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com
	[24.94.75.93]) (authenticated bits=0)
	by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id
	lBN6Ww4a017018; Sun, 23 Dec 2007 01:32:59 -0500 (EST)
	(envelope-from jroberson@chesapeake.net)
Date: Sat, 22 Dec 2007 20:34:16 -1000 (HST)
From: Jeff Roberson <jroberson@chesapeake.net>
X-X-Sender: jroberson@desktop
To: Andre Oppermann <andre@freebsd.org>
In-Reply-To: <476A4DCC.4040206@freebsd.org>
Message-ID: <20071222203120.A899@desktop>
References: <20071219211025.T899@desktop> <476A4DCC.4040206@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org
Subject: Re: Linux compatible setaffinity.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Dec 2007 06:33:04 -0000

On Thu, 20 Dec 2007, Andre Oppermann wrote:

> Jeff Roberson wrote:
>> I have implemented a linux compatible sched_setaffinity() call which is 
>> somewhat crippled.  This allows a userspace process to supply a bitmask of 
>> processors which it will run on.  I have copied the linux interface such 
>> that it should be api compatible because I believe it is a sensible 
>> interface and they beat us to it by 3 years.
>
> The Linux (and Solaris) style setaffinity is rather low level and
> any user of it has to make many assumptions based on incomplete
> knowledge of the underlying hardware and its architecture (buses,
> caches, latency between cores, etc).
>
> In practical use I'd rather have a function to bind myself to the
> current CPU or CPU number X, and then to specify that new threads
> or forked processes should emerge on another, but not this CPU.
> Pepper that with a few hints like latency and cache affinity (important
> or not important) the kernel can act on appropriately and it becomes
> much more powerful and simpler to use.  Taking it even further an
> application may want to specify that it would like to run on a number
> X of cores that are close (latency/cache) together, be permanently
> bound to it and to repel any other such requests.  This way I can
> run my database server on socket 1 cores 1-4, and the webserver on
> socket 2 cores 5-8 more or less automagically.  sched_setaffinity
> requires a lot of operator involvement and architecture knowledge
> to make that happen.
>
> Not that I'm against a Linux compatible sched_setaffinity(), it's
> just not as practical to use as other constructs.
>
> Food for thought.


Well my hope is that the kernel scheduler has all of the required 
information about the processor to make these kinds of decisions for the 
general case.  Right now we need better topology information in the 
kernel, but I think userspace only uses setaffinity in very special cases. 
I'd hate for it to become the norm in applications to start looking at cpu 
topology and making decisions based on that.

Not that I would argue if someone were to implement this.  I just want us 
to get it right often enough in the scheduler that it's not necessary.

The uses for setaffinity that I have seen so far have been very special 
purpose.  Or quite often just spawning one thread per cpu and pinning it 
in place for various purposes.

Jeff

>
> -- 
> Andre
>

From owner-freebsd-arch@FreeBSD.ORG  Sun Dec 23 06:35:16 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B9A4F16A419;
	Sun, 23 Dec 2007 06:35:16 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com
	[216.240.101.25])
	by mx1.freebsd.org (Postfix) with ESMTP id 7CE7513C448;
	Sun, 23 Dec 2007 06:35:16 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com
	[24.94.75.93]) (authenticated bits=0)
	by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id
	lBN6ZEmS017221; Sun, 23 Dec 2007 01:35:15 -0500 (EST)
	(envelope-from jroberson@chesapeake.net)
Date: Sat, 22 Dec 2007 20:36:32 -1000 (HST)
From: Jeff Roberson <jroberson@chesapeake.net>
X-X-Sender: jroberson@desktop
To: David Xu <davidxu@FreeBSD.org>
In-Reply-To: <476B1973.6070902@freebsd.org>
Message-ID: <20071222203443.U899@desktop>
References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org
Subject: Re: Linux compatible setaffinity.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Dec 2007 06:35:16 -0000

On Fri, 21 Dec 2007, David Xu wrote:

> Jeff Roberson wrote:
>> I have implemented a linux compatible sched_setaffinity() call which is 
>> somewhat crippled.  This allows a userspace process to supply a bitmask of 
>> processors which it will run on.  I have copied the linux interface such 
>> that it should be api compatible because I believe it is a sensible 
>> interface and they beat us to it by 3 years.
>> 
>> My implementation is crippled in that it supports binding by curthread only 
>> and to a single cpu only.  Neither of the schedulers presently supports 
>> binding to multiple cpus or binding a non-curthread thread.  This property 
>> is not inherited by forked threads and does not affect other threads in the 
>> same process.  These two limitations can gradually be weakened without 
>> affecting the syscall api.
>> 
>> The linux api is:
>> int    sched_setaffinity(pid_t pid, unsigned int cpusetsize, cpu_set_t 
>> *mask);
>> 
>> The cpu_set_t is the same as a fdset for select.  The cpusetsize argument 
>> is used to determine the size of the array in mask.
>> 
>> I'm mostly interested in feedback on how best to reduce the namespace 
>> pollution and avoid pulling the sched.h file into the generated syscall 
>> files (sysproto.h, etc).  Anyone who feels this is a terrible interface for 
>> such a thing should speak up now.
>> 
>> I also feel that in the medium term we will have to deal with machines with 
>> more cores than bits in their native word.  Using these CPU_SET, CPU_CLR 
>> macros is a fine way to deal with this issue.
>> 
>> I also have a primitive 'taskset', although I don't like the name, it 
>> allows you to run arbitrary programs bound to a single cpu.
>> 
>> Thanks,
>> Jeff
>> 
>
> I don't say no to these interfaces, but there is a need to tell the
> user which cpus share a cache, which are closest in memory distance,
> and which cpus are servicing interrupts, e.g. network and disk
> interrupts; otherwise, blindly setting a cpu affinity mask is just a
> way to shoot yourself in the foot.

I don't disagree with you, however, I think in most cases the affinity 
mask is used for very special purpose applications.  In the cases I have 
observed, anyhow, the application is tailored to the particular machine. 
So hopefully the programmer knows these things.  I would prefer that it 
not crop up as a general interface that normal applications use to try to 
improve performance.  We should hope that we can improve the schedulers to 
do these things automatically.
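
As an editorial illustration of the mask discussed above, here is a minimal 
userspace sketch of an fd_set-style cpu mask that spans several machine 
words, plus the "single cpu only" check that the crippled implementation 
would need.  All demo_ names and the DEMO_MAXCPU limit are hypothetical; 
this is neither the Linux cpu_set_t nor the code from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limits and names, chosen only for this sketch. */
#define DEMO_MAXCPU   256
#define DEMO_WORDBITS (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NWORDS   ((DEMO_MAXCPU + DEMO_WORDBITS - 1) / DEMO_WORDBITS)

/* One bit per cpu, spread over as many words as needed. */
typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

static inline void demo_cpu_zero(demo_cpuset_t *set)
{
	memset(set, 0, sizeof(*set));
}

static inline void demo_cpu_set(size_t cpu, demo_cpuset_t *set)
{
	assert(cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORDBITS] |= 1UL << (cpu % DEMO_WORDBITS);
}

static inline void demo_cpu_clr(size_t cpu, demo_cpuset_t *set)
{
	assert(cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORDBITS] &= ~(1UL << (cpu % DEMO_WORDBITS));
}

static inline int demo_cpu_isset(size_t cpu, const demo_cpuset_t *set)
{
	assert(cpu < DEMO_MAXCPU);
	return (int)((set->bits[cpu / DEMO_WORDBITS] >> (cpu % DEMO_WORDBITS)) & 1UL);
}

/*
 * Return the cpu number if exactly one bit is set in the mask,
 * or -1 if the mask is empty or names more than one cpu.
 */
static inline int demo_single_cpu(const demo_cpuset_t *set)
{
	size_t cpu;
	int found = -1;

	for (cpu = 0; cpu < DEMO_MAXCPU; cpu++) {
		if (!demo_cpu_isset(cpu, set))
			continue;
		if (found != -1)
			return -1;	/* second bit seen: not a single-cpu mask */
		found = (int)cpu;
	}
	return found;
}
```

A syscall restricted to single-cpu binding could reject any mask for which 
demo_single_cpu() returns -1, and relax that check later without changing 
the mask layout seen by userspace.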

Thanks,
Jeff


>
> Regards,
> David Xu
>

From owner-freebsd-arch@FreeBSD.ORG  Sun Dec 23 06:47:30 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 31F6B16A41A;
	Sun, 23 Dec 2007 06:47:30 +0000 (UTC) (envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 0746313C461;
	Sun, 23 Dec 2007 06:47:29 +0000 (UTC) (envelope-from imp@bsdimp.com)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.1/8.14.1) with ESMTP id lBN6fptc065424;
	Sat, 22 Dec 2007 23:41:51 -0700 (MST) (envelope-from imp@bsdimp.com)
Date: Sat, 22 Dec 2007 23:45:26 -0700 (MST)
Message-Id: <20071222.234526.246317277.imp@bsdimp.com>
To: hselasky@c2i.net
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <200712202005.33263.hselasky@c2i.net>
References: <200712202005.33263.hselasky@c2i.net>
X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: phk@phk.freebsd.dk, alfred@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: More leaves on the device tree ?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Dec 2007 06:47:30 -0000

In message: <200712202005.33263.hselasky@c2i.net>
            Hans Petter Selasky <hselasky@c2i.net> writes:
: I'm currently working on USB and I have been thinking about a simple way to 
: find what devices a USB device creates, and how to easily present that 
: information to the user.
: 
: I know there is "devinfo" and I would like to extend this utility to also show 
: which devices under /dev belong to the device.
: 
: Implementation:
:
: "make_dev" takes an additional "device_t parent_device" argument and creates a 
: child device with some magic flags set.
: 
: Any comments ?

What do you do for all the devices in /dev/ for which there is no
device_t parent?

In general, we've tried to keep dev_t and device_t separate inside of
the kernel.  They are orthogonal, but related, things.  This gets
especially messy when you add to the mix NIC drivers, which create no
devices, but have network interfaces.  Do you also track that?  What
about the relationship to cloned or otherwise faked devices, such as
those the floppy driver and many tty drivers produce?

While it sounds simple and straightforward, I don't think that a good
implementation that takes into account the complexities of actual
hardware would be worth the complexity.

Warner

From owner-freebsd-arch@FreeBSD.ORG  Sun Dec 23 07:44:50 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2E6CF16A417;
	Sun, 23 Dec 2007 07:44:50 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id B9E9013C44B;
	Sun, 23 Dec 2007 07:44:49 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (unknown [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id 009DD17104;
	Sun, 23 Dec 2007 07:44:47 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id lBN7ikow005387;
	Sun, 23 Dec 2007 07:44:46 GMT (envelope-from phk@critter.freebsd.dk)
To: "M. Warner Losh" <imp@bsdimp.com>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Sat, 22 Dec 2007 23:45:26 MST."
	<20071222.234526.246317277.imp@bsdimp.com> 
Date: Sun, 23 Dec 2007 07:44:46 +0000
Message-ID: <5386.1198395886@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: freebsd-arch@freebsd.org, alfred@freebsd.org, hselasky@c2i.net
Subject: Re: More leaves on the device tree ? 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Dec 2007 07:44:50 -0000

In message <20071222.234526.246317277.imp@bsdimp.com>, "M. Warner Losh" writes:
>In message: <200712202005.33263.hselasky@c2i.net>
>            Hans Petter Selasky <hselasky@c2i.net> writes:
>: "make_dev" takes an additional "device_t parent_device" argument and creates a 
>: child device with some magic flags set.
>
>What do you do for all the devices in /dev/ for which there is no
>device_t parent?

I second Warner's comments here.

device_t is a handle for a piece of hardware, dev_t is for a device in /dev;
they are very different things and have no reasonable mapping between
them ([0..N]:[0..M]).

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Sun Dec 23 08:43:48 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1718F16A55F;
	Sun, 23 Dec 2007 08:43:48 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
Received: from swip.net (mailfe05.swip.net [212.247.154.129])
	by mx1.freebsd.org (Postfix) with ESMTP id 564F813C45A;
	Sun, 23 Dec 2007 08:43:47 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
X-Cloudmark-Score: 0.000000 []
Received: from [193.217.102.3] (account mc467741@c2i.net HELO [10.0.0.105])
	by mailfe05.swip.net (CommuniGate Pro SMTP 5.1.13)
	with ESMTPA id 641958604; Sun, 23 Dec 2007 09:43:45 +0100
From: Hans Petter Selasky <hselasky@c2i.net>
To: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
Date: Sun, 23 Dec 2007 09:44:27 +0100
User-Agent: KMail/1.9.7
References: <5386.1198395886@critter.freebsd.dk>
In-Reply-To: <5386.1198395886@critter.freebsd.dk>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712230944.29301.hselasky@c2i.net>
Cc: alfred@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: More leaves on the device tree ?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Dec 2007 08:43:48 -0000

On Sunday 23 December 2007, Poul-Henning Kamp wrote:
> In message <20071222.234526.246317277.imp@bsdimp.com>, "M. Warner Losh" writes:
> >In message: <200712202005.33263.hselasky@c2i.net>
> >
> >            Hans Petter Selasky <hselasky@c2i.net> writes:
> >: "make_dev" takes an additional "device_t parent_device" argument and
> >: creates a child device with some magic flags set.
> >
> >What do you do for all the devices in /dev/ for which there is no
> >device_t parent?

Hi,

If the parent is NULL, then no dev_t node is created.

Regarding cloned devices my opinion is that they should always create a 
visible entry. What I have done for a while now is to create a dummy dev_t 
node. It is so annoying with invisible devices. Then you never know what you 
have got. For example "/dev/usbXXX".

What I do is simply to create "/dev/usb0 " with a space at the end. This file 
is not openable. Really there should be a flag for that. Then you 
open "/dev/usb0" instead, but this device is never created. That's the clone 
device. Then clones appear like "/dev/usb0.XX":

/dev/usb0 %      /dev/usb1 %      /dev/usb2 %      /dev/usb3 %
/dev/usb0.00%    /dev/usb1.00%    /dev/usb2.00%    /dev/usb3.00%

>
> I second Warner's comments here.
>
> device_t is a handle for a piece of hardware, dev_t is for a device in /dev;
> they are very different things and have no reasonable mapping between
> them ([0..N]:[0..M]).

I'm not saying that every make_dev() should take a device_t parent. If there 
is no "device_t" parent then there will be no node created.

Another approach is to add something like:

void device_enlist_subdev(device_t parent, dev_t sub);
void device_delist_subdev(device_t parent, dev_t sub);

struct device {
...
	LIST_HEAD(struct cdev) dv_cdev_children;
...
};

struct cdev {
	LIST_ENTRY( .... ) dv_list;
};

For example if you have 8 USB serial port adapters, then you just get 8 TTY 
devices like /dev/cuaUXXX . And finding out where the USB devices are 
actually connected could be very simple if we could put some hints perhaps in 
the device_t tree ?
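
As an editorial illustration, the enlist/delist idea above can be tried out 
in userspace with the <sys/queue.h> list macros.  This is only a sketch 
under stated assumptions: the demo_ types are hypothetical stand-ins for 
device_t and struct cdev, and it models one parent owning many /dev nodes, 
not the general [0..N]:[0..M] relationship raised earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct cdev (a node under /dev). */
struct demo_cdev {
	const char *name;			/* e.g. "cuaU0" */
	LIST_ENTRY(demo_cdev) dv_list;		/* linkage on the parent's list */
};

/* Hypothetical stand-in for struct device (what device_t points to). */
struct demo_device {
	const char *nameunit;			/* e.g. "ucom0" */
	LIST_HEAD(, demo_cdev) dv_cdev_children;
};

static inline void demo_device_init(struct demo_device *dev, const char *nameunit)
{
	dev->nameunit = nameunit;
	LIST_INIT(&dev->dv_cdev_children);
}

/* Record that 'sub' was created on behalf of 'parent'. */
static inline void demo_enlist_subdev(struct demo_device *parent, struct demo_cdev *sub)
{
	assert(parent != NULL && sub != NULL);
	LIST_INSERT_HEAD(&parent->dv_cdev_children, sub, dv_list);
}

/* Forget 'sub'; 'parent' is kept only to mirror the proposed signature. */
static inline void demo_delist_subdev(struct demo_device *parent, struct demo_cdev *sub)
{
	(void)parent;
	LIST_REMOVE(sub, dv_list);
}

/* Walk the child list, as a devinfo-like tool might when printing nodes. */
static inline size_t demo_count_subdevs(struct demo_device *parent)
{
	struct demo_cdev *c;
	size_t n = 0;

	LIST_FOREACH(c, &parent->dv_cdev_children, dv_list)
		n++;
	return n;
}
```

Because a node carries exactly one dv_list linkage here, it can belong to 
only one parent at a time, which is the limitation the earlier objection 
points at.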

--HPS

From owner-freebsd-arch@FreeBSD.ORG  Sun Dec 23 09:31:16 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 23F2616A46C
	for <freebsd-arch@freebsd.org>; Sun, 23 Dec 2007 09:31:16 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 20BAB13C47E
	for <freebsd-arch@freebsd.org>; Sun, 23 Dec 2007 09:31:16 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id 72A1C1A4D82; Sun, 23 Dec 2007 01:29:41 -0800 (PST)
Date: Sun, 23 Dec 2007 01:29:41 -0800
From: Alfred Perlstein <alfred@freebsd.org>
To: Hans Petter Selasky <hselasky@c2i.net>
Message-ID: <20071223092941.GV16982@elvis.mu.org>
References: <5386.1198395886@critter.freebsd.dk>
	<200712230944.29301.hselasky@c2i.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200712230944.29301.hselasky@c2i.net>
User-Agent: Mutt/1.4.2.3i
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, freebsd-arch@freebsd.org
Subject: Re: More leaves on the device tree ?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Dec 2007 09:31:16 -0000

I think we're getting a little off track of the TODO items
for the inclusion of the usb code.

I think the original post mentions that devinfo(?) works for
the time being, although perhaps some work later on presentation
could be done.

Let's stick with devinfo and hit up the next tasks on the 
inclusion list.

I think the next thing is the SMP locking?  Or do you have
anything in mind from the list that you would like to
tackle next?

thank you,
-Alfred


* Hans Petter Selasky <hselasky@c2i.net> [071223 00:42] wrote:
> On Sunday 23 December 2007, Poul-Henning Kamp wrote:
> > In message <20071222.234526.246317277.imp@bsdimp.com>, "M. Warner Losh" writes:
> > >In message: <200712202005.33263.hselasky@c2i.net>
> > >
> > >            Hans Petter Selasky <hselasky@c2i.net> writes:
> > >: "make_dev" takes an additional "device_t parent_device" argument and
> > >: creates a child device with some magic flags set.
> > >
> > >What do you do for all the devices in /dev/ for which there is no
> > >device_t parent?
> 
> Hi,
> 
> If the parent is NULL, then no dev_t node is created.
> 
> Regarding cloned devices my opinion is that they should always create a 
> visible entry. What I have done for a while now is to create a dummy dev_t 
> node. It is so annoying with invisible devices. Then you never know what you 
> have got. For example "/dev/usbXXX".
> 
> What I do is simply to create "/dev/usb0 " with a space at the end. This file 
> is not openable. Really there should be a flag for that. Then you 
> open "/dev/usb0" instead, but this device is never created. That's the clone 
> device. Then clones appear like "/dev/usb0.XX":
> 
> /dev/usb0 %      /dev/usb1 %      /dev/usb2 %      /dev/usb3 %
> /dev/usb0.00%    /dev/usb1.00%    /dev/usb2.00%    /dev/usb3.00%
> 
> >
> > I second Warner's comments here.
> >
> > device_t is a handle for a piece of hardware, dev_t is for a device in /dev;
> > they are very different things and have no reasonable mapping between
> > them ([0..N]:[0..M]).
> 
> I'm not saying that every make_dev() should take a device_t parent. If there 
> is no "device_t" parent then there will be no node created.
> 
> Another approach is to add something like:
> 
> void device_enlist_subdev(device_t parent, dev_t sub);
> void device_delist_subdev(device_t parent, dev_t sub);
> 
> struct device {
> ...
> 	LIST_HEAD(struct cdev) dv_cdev_children;
> ...
> };
> 
> struct cdev {
> 	LIST_ENTRY( .... ) dv_list;
> };
> 
> For example if you have 8 USB serial port adapters, then you just get 8 TTY 
> devices like /dev/cuaUXXX . And finding out where the USB devices are 
> actually connected could be very simple if we could put some hints perhaps in 
> the device_t tree ?
> 
> --HPS

-- 
- Alfred Perlstein

From owner-freebsd-arch@FreeBSD.ORG  Sun Dec 23 10:31:41 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8CD7216A419;
	Sun, 23 Dec 2007 10:31:41 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
Received: from swip.net (mailfe08.swip.net [212.247.154.225])
	by mx1.freebsd.org (Postfix) with ESMTP id E9B1113C469;
	Sun, 23 Dec 2007 10:31:40 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
X-Cloudmark-Score: 0.000000 []
Received: from [193.217.102.3] (account mc467741@c2i.net HELO [10.0.0.105])
	by mailfe08.swip.net (CommuniGate Pro SMTP 5.1.13)
	with ESMTPA id 739633548; Sun, 23 Dec 2007 11:31:39 +0100
From: Hans Petter Selasky <hselasky@c2i.net>
To: Alfred Perlstein <alfred@freebsd.org>
Date: Sun, 23 Dec 2007 11:32:20 +0100
User-Agent: KMail/1.9.7
References: <5386.1198395886@critter.freebsd.dk>
	<200712230944.29301.hselasky@c2i.net>
	<20071223092941.GV16982@elvis.mu.org>
In-Reply-To: <20071223092941.GV16982@elvis.mu.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712231132.22576.hselasky@c2i.net>
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, freebsd-arch@freebsd.org
Subject: Re: More leaves on the device tree ?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Dec 2007 10:31:41 -0000

On Sunday 23 December 2007, Alfred Perlstein wrote:
> I think we're getting a little off track of the TODO items
> for the inclusion of the usb code.
>
> I think the original post mentions that devinfo(?) works for
> the time being, although perhaps some work later on presentation
> could be done.
>
> Let's stick with devinfo and hit up the next tasks on the
> inclusion list.
>
> I think the next thing is the SMP locking?  Or do you have
> anything in mind from the list that you would like to
> tackle next?

No, just go ahead. What needs to be done about SMP locking ?

--HPS

From owner-freebsd-arch@FreeBSD.ORG  Mon Dec 24 01:43:03 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@hub.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2BAB416A41B;
	Mon, 24 Dec 2007 01:43:03 +0000 (UTC)
	(envelope-from davidxu@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 3500313C45A;
	Mon, 24 Dec 2007 01:43:03 +0000 (UTC)
	(envelope-from davidxu@FreeBSD.org)
Received: from apple.my.domain (root@localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id lBO1gxND030035;
	Mon, 24 Dec 2007 01:43:01 GMT (envelope-from davidxu@freebsd.org)
Message-ID: <476F0EE5.1040404@freebsd.org>
Date: Mon, 24 Dec 2007 09:44:05 +0800
From: David Xu <davidxu@FreeBSD.org>
User-Agent: Thunderbird 2.0.0.9 (X11/20071211)
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.org>
References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org>
	<20071222183700.L5866@fledge.watson.org>
In-Reply-To: <20071222183700.L5866@fledge.watson.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@FreeBSD.org
Subject: Re: Linux compatible setaffinity.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 24 Dec 2007 01:43:03 -0000

Robert Watson wrote:
> On Fri, 21 Dec 2007, David Xu wrote:
> 
>> I don't say no to these interfaces, but there is a need to tell the user 
>> which cpus share a cache, which are closest in memory distance, and 
>> which cpus are servicing interrupts, e.g. network and disk interrupts; 
>> otherwise, blindly setting a cpu affinity mask is just a way to shoot 
>> yourself in the foot.
> 
> While the Mac OS X API is pretty Mach-specific, it's worth taking a look 
> at their recently-announced affinity API:
> 
> http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html 
> 
> 
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
> 


I like the interface; it is more flexible.

Thanks
David Xu

From owner-freebsd-arch@FreeBSD.ORG  Mon Dec 24 10:43:29 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id ECFF816A417
	for <arch@FreeBSD.org>; Mon, 24 Dec 2007 10:43:28 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 9B0F813C46A
	for <arch@FreeBSD.org>; Mon, 24 Dec 2007 10:43:28 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 40AEC46BB2
	for <arch@FreeBSD.org>; Mon, 24 Dec 2007 05:43:28 -0500 (EST)
Date: Mon, 24 Dec 2007 10:43:28 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: arch@FreeBSD.org
Message-ID: <20071224103322.C40176@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: 
Subject: 8.0 network stack MPsafety goals
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 24 Dec 2007 10:43:29 -0000


Dear all:

With the 7.0 release around the corner, many developers are starting to think 
about (and in quite a few cases, work on) their goals for 8.0.  One of our 
on-going kernel projects has been the elimination of the Giant lock, and that 
project has transformed into one of optimizing behavior on increasing 
numbers of processors.

In 7.0, despite the noteworthy accomplishment of eliminating debug.mpsafenet 
and conditional network stack Giant acquisition, we were unable to fully 
eliminate the IFF_NEEDSGIANT flag, which controls the conditional acquisition 
of the Giant lock around non-MPSAFE network device drivers.  Primarily these 
drivers are aging ISA network device drivers, although there are some 
exceptions, such as the USB stack.
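
As an editorial illustration of what "conditional acquisition" means here, 
the following userspace sketch wraps a driver entry point in a global lock 
only when the interface carries a needs-Giant flag.  It is a sketch under 
stated assumptions, not the kernel code: the demo_ names, the flag value, 
and the use of a pthread mutex in place of Giant are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_IFF_NEEDSGIANT 0x1		/* hypothetical flag value */

/* Hypothetical, heavily reduced stand-in for struct ifnet. */
struct demo_ifnet {
	int  if_flags;
	void (*if_start)(struct demo_ifnet *);
};

static pthread_mutex_t demo_giant = PTHREAD_MUTEX_INITIALIZER;
static int demo_giant_acquisitions;	/* counts how often the lock was taken */
static int demo_driver_calls;		/* counts driver entry point calls */

/* Call the driver, taking the global lock only for flagged interfaces. */
static inline void demo_if_start(struct demo_ifnet *ifp)
{
	int needs_giant = (ifp->if_flags & DEMO_IFF_NEEDSGIANT) != 0;

	assert(ifp->if_start != NULL);
	if (needs_giant) {
		pthread_mutex_lock(&demo_giant);
		demo_giant_acquisitions++;
	}
	ifp->if_start(ifp);
	if (needs_giant)
		pthread_mutex_unlock(&demo_giant);
}

/* Trivial driver entry point shared by both test interfaces. */
static inline void demo_driver_start(struct demo_ifnet *ifp)
{
	(void)ifp;
	demo_driver_calls++;
}
```

Removing the flag amounts to deleting both needs_giant branches, which is 
only safe once every remaining driver does its own locking.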

This e-mail proposes the elimination of the IFF_NEEDSGIANT flag and associated 
infrastructure in FreeBSD 8.0, meaning that all network device drivers must be 
able to operate without the Giant lock (largely the case already).  Remaining 
drivers using the IFF_NEEDSGIANT flag must either be updated, or less ideally, 
removed.  I propose the following schedule:

Date		Goals
----		-----
26 Dec 2007	Post proposed schedule for flag and infrastructure removal
 		Post affected driver list

26 Jan 2008	Repost proposed schedule for flag and infrastructure removal
 		Post updated affected driver list

26 Feb 2008	Adjust boot-time printf for affected drivers to generate a loud
 		warning.
 		Post updated affected driver list

26 May 2008	Post HEADS UP of impending driver disabling
 		Post updated affected driver list

26 Jun 2008	Disable build of all drivers requiring IFF_NEEDSGIANT
 		Post updated affected driver list

26 Sep 2008	Post HEADS UP of impending driver removal
 		Post updated affected driver list

26 Oct 2008	Delete source of all drivers requiring IFF_NEEDSGIANT
 		Remove flag and infrastructure

Here is a list of potentially affected drivers:

Name	Bus		Man page description
---	---		--------------------
ar	ISA/PCI		synchronous Digi/Arnet device driver
arl	ISA		Aironet Arlan 655 wireless network adapter driver
awi	PCCARD		AMD PCnetMobile IEEE 802.11 PCMCIA wireless network
 			driver
axe	USB		ASIX Electronics AX88172 USB Ethernet driver
cdce	USB		USB Communication Device Class Ethernet driver
cnw	PCCARD		Netwave AirSurfer wireless network driver
cs	ISA/PCCARD	Ethernet device driver
cue	USB		CATC USB-EL1210A USB Ethernet driver
ex	ISA/PCCARD	Ethernet device driver for the Intel EtherExpress
 			Pro/10 and Pro/10+
fe	CBUS/ISA/PCCARD	Fujitsu MB86960A/MB86965A based Ethernet adapters
ic	I2C		I2C bus system
ie	ISA		Ethernet device driver
kue	USB		Kawasaki LSI KL5KUSB101B USB Ethernet driver
oltr	ISA/PCI		Olicom Token Ring device driver
plip	PPBUS		printer port Internet Protocol driver
ppp	TTY		point to point protocol network interface
ray	PCCARD		Raytheon Raylink/Webgear Aviator PCCard driver
rue	USB		RealTek RTL8150 USB to Fast Ethernet controller driver
rum	USB		Ralink Technology USB IEEE 802.11a/b/g wireless
 			network device
sbni	ISA/PCI		Granch SBNI12 leased line modem driver
sbsh	PCI		Granch SBNI16 SHDSL modem device driver
sl	TTY		slip network interface
snc	ISA/PCCARD	National Semiconductor DP8393X SONIC Ethernet adapter
 			driver
sr	ISA/PCI		synchronous RISCom/N2 / WANic 400/405 device driver
udav	USB		Davicom DM9601 USB Ethernet driver
ural	USB		Ralink Technology RT2500USB IEEE 802.11 driver
xe	PCCARD		Xircom PCMCIA Ethernet device driver
zyd	USB		ZyDAS ZD1211/ZD1211B USB IEEE 802.11b/g wireless
 			network device

In some cases, the requirement for Giant is a property of a subsystem the 
driver depends on rather than of the driver itself; for example, the tty 
subsystem for SLIP and PPP, and the USB subsystem for a number of USB 
ethernet and wireless drivers.  With most of a year to go on the proposed 
schedule, my hope is that we will have lots of time to address these issues, 
but I wanted to get a roadmap out from a network protocol stack architecture 
perspective so that device driver and subsystem authors could have a 
schedule in mind.

FYI, the following drivers also reference IFF_NEEDSGIANT, but only in order to 
provide their own conditional MPSAFEty, which can be removed without affecting 
device driver functionality (I believe):

Name	Bus		Man page description
---	---		--------------------
ce	PCI		driver for synchronous Cronyx Tau-PCI/32 WAN adapters
cp	PCI		driver for synchronous Cronyx Tau-PCI WAN adapters
ctau	ISA		driver for synchronous Cronyx Tau WAN adapters
cx	ISA		driver for synchronous/asynchronous Cronyx Sigma WAN
 			adapters

Developers and users of the above drivers are heavily encouraged to update the 
drivers to remove dependence on Giant, and/or make other contingency plans.

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Mon Dec 24 11:54:10 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3460616A419;
	Mon, 24 Dec 2007 11:54:10 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id EF05B13C4DB;
	Mon, 24 Dec 2007 11:54:09 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 9BE6447911;
	Mon, 24 Dec 2007 06:54:09 -0500 (EST)
Date: Mon, 24 Dec 2007 11:54:09 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: dima <_pppp@mail.ru>
In-Reply-To: <E1J5Pxt-0004O8-00._pppp-mail-ru@f59.mail.ru>
Message-ID: <20071224114504.E40176@fledge.watson.org>
References: <20071220135342.O67327@fledge.watson.org>
	<E1J5Pxt-0004O8-00._pppp-mail-ru@f59.mail.ru>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Re: TCP Projects for 8.0 - first cut wiki page
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 24 Dec 2007 11:54:10 -0000

On Thu, 20 Dec 2007, dima wrote:

>> Per earlier e-mail, I've created a page to track the various on-going
>> projects:
>>
>>    http://wiki.freebsd.org/TCPProjects8
>>
>> Rui has already kindly added the TCP ECN work to the page.
>
> As I know, we have a single swi:net thread in the kernel yet. Are there any 
> plans to make several such threads? If yes, this activity isn't mentioned in 
> wiki.
>
> There are 2 ideas: 1. per-core thread 2. per-interface thread I like the 
> second more.

This is a kind of tricky point, and one we will definitely be looking at.  In 
FreeBSD 6, we did link layer processing in the ithread, and deferred network 
layer and socket layer processing to the netisr and user thread.

In FreeBSD 7, we process up through the network layer and socket delivery in 
the ithread, and only the socket read/copyout are deferred to the user thread. 
This means that in FreeBSD 7, we get true parallelism between different input 
sources.  We still have the netisr, which is used for certain types of 
deferred processing, such as loopback network traffic (in order to avoid 
entering the receive path from the transmit path), IPSEC tunnel processing, 
etc, but for general ethernet traffic, it is not used.  This appears to work 
really well for a small number of interfaces because we eliminate a large 
number of context switches, and push the "drop point" from software into 
hardware, meaning that we don't burn cycles doing link layer processing for 
packets that will never make it to the network layer (netisr queue overflow).
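
The "drop point" argument can be made concrete with a deliberately crude toy 
model in Python.  It assumes a burst of packets arrives before the netisr 
gets to run, and that in the direct dispatch case anything beyond the 
receive ring is dropped by hardware.  The function names, the dictionary 
keys and the two limits are inventions for this example and do not 
correspond to kernel interfaces.

```python
from collections import deque


def deferred_input(packets, queue_limit):
    """FreeBSD 6 style model: link layer first, then queue for the netisr."""
    queue = deque()
    link_work = 0
    wasted = 0
    for pkt in packets:
        link_work += 1                 # link layer cycles are spent up front
        if len(queue) >= queue_limit:
            wasted += 1                # queue overflow: that work is thrown away
        else:
            queue.append(pkt)
    # The netisr would drain the queue later; everything queued is delivered.
    return {"link_work": link_work, "delivered": len(queue), "wasted": wasted}


def direct_input(packets, ring_limit):
    """FreeBSD 7 style model: each accepted packet runs to completion."""
    accepted = list(packets)[:ring_limit]   # excess is dropped before software
    return {"link_work": len(accepted), "delivered": len(accepted), "wasted": 0}
```

With ten packets and a limit of four, both models deliver four packets, but 
the deferred model does link layer work for all ten and discards six of 
those results, while the direct model does link layer work for four only.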

The two real downsides are that this promotes network layer processing to 
interrupt priority rather than soft interrupt priority (and this may propagate 
to other threads), and that the opportunity for parallelism is reduced 
between the link layer and the network processing layer.  The reason we went 
ahead and made the default change (it's configurable at runtime) is that it 
seemed that in most cases, we saw a significant performance improvement.

However, the current ithread/direct dispatch model has scaling issues as we 
approach larger numbers of interfaces, as the ithread approach does generally, 
because when the number of active threads exceeds the number of cores and the 
system is really busy, context switches are re-introduced, as well as an 
increased chance of ithreads bouncing around, etc.  What to do at that point 
is an interesting question--would we be better off reducing the number of 
active threads so that we have a small ithread worker pool serving many 
devices, for example?
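
One possible reading of "a small ithread worker pool serving many devices" 
is sketched below in Python.  This is only an assumption about what such a 
scheme could look like: the round-robin policy and the name 
assign_interfaces are made up for this example and are not a proposal from 
this mail.

```python
def assign_interfaces(interfaces, ncores):
    """Spread interfaces over at most ncores workers, round-robin."""
    nworkers = max(1, min(ncores, len(interfaces)))
    pool = [[] for _ in range(nworkers)]
    for index, ifname in enumerate(interfaces):
        pool[index % nworkers].append(ifname)
    return pool
```

With five interfaces and two cores, each worker serves two or three devices, 
so the number of runnable input threads never exceeds the number of cores.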

So, in answer to your original question: we already do a per-interface thread 
for all in-bound processing in FreeBSD 7, but we'll need to continue to work 
on the underlying model and its behavior under high load.

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 25 03:35:13 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1962F16A420
	for <arch@freebsd.org>; Tue, 25 Dec 2007 03:35:13 +0000 (UTC)
	(envelope-from brian.mcginty@gmail.com)
Received: from wx-out-0506.google.com (wx-out-0506.google.com [66.249.82.239])
	by mx1.freebsd.org (Postfix) with ESMTP id D287213C465
	for <arch@freebsd.org>; Tue, 25 Dec 2007 03:35:12 +0000 (UTC)
	(envelope-from brian.mcginty@gmail.com)
Received: by wx-out-0506.google.com with SMTP id i29so570603wxd.7
	for <arch@freebsd.org>; Mon, 24 Dec 2007 19:35:12 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	bh=TyTAIqfRtGCfHBSavDY+wQvZFqsa4x36vZGfFvbI6c8=;
	b=G/Vez16mmWHBJVA9OjJ+VFZKP2b0+k8nanRsF1s1L0VIqlXetO0sqcAFAC+3N+XCr5ZDSDMqKjNDMTfznbHMY7mutt0RXq4T/FVlyBnfQSXBU8qqH7KbY5I/CXL0NKFZpSzhuRhYXoATvnnOwbaOVm21iJcA4CjbXKT/h98InPg=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	b=UxE1wnbaOJvHwlSCBLV9y7UkvVGMPr/gDtML8wnaDxGJc1RrO+FK7MZf332nWktqwQH31NPwlQljEPwDQtlc5cCnCrFEFRmIc/UqORdyG6nGfm3PNpahiau7Dp+QKi1jSU6Lw0yG2kPnxCCzurbkpNSkvdfZphLyuHpHbBzZe7M=
Received: by 10.70.22.16 with SMTP id 16mr3574395wxv.45.1198552179856;
	Mon, 24 Dec 2007 19:09:39 -0800 (PST)
Received: by 10.70.17.20 with HTTP; Mon, 24 Dec 2007 19:09:39 -0800 (PST)
Message-ID: <601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com>
Date: Mon, 24 Dec 2007 19:09:39 -0800
From: "Brian McGinty" <brian.mcginty@gmail.com>
To: "David Xu" <davidxu@freebsd.org>
In-Reply-To: <476F0EE5.1040404@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org>
	<20071222183700.L5866@fledge.watson.org>
	<476F0EE5.1040404@freebsd.org>
Cc: arch@freebsd.org, Robert Watson <rwatson@freebsd.org>
Subject: Re: Linux compatible setaffinity.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Dec 2007 03:35:13 -0000

On Dec 23, 2007 5:44 PM, David Xu <davidxu@freebsd.org> wrote:
>
> Robert Watson wrote:
> > On Fri, 21 Dec 2007, David Xu wrote:
> >
> >> I don't say no to these interfaces, but there is a need to tell user
> >> which cpus are sharing cache, or memory distance is closest enough,
> >> and which cpus are servicing interrupts, e.g, network interrupt and
> >> disks etc, etc, otherwise, blindly setting cpu affinity mask only can
> >> shoot itself in the foot.
> >
> > While the Mac OS X API is pretty Mach-specific, it's worth taking a look
> > at their recently-announced affinity API:
> >
> > http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html
> >
> >
> > Robert N M Watson
> > Computer Laboratory
> > University of Cambridge
> >
>
>
> I like the interfaces, it is more flexible.

I agree. May I ask what's being planned? It's Jeff's call finally I think.

Brian.

From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 25 03:52:04 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@hub.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5A60D16A41A;
	Tue, 25 Dec 2007 03:52:04 +0000 (UTC)
	(envelope-from davidxu@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 4FC1D13C455;
	Tue, 25 Dec 2007 03:52:04 +0000 (UTC)
	(envelope-from davidxu@FreeBSD.org)
Received: from apple.my.domain (root@localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id lBP3q0Tb054785;
	Tue, 25 Dec 2007 03:52:02 GMT (envelope-from davidxu@freebsd.org)
Message-ID: <47707EA2.8010002@freebsd.org>
Date: Tue, 25 Dec 2007 11:53:06 +0800
From: David Xu <davidxu@FreeBSD.org>
User-Agent: Thunderbird 2.0.0.9 (X11/20071211)
MIME-Version: 1.0
To: Brian McGinty <brian.mcginty@gmail.com>
References: <20071219211025.T899@desktop>
	<476B1973.6070902@freebsd.org>	<20071222183700.L5866@fledge.watson.org>	<476F0EE5.1040404@freebsd.org>
	<601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com>
In-Reply-To: <601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>
Subject: Re: Linux compatible setaffinity.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Dec 2007 03:52:04 -0000

Brian McGinty wrote:
> On Dec 23, 2007 5:44 PM, David Xu <davidxu@freebsd.org> wrote:
>> Robert Watson wrote:
>>> On Fri, 21 Dec 2007, David Xu wrote:
>>>
>>>> I don't say no to these interfaces, but there is a need to tell user
>>>> which cpus are sharing cache, or memory distance is closest enough,
>>>> and which cpus are servicing interrupts, e.g, network interrupt and
>>>> disks etc, etc, otherwise, blindly setting cpu affinity mask only can
>>>> shoot itself in the foot.
>>> While the Mac OS X API is pretty Mach-specific, it's worth taking a look
>>> at their recently-announced affinity API:
>>>
>>> http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html
>>>
>>>
>>> Robert N M Watson
>>> Computer Laboratory
>>> University of Cambridge
>>>
>>
>> I like the interfaces, it is more flexible.
> 
> I agree. May I ask what's being planned? It's Jeff's call finally I think.
> 
> Brian.

I don't have a plan. ;-) If I understand it correctly, it is a hint to the
scheduler that better describes thread relationships, while Jeff's
interface is a hard cpu binding interface, which is still needed in some 
circumstances.

Regards,


From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 25 05:19:44 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0536416A417;
	Tue, 25 Dec 2007 05:19:44 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com
	[216.240.101.25])
	by mx1.freebsd.org (Postfix) with ESMTP id CB1C813C455;
	Tue, 25 Dec 2007 05:19:43 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com
	[24.94.75.93]) (authenticated bits=0)
	by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id
	lBP5JeGG048514; Tue, 25 Dec 2007 00:19:42 -0500 (EST)
	(envelope-from jroberson@chesapeake.net)
Date: Mon, 24 Dec 2007 19:21:10 -1000 (HST)
From: Jeff Roberson <jroberson@chesapeake.net>
X-X-Sender: jroberson@desktop
To: David Xu <davidxu@freebsd.org>
In-Reply-To: <47707EA2.8010002@freebsd.org>
Message-ID: <20071224191954.Q73903@desktop>
References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org>
	<20071222183700.L5866@fledge.watson.org> <476F0EE5.1040404@freebsd.org>
	<601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com>
	<47707EA2.8010002@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Brian McGinty <brian.mcginty@gmail.com>,
	Robert Watson <rwatson@freebsd.org>, arch@freebsd.org
Subject: Re: Linux compatible setaffinity.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Dec 2007 05:19:44 -0000

On Tue, 25 Dec 2007, David Xu wrote:

> Brian McGinty wrote:
>> On Dec 23, 2007 5:44 PM, David Xu <davidxu@freebsd.org> wrote:
>>> Robert Watson wrote:
>>>> On Fri, 21 Dec 2007, David Xu wrote:
>>>> 
>>>>> I don't say no to these interfaces, but there is a need to tell user
>>>>> which cpus are sharing cache, or memory distance is closest enough,
>>>>> and which cpus are servicing interrupts, e.g, network interrupt and
>>>>> disks etc, etc, otherwise, blindly setting cpu affinity mask only can
>>>>> shoot itself in the foot.
>>>> While the Mac OS X API is pretty Mach-specific, it's worth taking a look
>>>> at their recently-announced affinity API:
>>>> 
>>>> http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html
>>>> 
>>>> 
>>>> Robert N M Watson
>>>> Computer Laboratory
>>>> University of Cambridge
>>>> 
>>> 
>>> I like the interfaces, it is more flexible.
>> 
>> I agree. May I ask what's being planned? It's Jeff's call finally I think.
>> 
>> Brian.
>
> I don't have plan. ;-) If I understand it correctly, it is a hint to
> scheduler, it is better describing thread relationship, while Jeff's
> interface is a hard cpu binding interface, it is still needed in some 
> circumstance.

Yes, I don't think they're exclusive.

However, the system scheduler makes some observations about what threads 
might be best placed near each other.  I have plans to make ULE even 
smarter in this regard so that the application developers would almost 
never need to hint it.  I think these kinds of hints are not often correct 
or very useful anyway.

Thanks,
Jeff


From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 25 20:10:54 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8811216A419
	for <arch@freebsd.org>; Tue, 25 Dec 2007 20:10:54 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 454A613C4D9
	for <arch@freebsd.org>; Tue, 25 Dec 2007 20:10:54 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id E976146B94;
	Tue, 25 Dec 2007 15:10:53 -0500 (EST)
Date: Tue, 25 Dec 2007 20:10:53 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Jeff Roberson <jroberson@chesapeake.net>
In-Reply-To: <20071219211025.T899@desktop>
Message-ID: <20071225201012.S85517@fledge.watson.org>
References: <20071219211025.T899@desktop>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org
Subject: Re: Linux compatible setaffinity.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Dec 2007 20:10:54 -0000


On Wed, 19 Dec 2007, Jeff Roberson wrote:

> I have implemented a linux compatible sched_setaffinity() call which is 
> somewhat crippled.  This allows a userspace process to supply a bitmask of 
> processors which it will run on.  I have copied the linux interface such 
> that it should be api compatible because I believe it is a sensible 
> interface and they beat us to it by 3 years.

BTW, I notice that you declare sched_getaffinity() in the user include file, 
but don't reserve a system call in syscalls.master or implement it.  Is this 
intentional?
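
For readers unfamiliar with the Linux-style interface, the bitmask idea in 
the quoted text can be sketched in Python, whose os module exposes 
sched_getaffinity() and sched_setaffinity() on platforms that provide them 
(Linux, for example).  The helper names below are made up for this example, 
and the sketch says nothing about how the patch under discussion is 
implemented.

```python
import os


def mask_from_cpus(cpus):
    """Build an integer bitmask with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def cpus_from_mask(mask):
    """Inverse of mask_from_cpus: return the set of CPU numbers in mask."""
    return {bit for bit in range(mask.bit_length()) if mask & (1 << bit)}


def pin_to_first_allowed_cpu(pid=0):
    """Restrict pid (0 = caller) to the lowest CPU it may already run on.

    Returns the resulting CPU set, or None when the platform offers no
    affinity calls or the request is refused.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        allowed = os.sched_getaffinity(pid)
        os.sched_setaffinity(pid, {min(allowed)})
        return os.sched_getaffinity(pid)
    except OSError:
        return None
```

A caller that only wants a temporary binding would save the result of 
os.sched_getaffinity(0) first and pass it back to os.sched_setaffinity() 
afterwards.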

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 26 07:51:42 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6CA9E16A417;
	Wed, 26 Dec 2007 07:51:42 +0000 (UTC)
	(envelope-from deischen@freebsd.org)
Received: from mail.netplex.net (mail.netplex.net [204.213.176.10])
	by mx1.freebsd.org (Postfix) with ESMTP id 34AAA13C46A;
	Wed, 26 Dec 2007 07:51:41 +0000 (UTC)
	(envelope-from deischen@freebsd.org)
Received: from sea.ntplx.net (sea.ntplx.net [204.213.176.11])
	by mail.netplex.net (8.14.2/8.14.2/NETPLEX) with ESMTP id
	lBQ7pdfg003814; Wed, 26 Dec 2007 02:51:40 -0500 (EST)
X-Virus-Scanned: by AMaViS and Clam AntiVirus (mail.netplex.net)
X-Greylist: Message whitelisted by DRAC access database, not delayed by
	milter-greylist-4.0 (mail.netplex.net [204.213.176.10]);
	Wed, 26 Dec 2007 02:51:40 -0500 (EST)
Date: Wed, 26 Dec 2007 02:51:39 -0500 (EST)
From: Daniel Eischen <deischen@freebsd.org>
X-X-Sender: eischen@sea.ntplx.net
To: David Xu <davidxu@freebsd.org>
In-Reply-To: <47707EA2.8010002@freebsd.org>
Message-ID: <Pine.GSO.4.64.0712260250140.14817@sea.ntplx.net>
References: <20071219211025.T899@desktop> <476B1973.6070902@freebsd.org>
	<20071222183700.L5866@fledge.watson.org> <476F0EE5.1040404@freebsd.org>
	<601bffc40712241909t10e6f3k8e7940d387b6efc2@mail.gmail.com>
	<47707EA2.8010002@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Brian McGinty <brian.mcginty@gmail.com>,
	Robert Watson <rwatson@freebsd.org>, arch@freebsd.org
Subject: Re: Linux compatible setaffinity.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Daniel Eischen <deischen@freebsd.org>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Dec 2007 07:51:42 -0000

On Tue, 25 Dec 2007, David Xu wrote:

> Brian McGinty wrote:
>> On Dec 23, 2007 5:44 PM, David Xu <davidxu@freebsd.org> wrote:
>>> Robert Watson wrote:
>>>> On Fri, 21 Dec 2007, David Xu wrote:
>>>> 
>>>>> I don't say no to these interfaces, but there is a need to tell user
>>>>> which cpus are sharing cache, or memory distance is closest enough,
>>>>> and which cpus are servicing interrupts, e.g, network interrupt and
>>>>> disks etc, etc, otherwise, blindly setting cpu affinity mask only can
>>>>> shoot itself in the foot.
>>>> While the Mac OS X API is pretty Mach-specific, it's worth taking a look
>>>> at their recently-announced affinity API:
>>>> 
>>>> http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/index.html
>>>> 
>>>> 
>>>> Robert N M Watson
>>>> Computer Laboratory
>>>> University of Cambridge
>>>> 
>>> 
>>> I like the interfaces, it is more flexible.
>> 
>> I agree. May I ask what's being planned? It's Jeff's call finally I think.
>> 
>> Brian.
>
> I don't have plan. ;-) If I understand it correctly, it is a hint to
> scheduler, it is better describing thread relationship, while Jeff's
> interface is a hard cpu binding interface, it is still needed in some 
> circumstance.

Please take a look at Solaris' API for processor set binding:

   http://docs.sun.com/app/docs/doc/816-5167/6mbb2jae6?a=expand

See processor_bind, processor_info, and pset_*.

-- 
DE

From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 26 07:56:55 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 613D416A420
	for <arch@freebsd.org>; Wed, 26 Dec 2007 07:56:55 +0000 (UTC)
	(envelope-from edwin@mavetju.org)
Received: from mail5out.barnet.com.au (mail5.barnet.com.au [202.83.178.78])
	by mx1.freebsd.org (Postfix) with ESMTP id 163F013C4D5
	for <arch@freebsd.org>; Wed, 26 Dec 2007 07:56:54 +0000 (UTC)
	(envelope-from edwin@mavetju.org)
Received: by mail5out.barnet.com.au (Postfix, from userid 1001)
	id AAA8F2218A90; Wed, 26 Dec 2007 18:56:53 +1100 (EST)
X-Viruscan-Id: <4772094500007C2CD87F14@BarNet>
Received: from mail5auth.barnet.com.au (mail5.barnet.com.au [202.83.178.78])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail5auth.barnet.com.au",
	Issuer "*.barnet.com.au" (verified OK))
	by mail5.barnet.com.au (Postfix) with ESMTP id 74E4B21B18F7;
	Wed, 26 Dec 2007 18:56:53 +1100 (EST)
Received: from k7.mavetju (k7.mavetju.org [10.251.1.18])
	by mail5auth.barnet.com.au (Postfix) with ESMTP id 0E0202218A87;
	Wed, 26 Dec 2007 18:56:53 +1100 (EST)
Received: by k7.mavetju (Postfix, from userid 1001)
	id 87F4D286; Wed, 26 Dec 2007 18:56:52 +1100 (EST)
Date: Wed, 26 Dec 2007 18:56:52 +1100
From: Edwin Groothuis <edwin@mavetju.org>
To: arch@freebsd.org, gnn@freebsd.org
Message-ID: <20071226075652.GC40967@k7.mavetju>
References: <20071209223042.GA40965@k7.mavetju>
	<m2ve76k4md.wl%gnn@neville-neil.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <m2ve76k4md.wl%gnn@neville-neil.com>
User-Agent: Mutt/1.4.2.3i
Cc: 
Subject: Re: bin/118292: Add support to remove all msg/shm/sem ids with ipcrm
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Dec 2007 07:56:55 -0000

On Mon, Dec 10, 2007 at 09:26:34AM -0500, gnn@freebsd.org wrote:
> At Mon, 10 Dec 2007 09:30:42 +1100,
> Edwin Groothuis wrote:
> > 
> > Hello,
> > 
> > A friend of mine has submitted this PR and I promised him that I would
> > see if I could get it implemented. I couldn't find anybody directly
> > responsible for the ipcs/ipcrm tools, so I throw it in here for
> > discussion.
[...]
> > 
> > I will do it in two parts (according to the wishes of my mentor):
> > First style(9)ify ipcrm.c, then the patch.
> > 
> > If anybody has a good observation on this change, please speak up now.
> 
> I have not read the patch in detail but I like the idea, we should be
> able to easily clean such things up.

It has been committed to HEAD; it will be MFCed when the src freezes
are over.

Edwin
-- 
Edwin Groothuis      |            Personal website: http://www.mavetju.org
edwin@mavetju.org    |              Weblog: http://www.mavetju.org/weblog/

From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 26 09:19:32 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BCF4F16A41A
	for <arch@freebsd.org>; Wed, 26 Dec 2007 09:19:32 +0000 (UTC)
	(envelope-from mindactive@ecastnews.com)
Received: from attivonet.net (attivonet.net [72.3.236.78])
	by mx1.freebsd.org (Postfix) with ESMTP id 861E013C46E
	for <arch@freebsd.org>; Wed, 26 Dec 2007 09:19:32 +0000 (UTC)
	(envelope-from mindactive@ecastnews.com)
Received: (qmail 21826 invoked by uid 48); 26 Dec 2007 00:16:28 -0600
To: arch@freebsd.org
Received: from mailer by www.ecastnews.com with HTTP (Mail);
	Wed, 26 Dec 2007 00:16:28 -0600
Date: Wed, 26 Dec 2007 00:16:28 -0600
From: "FullMotionMail.com" <info@fullmotionmail.com>
Message-ID: <3e8bbcb199281fbe09574cd1dd29cfe4@www.ecastnews.com>
X-Priority: 3
X-Mailer: AC Mailer
X-mid: YXJjaEBmcmVlYnNkLm9yZyAsIG0yOQ==
MIME-Version: 1.0
Content-Type: text/plain; charset = "utf-8"
Content-Transfer-Encoding: 8bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: 
Subject: FullMotionMail - Your Free Video eMail Source
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: info@fullmotionmail.com
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Dec 2007 09:19:32 -0000

SHARE YOUR HOLIDAY SPIRIT AND SEND A FREE VIDEO EMAIL!

FullMotion VideoMail
http://www.fullmotionmail.com

WELCOME!
This message comes to you from FullMotionMail.com, your free source for sending personal video messages to your friends and family.

HOW DOES IT WORK?
If your computer has a built -in video cam, its a snap. You can also easily connect your video camera to your firewire or USB port.
			
Then simply click on the FullMotionMail.com link to create your message, the system will detect your camera, or choose your connection form the list and your ready to go. The whole thing works completely within your browser (no special software is required). You can also select from a variety of templates to use as backgrounds for your message.

1: Select "Allow"
2: Choose a theme
3: Select your camera (Your image will show up immediately if your camera is connected and turned on.  Choose your Mic the same way)
4: Press record (Once setup is complete, press "Record" and record your video mail. When you finish, you can review it, send it, or record it again if you need to)
5: Click "LIKE IT" and send (Once you like your recording, click "LIKE IT" and send your video mail to up to 6 people at once.  Send them a short message as well and your message will come to them in an eMail.  A link will take them to the site and the theme you chose as well as video to watch and create their own)

EMBED CODE
Copy & paste the code below and place it in a blog or social network such as Facebook or MySpace. Then take it one step further. If you have a website, mySpace, Facebook or something like a WordPress blog, you can embed a small piece of code and add this awesome widget to your page. This will allow your visitors to send their messages and so on and so on and so on.

All usage is free and requires no user account. No user information is collected or stored for use. Your video will be available for viewing for 3 months (and you can always create another).

The holidays are a time for you to share the spirit with your family and friends, so have fun, make a quick and easy little video, and put a smile on their faces.

FullMotionMail.com is brought to you by: www.mindactive.com

MindActive - Digital Marketing Innovation.

To Unsubscribe, please click here :
http://www.ecastnews.com/listServer/box.php?funcml=unsub2&nl=15&mi=29&email=arch@freebsd.org
(c) 2007 MindActive Design Studio LLC company.


From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 26 19:41:11 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B07B016A418
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 19:41:11 +0000 (UTC)
	(envelope-from aryeh.friedman@gmail.com)
Received: from mta4.srv.hcvlny.cv.net (mta4.srv.hcvlny.cv.net [167.206.4.199])
	by mx1.freebsd.org (Postfix) with ESMTP id 89A8B13C47E
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 19:41:11 +0000 (UTC)
	(envelope-from aryeh.friedman@gmail.com)
Received: from flosoft.no-ip.biz
	(ool-435559b8.dyn.optonline.net [67.85.89.184]) by
	mta4.srv.hcvlny.cv.net
	(Sun Java System Messaging Server 6.2-8.04 (built Feb 28 2007))
	with ESMTP id <0JTO002047YAP090@mta4.srv.hcvlny.cv.net> for
	freebsd-arch@freebsd.org; Wed, 26 Dec 2007 14:10:59 -0500 (EST)
Received: from flosoft.no-ip.biz (localhost [IPv6:::1])
	by flosoft.no-ip.biz (8.14.2/8.14.2) with ESMTP id lBQJAw7r019014	for
	<freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 14:10:58 -0500
Date: Wed, 26 Dec 2007 14:10:58 -0500
From: "Aryeh M. Friedman" <aryeh.friedman@gmail.com>
To: freebsd-arch@freebsd.org
Message-id: <4772A742.4050106@gmail.com>
MIME-version: 1.0
Content-type: text/plain; charset=ISO-8859-1
Content-transfer-encoding: 7BIT
X-Enigmail-Version: 0.95.5
User-Agent: Thunderbird 2.0.0.9 (X11/20071217)
Subject: Adding better database support to the base system
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Dec 2007 19:41:11 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Currently the only available DB support in the base system is Berkeley
DB (1.x). There are several items that would benefit from migrating
something like minisql into the base system.  The most immediate
application that comes to mind is enabling some interesting features
for the ports system.  Therefore I propose migrating some minimal
RDBMS features into the base system.

- --
Aryeh M. Friedman
FloSoft Systems
http://www.flosoft-systems.com
Developer, not business, friendly
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHcqdCzIOMjAek4JIRAqw2AJ0Z1/xKy/fEafbQVP18oUDq2HPz9QCfbQBU
1cjpr9Wy/6zdXUT79tMJvoI=
=sPik
-----END PGP SIGNATURE-----


From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 26 22:08:34 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 52CB016A418
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 22:08:34 +0000 (UTC)
	(envelope-from dougb@FreeBSD.org)
Received: from mail2.fluidhosting.com (mx21.fluidhosting.com [204.14.89.4])
	by mx1.freebsd.org (Postfix) with SMTP id ECF5F13C457
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 22:08:33 +0000 (UTC)
	(envelope-from dougb@FreeBSD.org)
Received: (qmail 16908 invoked by uid 399); 26 Dec 2007 22:08:33 -0000
Received: from localhost (HELO ?192.168.0.4?) (dougb@dougbarton.us@127.0.0.1)
	by localhost with ESMTP; 26 Dec 2007 22:08:33 -0000
X-Originating-IP: 127.0.0.1
Message-ID: <4772D0DF.2030505@FreeBSD.org>
Date: Wed, 26 Dec 2007 14:08:31 -0800
From: Doug Barton <dougb@FreeBSD.org>
Organization: http://www.FreeBSD.org/
User-Agent: Thunderbird 2.0.0.9 (Windows/20071031)
MIME-Version: 1.0
To: "Aryeh M. Friedman" <aryeh.friedman@gmail.com>
References: <4772A742.4050106@gmail.com>
In-Reply-To: <4772A742.4050106@gmail.com>
X-Enigmail-Version: 0.95.5
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@freebsd.org
Subject: Re: Adding better database support to the base system
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Dec 2007 22:08:34 -0000

Aryeh M. Friedman wrote:
> Currently the only available DB support in the base system is Berkeley
> DB (1.x). There are several items that would benefit from migrating
> something like minisql into the base system.  The most immediate
> application that comes to mind is enabling some interesting features
> for the ports system.  Therefore I propose migrating some minimal
> RDBMS features into the base system.

To get any sort of useful feedback your recommendation has to be much
more specific. You should also focus on candidates that are
BSD-licensed, or equivalent.

Doug

-- 

    This .signature sanitized for your protection

From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 26 23:31:38 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C8B8A16A419
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 23:31:37 +0000 (UTC)
	(envelope-from freebsd-arch@m.gmane.org)
Received: from ciao.gmane.org (main.gmane.org [80.91.229.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 826D913C447
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 23:31:37 +0000 (UTC)
	(envelope-from freebsd-arch@m.gmane.org)
Received: from root by ciao.gmane.org with local (Exim 4.43)
	id 1J7eq8-0006eW-1u
	for freebsd-arch@freebsd.org; Wed, 26 Dec 2007 22:35:04 +0000
Received: from 78-0-77-181.adsl.net.t-com.hr ([78.0.77.181])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 22:35:04 +0000
Received: from ivoras by 78-0-77-181.adsl.net.t-com.hr with local (Gmexim 0.1
	(Debian)) id 1AlnuQ-0007hv-00
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 22:35:04 +0000
X-Injected-Via-Gmane: http://gmane.org/
To: freebsd-arch@freebsd.org
From: Ivan Voras <ivoras@freebsd.org>
Date: Wed, 26 Dec 2007 23:33:09 +0100
Lines: 32
Message-ID: <fkukr6$nd8$2@ger.gmane.org>
References: <4772A742.4050106@gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature";
	boundary="------------enigD25FCDF5142033084E5624BD"
X-Complaints-To: usenet@ger.gmane.org
X-Gmane-NNTP-Posting-Host: 78-0-77-181.adsl.net.t-com.hr
User-Agent: Thunderbird 2.0.0.9 (Windows/20071031)
In-Reply-To: <4772A742.4050106@gmail.com>
X-Enigmail-Version: 0.95.5
Sender: news <news@ger.gmane.org>
Subject: Re: Adding better database support to the base system
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Dec 2007 23:31:38 -0000

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enigD25FCDF5142033084E5624BD
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Aryeh M. Friedman wrote:
> Currently the only available DB support in the base system is Berkeley
> DB (1.x). There are several items that would benefit from migrating
> something like minisql into the base system.  The most immediate
> application that comes to mind is enabling some interesting features
> for the ports system.  Therefore I propose migrating some minimal
> RDBMS features into the base system.

Been there, tried that (SQLite), but unsuccessfully - people here REALLY
like text files :)


--------------enigD25FCDF5142033084E5624BD
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHctamldnAQVacBcgRAhNmAKD8Fd8CeMkj+TUhNoCuiFvggqI52ACfSnxw
GUUfD0NkMsN1GA9k19zGDJg=
=dyoy
-----END PGP SIGNATURE-----

--------------enigD25FCDF5142033084E5624BD--


From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 26 23:39:08 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8914C16A41A
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 23:39:08 +0000 (UTC)
	(envelope-from aryeh.friedman@gmail.com)
Received: from mta4.srv.hcvlny.cv.net (mta4.srv.hcvlny.cv.net [167.206.4.199])
	by mx1.freebsd.org (Postfix) with ESMTP id 63CD813C43E
	for <freebsd-arch@freebsd.org>; Wed, 26 Dec 2007 23:39:08 +0000 (UTC)
	(envelope-from aryeh.friedman@gmail.com)
Received: from flosoft.no-ip.biz
	(ool-435559b8.dyn.optonline.net [67.85.89.184]) by
	mta4.srv.hcvlny.cv.net
	(Sun Java System Messaging Server 6.2-8.04 (built Feb 28 2007))
	with ESMTP id <0JTO002CYKCCGZD0@mta4.srv.hcvlny.cv.net>; Wed,
	26 Dec 2007 18:38:39 -0500 (EST)
Received: from flosoft.no-ip.biz (localhost [IPv6:::1])
	by flosoft.no-ip.biz (8.14.2/8.14.2) with ESMTP id lBQNcRGY023924; Wed,
	26 Dec 2007 18:38:29 -0500
Date: Wed, 26 Dec 2007 18:38:27 -0500
From: "Aryeh M. Friedman" <aryeh.friedman@gmail.com>
In-reply-to: <fkukr6$nd8$2@ger.gmane.org>
To: Ivan Voras <ivoras@freebsd.org>
Message-id: <4772E5F3.4010907@gmail.com>
MIME-version: 1.0
Content-type: text/plain; charset=UTF-8
Content-transfer-encoding: 7BIT
X-Enigmail-Version: 0.95.5
References: <4772A742.4050106@gmail.com> <fkukr6$nd8$2@ger.gmane.org>
User-Agent: Thunderbird 2.0.0.9 (X11/20071217)
Cc: freebsd-arch@freebsd.org
Subject: Re: Adding better database support to the base system
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Dec 2007 23:39:08 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ivan Voras wrote:
> Aryeh M. Friedman wrote:
>> Currently the only available DB support in the base system is Berkeley
>> DB (1.x). There are several items that would benefit from migrating
>> something like minisql into the base system.  The most immediate
>> application that comes to mind is enabling some interesting features
>> for the ports system.  Therefore I propose migrating some minimal
>> RDBMS features into the base system.
>
> Been there, tried that (SQLite), but unsuccessfully - people here REALLY
> like text files :)
>
That's funny, because Berkeley DB and some other tools in the base
system write binary files.  That being said, if worst comes to
worst it is not that hard (in some ways at least) to layer a
non-command language RDBMS on top of Berkeley DB (i.e. it has a
relational API but no user level commands).  Basically all that one
needs to do is group keyed values into structured records vs. free
form data.
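
A minimal sketch of what "structured records" over a key/value store
could look like, assuming a fixed-layout record packed into the opaque
byte value the store keeps for each key. The names struct kv,
struct port_rec, rec_pack() and rec_unpack() are hypothetical and are
not part of any Berkeley DB API:

```c
#include <assert.h>
#include <string.h>

/* A byte blob shaped like the value half of a key/value pair (hypothetical). */
struct kv {
    const void *data;
    size_t      size;
};

/* A fixed-layout record that replaces a free-form value (hypothetical). */
struct port_rec {
    char     name[32];
    char     version[16];
    unsigned ndeps;
};

/* Present a record as the opaque value a key/value store would keep. */
static struct kv rec_pack(const struct port_rec *r)
{
    struct kv v = { r, sizeof *r };
    assert(r != NULL);
    return v;
}

/* Recover a record from a stored value; reject values of the wrong size. */
static int rec_unpack(struct kv v, struct port_rec *out)
{
    if (v.size != sizeof *out)
        return -1;
    memcpy(out, v.data, sizeof *out);
    return 0;
}
```

The size check is the only schema enforcement here; a real layer would
also need versioning and byte-order handling for the stored records.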

- --
Aryeh M. Friedman
FloSoft Systems
http://www.flosoft-systems.com
Developer, not business, friendly
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFHcuXzzIOMjAek4JIRApeuAJ9OI2tjERLJ45kLUtbaNepydnlOOwCYluh3
3E2dyo6hEOjcS+pllXmRuA==
=4/Wm
-----END PGP SIGNATURE-----


From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 26 23:48:35 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C758316A420
	for <arch@freebsd.org>; Wed, 26 Dec 2007 23:48:35 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outQ.internet-mail-service.net (outQ.internet-mail-service.net
	[216.240.47.240])
	by mx1.freebsd.org (Postfix) with ESMTP id AFA7113C457
	for <arch@freebsd.org>; Wed, 26 Dec 2007 23:48:35 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160)
	by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP;
	Wed, 26 Dec 2007 15:48:34 -0800
Received: from julian-mac.elischer.org (localhost [127.0.0.1])
	by idiom.com (Postfix) with ESMTP id CC2C9126D82;
	Wed, 26 Dec 2007 15:48:33 -0800 (PST)
Message-ID: <4772E859.3090005@elischer.org>
Date: Wed, 26 Dec 2007 15:48:41 -0800
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031)
MIME-Version: 1.0
To: FreeBSD Net <net@FreeBSD.org>, arch@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Li, Qing" <qing.li@bluecoat.com>, Robert Watson <rwatson@freebsd.org>
Subject: multiple routing tables roadmap
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Dec 2007 23:48:35 -0000

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on, is "policy based routing", which allows different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x), but FreeBSD in general needs it, so I might as well
do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to instantiate
multiple kernel routing tables (which I will now refer to as
"Forwarding Information Bases" or "FIBs" for political correctness
reasons). Which FIB a particular packet uses to make the next hop decision
can be decided by a number of mechanisms. The policies these mechanisms
implement are the "Policies" referred to in "Policy based routing".

One of the constraints I have if I try to back port this work to 6.x is that
it must be implemented as an EXTENSION to the existing ABIs in 6.x so that
third party applications do not need to be recompiled in the timespan
of the branch.

Implementation method, (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x (also in Perforce, though not yet caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (sufficient for my purposes in 6.x) and
implements the changes needed to allow IPv4 to use them. I have not done
the changes for IPv6 simply because I do not need it, and I do not
have enough knowledge of IPv6 (e.g. neighbor discovery) to do it.

Other protocol families are left untouched and should there be 
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB code
starts everything off with a single dimensional array of pointers 
to FIB head structures (One per protocol family), each of which in 
turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to extend that
array to be a 2 dimensional array, so that instead of protocol family X
looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X],
where for all protocol families except IPv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
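
A minimal userland sketch of that layout, assuming made-up sizes and
names (NUM_FIBS, NUM_AF, AF_V4, fib_tables, legacy_view() and
table_for() are all hypothetical and are not taken from the patch).
It only shows why row 0 still looks like the old one dimensional
array to code that knows nothing about FIBs:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FIBS 4   /* hypothetical compile-time FIB count */
#define NUM_AF   8   /* hypothetical number of protocol families */
#define AF_V4    2   /* stand-in index for the IPv4 family */

struct fib_head { int id; };   /* stand-in for a per-family table head */

/* Row = FIB number, column = protocol family. */
static struct fib_head *fib_tables[NUM_FIBS][NUM_AF];

/* Legacy view: old code indexes a single array by family; row 0 has
 * exactly that shape. */
static struct fib_head **legacy_view(void)
{
    return fib_tables[0];
}

/* FIB-aware lookup: only IPv4 honours the row, everything else uses row 0. */
static struct fib_head *table_for(int fib, int af)
{
    assert(fib >= 0 && fib < NUM_FIBS && af >= 0 && af < NUM_AF);
    if (af != AF_V4)
        fib = 0;
    return fib_tables[fib][af];
}
```

With this shape, a caller that only ever uses legacy_view()[af] keeps
working unchanged, which is the ABI-preserving property described above.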


The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to 
do the "right thing".
Some new entry points are added for the exclusive use of IPv4 code:
in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
each of which takes an extra argument that refers the code to the correct row.

In addition, there are some new entry points (currently called
dom_rtalloc() and friends) that check the address family being looked up and
either call rtalloc() (and friends) if the protocol is not IPv4, forcing the
action to row 0, or direct it to the appropriate row if it IS IPv4 (and that
info is available). These are for calling from code that is not specific to any
particular protocol. The way these are implemented would change
in the non ABI preserving code to be added later.
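
The dispatch can be pictured with the sketch below. It is an
illustration only: the sketch_ functions, struct lookup and the FAM_
constants are hypothetical stand-ins, and their signatures are NOT
those of the real rtalloc() family or of the patch.

```c
#include <assert.h>

enum { FAM_OTHER = 1, FAM_IPV4 = 2 };   /* hypothetical family codes */

struct lookup {
    int family;     /* address family of the destination */
    int row_used;   /* which FIB row the lookup ended up in */
};

/* Legacy-style entry point: no FIB argument, always row 0. */
static void sketch_rtalloc(struct lookup *l)
{
    l->row_used = 0;
}

/* IPv4-only entry point: the caller supplies the row. */
static void sketch_in_rtalloc(struct lookup *l, int fib)
{
    assert(fib >= 0);
    l->row_used = fib;
}

/* Protocol-independent entry point: inspects the family and dispatches. */
static void sketch_dom_rtalloc(struct lookup *l, int fib)
{
    if (l->family == FAM_IPV4)
        sketch_in_rtalloc(l, fib);
    else
        sketch_rtalloc(l);   /* non-IPv4 is forced to row 0 */
}
```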

One feature of the first version of the code is that for IPv4, the
interface routes show up automatically on all the FIBs, so that
no matter what FIB you select you always have the basic directly attached
hosts available to you. (rtinit() does this automatically).
You CAN delete an interface route from one FIB should you want to,
but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the same
MAC address, regardless of which FIB you are using to get to it.
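
As a rough sketch of that behaviour, assuming a simple presence table
per FIB (NUM_FIBS, MAX_IFROUTES, present[], ifroute_add_all() and
ifroute_del() are hypothetical names, and this is not how rtinit() is
implemented):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_FIBS     4   /* hypothetical compile-time FIB count */
#define MAX_IFROUTES 8   /* hypothetical number of interface routes */

/* present[fib][r] is true when interface route r is installed in that FIB. */
static bool present[NUM_FIBS][MAX_IFROUTES];

/* Installing an interface route populates every FIB. */
static void ifroute_add_all(int r)
{
    assert(r >= 0 && r < MAX_IFROUTES);
    for (int fib = 0; fib < NUM_FIBS; fib++)
        present[fib][r] = true;
}

/* The route can still be removed from one FIB without touching the others. */
static void ifroute_del(int fib, int r)
{
    assert(fib >= 0 && fib < NUM_FIBS && r >= 0 && r < MAX_IFROUTES);
    present[fib][r] = false;
}
```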


This brings us to how the correct FIB is selected for an outgoing
IPv4 packet.

Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
   Such packets select a FIB from a number associated with the 
   socket/PCB. This in turn is inherited from the process,
   but can be changed by a socket option.  The process in turn 
   inherits it on fork. I have written a utility called setfib
   that acts a bit like nice:
      setfib -n 3 ping target.example.com  # will use fib 3 for ping.

2/ packets received on an interface for forwarding.
   By default these packets would use table 0
   (or possibly a number settable in a sysctl, not yet implemented),
   but prior to routing the firewall can inspect them (see below).

3/ packets inspected by a packet classifier, which can arbitrarily
   associate a FIB with each one on a packet by packet basis.
   A FIB assigned to a packet by a packet classifier
   (such as ipfw) would override a FIB associated by
   a more default source (such as cases 1 or 2).
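
The precedence between the three cases can be sketched as follows.
This is an illustration under assumptions: struct pkt, FIB_UNSET and
select_fib() are hypothetical names, and the real code may carry the
FIB number quite differently.

```c
#include <assert.h>

#define FIB_UNSET (-1)   /* hypothetical marker: no FIB chosen by this source */

struct pkt {
    int classifier_fib;  /* set by a packet classifier such as ipfw, else FIB_UNSET */
    int socket_fib;      /* from the sending socket/PCB, else FIB_UNSET (forwarded) */
};

/* Pick the FIB for an outgoing IPv4 packet. */
static int select_fib(const struct pkt *p)
{
    assert(p);
    if (p->classifier_fib != FIB_UNSET)
        return p->classifier_fib;   /* case 3 overrides the others */
    if (p->socket_fib != FIB_UNSET)
        return p->socket_fib;       /* case 1: locally generated */
    return 0;                       /* case 2: forwarded, default table 0 */
}
```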

Routing messages would be associated with their
process, and thus select one FIB or another.
In addition netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)).

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance already provides this functionality
using ipfw fwd, but that method has some drawbacks.

For example, it can't fully simulate a routing table because it can't
influence the socket's choice of local address when a connect() is done.


Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed accordingly.

I have not yet added the changes to ipfw.
pf has some similar changes already but they seem to rely on
the various FIBs having symbolic names, which I do not plan to support
in the first version of these changes.

SCTP, interestingly enough, has built-in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it when
it suddenly actually does something.

I have not redone my testing since my last edits, but will be retesting with the 
current code asap.


Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some rototilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly, especially when one considers that there
is code that makes assumptions about every protocol having the same
internal structures there.  Some protocols don't WANT that
sort of structure (for example, the whole idea of a netmask is foreign
to AppleTalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods thus reached would be called. The methods would have
an argument that gives the FIB number, but the protocol would be free
to ignore it.
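
One possible shape for this, as a sketch only (struct route_methods,
struct domain_sketch, the lookup signature and both example methods
are hypothetical; the real 'domain' structure has other members and
the eventual interface may look different):

```c
#include <assert.h>

struct route_result {
    int fib_used;   /* FIB the protocol actually consulted */
    int found;      /* non-zero when a route was found */
};

/* Hypothetical per-domain routing methods. */
struct route_methods {
    int (*lookup)(void *rdata, int dst, int fib, struct route_result *out);
};

/* Stand-in for a 'domain' entry carrying its own methods and opaque data. */
struct domain_sketch {
    const char          *name;
    struct route_methods rm;
    void                *rdata;   /* format is private to the protocol */
};

/* An IPv4-like protocol honours the FIB number. */
static int v4_lookup(void *rdata, int dst, int fib, struct route_result *out)
{
    (void)rdata; (void)dst;
    out->fib_used = fib;
    out->found = 1;
    return 0;
}

/* A protocol with a single flat table is free to ignore the FIB number. */
static int flat_lookup(void *rdata, int dst, int fib, struct route_result *out)
{
    (void)rdata; (void)dst; (void)fib;
    out->fib_used = 0;
    out->found = 1;
    return 0;
}

/* External code only ever goes through the method pointer. */
static int domain_route_lookup(struct domain_sketch *d, int dst, int fib,
                               struct route_result *out)
{
    assert(d && d->rm.lookup);
    return d->rm.lookup(d->rdata, dst, fib, out);
}
```

Because callers see only the method pointer and an opaque data pointer,
each protocol can keep whatever internal route structure suits it.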

Interaction with the ARP layer/ LL layer would need to be 
revisited as well. Qing Li has been working on this already.


diffs
for those with p4 access:
p4 diff2 -du  //depot/vendor/freebsd/src/sys/...@131121 //depot/user/julian/routing/src/sys/... 

for those with the makediff perl script:
perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 //depot/user/julian/routing/src/sys/... 

for those with neither:

http://people.freebsd.org/~julian/mrt2.diff

I just put the userland utility in usr.sbin/setfib/ in p4,
and the changes to netstat in usr.bin/netstat/

see:
http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO


I'd like to get comments on this (compat) version, so that I can commit it, 
get general testing under way to start the clock for MFC, and then get
moving on the fuller implementation (that breaks ABIs) and other routing issues.


Julian


From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 27 00:26:05 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B339F16A418
	for <arch@freebsd.org>; Thu, 27 Dec 2007 00:26:05 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outV.internet-mail-service.net (outV.internet-mail-service.net
	[216.240.47.245])
	by mx1.freebsd.org (Postfix) with ESMTP id 9926D13C468
	for <arch@freebsd.org>; Thu, 27 Dec 2007 00:26:05 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160)
	by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP;
	Wed, 26 Dec 2007 16:26:04 -0800
Received: from julian-mac.elischer.org (localhost [127.0.0.1])
	by idiom.com (Postfix) with ESMTP id 235D4126D8C;
	Wed, 26 Dec 2007 16:26:04 -0800 (PST)
Message-ID: <4772F123.5030303@elischer.org>
Date: Wed, 26 Dec 2007 16:26:11 -0800
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031)
MIME-Version: 1.0
To: FreeBSD Net <freebsd-net@freebsd.org>, arch@freebsd.org, 
	Robert Watson <rwatson@freebsd.org>, Qing Li <qingli@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: 
Subject: resend: multiple routing table roadmap (format fix)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Dec 2007 00:26:05 -0000

Resending as my mailer made a dog's breakfast of the first one
with all sorts of weird line breaks... hopefully this will be better.
(I haven't sent it yet so I'm hoping)..


-------------------------------------------


One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on, is "policy based routing", which allows
different packet streams to be routed by more than just the destination
address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x), but FreeBSD in general needs it, so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as an EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in the timespan of the branch.

Implementation method, (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x (also in Perforce, though not yet caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (sufficient for my purposes in 6.x) and
implements the changes needed to allow IPv4 to use them. I have not done
the changes for IPv6 simply because I do not need it, and I do not
have enough knowledge of IPv6 (e.g. neighbor discovery) to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X], where for all
protocol families except IPv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.


The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added for the exclusive use of IPv4 code:
in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
each of which takes an extra argument that refers the code to the correct row.

In addition, there are some new entry points (currently called
dom_rtalloc() and friends) that check the address family being
looked up and either call rtalloc() (and friends) if the protocol
is not IPv4, forcing the action to row 0, or direct it to the
appropriate row if it IS IPv4 (and that info is available). These
are for calling from code that is not specific to any particular
protocol. The way these are implemented would change in the non
ABI preserving code to be added later.

One feature of the first version of the code is that for IPv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
directly attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to, but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.


This brings us to how the correct FIB is selected for an outgoing
IPv4 packet.

Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
    Such packets select a FIB from a number associated with the
    socket/PCB. This in turn is inherited from the process,
    but can be changed by a socket option. The process in turn
    inherits it on fork. I have written a utility called setfib
    that acts a bit like nice:

        setfib -n 3 ping target.example.com # will use fib 3 for ping.

2/ packets received on an interface for forwarding.
    By default these packets would use table 0
    (or possibly a number settable in a sysctl, not yet implemented),
    but prior to routing the firewall can inspect them (see below).

3/ packets inspected by a packet classifier, which can arbitrarily
    associate a fib with each one on a packet by packet basis.
    A fib assigned to a packet by a packet classifier
    (such as ipfw) would override a fib associated by
    a more default source (such as cases 1 or 2).
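
The precedence between the three classes can be sketched as follows.
This is only an illustration under stated assumptions: struct ex_packet,
EX_FIB_UNSET and ex_select_fib() are hypothetical names, and the real
code would read these values from the mbuf and the socket/PCB rather
than from a single structure:

```c
#include <assert.h>
#include <stddef.h>

#define EX_FIB_UNSET (-1)   /* hypothetical "no FIB recorded" marker */

struct ex_packet {
	int classifier_fib;  /* set by a packet classifier, else EX_FIB_UNSET */
	int socket_fib;      /* from the originating socket/PCB, else EX_FIB_UNSET */
};

/* Pick the FIB for an outgoing IPv4 packet, most specific source first. */
static int
ex_select_fib(const struct ex_packet *p)
{
	assert(p != NULL);
	if (p->classifier_fib != EX_FIB_UNSET)
		return (p->classifier_fib);   /* class 3 overrides the others */
	if (p->socket_fib != EX_FIB_UNSET)
		return (p->socket_fib);       /* class 1: locally generated */
	return (0);                           /* class 2: forwarded, table 0 */
}
```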

Routing messages would be associated with their
process, and thus select one FIB or another.

In addition, netstat has been edited to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)).

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance already provides this functionality
using ipfw fwd, but that method has some drawbacks.

For example, it can't fully simulate a routing table because it can't
influence the socket's choice of local address when a connect() is done.


Testing during the development of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.

I have not yet added the changes to ipfw.
pf has some similar changes already but they seem to rely on
the various FIBs having symbolic names, which I do not plan to support
in the first version of these changes.

SCTP, interestingly enough, has built-in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.

I have not redone my testing since my last edits, but will be
retesting with the current code asap.


Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly, especially when one considers
that there is code that makes assumptions about every protocol
having the same internal structures there. Some protocols don't WANT
that sort of structure (for example, the whole idea of a netmask is
foreign to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods reached through them would be called. The methods would
have an argument that gives the FIB number, but the protocol would be
free to ignore it.
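
A rough sketch of what per-domain routing methods could look like.
Everything here is an assumption made for illustration: struct
ex_domain, its dom_rtdata and dom_rtlookup members, and the two example
protocols are hypothetical, and the existing 'domain' structure has no
such members today:

```c
#include <assert.h>
#include <stddef.h>

struct ex_route {
	int ifindex;             /* placeholder for a real route entry */
};

/* Hypothetical domain structure carrying its own routing method. */
struct ex_domain {
	int   dom_family;
	void *dom_rtdata;        /* opaque; format is private to the protocol */
	struct ex_route *(*dom_rtlookup)(struct ex_domain *, const void *dst,
	    int fib);
};

/* A protocol that keeps one table per FIB (four in this toy). */
struct ex_inet_data {
	struct ex_route per_fib[4];
};

static struct ex_route *
ex_inet_lookup(struct ex_domain *d, const void *dst, int fib)
{
	struct ex_inet_data *data = d->dom_rtdata;

	(void)dst;               /* a real lookup would search on dst */
	assert(fib >= 0 && fib < 4);
	return (&data->per_fib[fib]);
}

/* A protocol with a single table: it is free to ignore the FIB number. */
static struct ex_route *
ex_flat_lookup(struct ex_domain *d, const void *dst, int fib)
{
	(void)dst;
	(void)fib;
	return (d->dom_rtdata);
}

/* Generic caller: knows nothing about how each protocol stores routes. */
static struct ex_route *
ex_rtlookup(struct ex_domain *domains[], int family, const void *dst, int fib)
{
	struct ex_domain *d = domains[family];

	if (d == NULL || d->dom_rtlookup == NULL)
		return (NULL);
	return (d->dom_rtlookup(d, dst, fib));
}
```

The point of the indirection is that external code only ever sees the
method pointer, so a protocol with no notion of netmasks is not forced
into the same internal layout as IPv4.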

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.


diffs
for those with p4 access:
p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121 \
    //depot/user/julian/routing/src/sys/...

for those with the makediff perl script:
perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 \
    //depot/user/julian/routing/src/sys/...

for those with neither:

http://people.freebsd.org/~julian/mrt2.diff

I just put the userland utility in usr.sbin/setfib/ in p4,
and the changes to netstat in usr.bin/netstat/.

see:
http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO


I'd like to get comments on this (compat) version, so that I can
commit it, get general testing under way to start the clock for MFC,
and then get moving on the fuller implementation (that breaks ABIs)
and other routing issues.


Julian


From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 27 01:53:26 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0937616A41B
	for <arch@freebsd.org>; Thu, 27 Dec 2007 01:53:26 +0000 (UTC)
	(envelope-from ivo.vachkov@gmail.com)
Received: from hs-out-2122.google.com (hs-out-0708.google.com [64.233.178.240])
	by mx1.freebsd.org (Postfix) with ESMTP id A689F13C4DD
	for <arch@freebsd.org>; Thu, 27 Dec 2007 01:53:25 +0000 (UTC)
	(envelope-from ivo.vachkov@gmail.com)
Received: by hs-out-2122.google.com with SMTP id j58so2225139hsj.11
	for <arch@freebsd.org>; Wed, 26 Dec 2007 17:53:25 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	bh=7P9byjzIODk7OjOMT/p8QTPLAQm1t3t9xirlo8LCWbQ=;
	b=PgbD/339ykW1ubzBfVxjCIypmFOJpiEuqCKRGhdWgsw4+US0lzwwBV0DhQVivKkvIZtVJ6eFnugyyE4NuAJzoWI4YwWTKinWwYWo50sCifDyb0/ifRIp0XD1kxj6XbigSIwz0OwRp4YLbDkx2j2jKyU0m6I+Rs/v1zIAQc3/pZI=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	b=jG1MNPOqYXnEan6K0C/XRO7sZWyrhrffthB7BWg91ZXxxcCOUwrQ92wpaUVByN3NI9ROYbPLyfbsz5yu80Soq4JzBAwu/9++82q3m09H0xmcdPj7tHFPm8aU3KrWD5Z7ELHXAACt8viEkrRAkrgmfhyS1ZfbAZkkBaNvbHlq2gE=
Received: by 10.150.229.16 with SMTP id b16mr1961399ybh.115.1198718881884;
	Wed, 26 Dec 2007 17:28:01 -0800 (PST)
Received: by 10.150.204.13 with HTTP; Wed, 26 Dec 2007 17:28:01 -0800 (PST)
Message-ID: <f85d6aa70712261728h331eadb8p205d350dc7fb7f4c@mail.gmail.com>
Date: Thu, 27 Dec 2007 03:28:01 +0200
From: "Ivo Vachkov" <ivo.vachkov@gmail.com>
To: "Julian Elischer" <julian@elischer.org>
In-Reply-To: <4772F123.5030303@elischer.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <4772F123.5030303@elischer.org>
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Robert Watson <rwatson@freebsd.org>,
	Qing Li <qingli@freebsd.org>, arch@freebsd.org
Subject: Re: resend: multiple routing table roadmap (format fix)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Dec 2007 01:53:26 -0000

On Dec 27, 2007 2:26 AM, Julian Elischer <julian@elischer.org> wrote:
> Resending as my mailer made a dog's breakfast of the first one
> with all sorts of weird line breaks... hopefully this will be better.
> (I haven't sent it yet so I'm hoping)..
>
>
> -------------------------------------------
>
>
>
> One thing where FreeBSD has been falling behind, and which by chance I
> have some time to work on is "policy based routing", which allows
> different
> packet streams to be routed by more than just the destination address.
>
> Constraints:
> ------------
>
> I want to make some form of this available in the 6.x tree
> (and by extension 7.x) , but FreeBSD in general needs it so I might as
> well
> do it in -current and back port the portions I need.
>
> One of the ways that this can be done is to have the ability to
> instantiate multiple kernel routing tables (which I will now
> refer to as "Forwarding Information Bases" or "FIBs" for political
> correctness reasons. Which FIB a particular packet uses to make
> the next hop decision can be decided by a number of mechanisms.
> The policies these mechanisms implement are the "Policies" referred
> to in "Policy based routing".
>
> One of the constraints I have if I try to back port this work to
> 6.x is that it must be implemented as a EXTENSION to the existing
> ABIs in 6.x so that third party applications do not need to be
> recompiled in timespan of the branch.
>
> Implementation method, (part 1)
> -------------------------------
> For this reason I have implemented a "sufficient subset" of a
> multiple routing table solution in Perforce, and back-ported it
> to 6.x. (also in Perforce though not yet caught up with what I
> have done in -current/P4). The subset allows a number of FIBs
> to be defined at compile time (sufficient for my purposes in 6.x) and
> implements the changes needed to allow IPV4 to use them. I have not done
> the changes for ipv6 simply because I do not need it, and I do not
> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
>
> Other protocol families are left untouched and should there be
> users with proprietary protocol families, they should continue to work
> and be oblivious to the existence of the extra FIBs.
>
> To understand how this is done, one must know that the current FIB
> code starts everything off with a single dimensional array of
> pointers to FIB head structures (One per protocol family), each of
> which in turn points to the trie of routes available to that family.
>
> The basic change in the ABI compatible version of the change is to
> extend that array to be a 2 dimensional array, so that
> instead of protocol family X looking at rt_tables[X] for the
> table it needs, it looks at rt_tables[Y][X] when for all
> protocol families except ipv4 Y is always 0.
> Code that is unaware of the change always just sees the first row
> of the table, which of course looks just like the one dimensional
> array that existed before.

Pretty much like the OpenBSD approach :)

> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
> are all maintained, but refer only to the first row of the array,
> so that existing callers in proprietary protocols can continue to
> do the "right thing".
> Some new entry points are added, for the exclusive use of ipv4 code
> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
> which have an extra argument which refers the code to the correct row.
>
> In addition, there are some new entry points (currently called
> dom_rtalloc() and friends) that check the Address family being
> looked up and call either rtalloc() (and friends) if the protocol
> is not IPv4 forcing the action to row 0 or to the appropriate row
> if it IS IPv4 (and that info is available). These are for calling
> from code that is not specific to any particular protocol. The way
> these are implemented would change in the non ABI preserving code
> to be added later.
>
> One feature of the first version of the code is that for ipv4,
> the interface routes show up automatically on all the FIBs, so
> that no matter what FIB you select you always have the basic
> direct attached hosts available to you. (rtinit() does this
> automatically).
> You CAN delete an interface route from one FIB should you want
> to but by default it's there. ARP information is also available
> in each FIB. It's assumed that the same machine would have the
> same MAC address, regardless of which FIB you are using to get
> to it.
>
>
> This brings us to how the correct FIB is selected for an outgoing
> IPV4 packet.
>
> Packets fall into one of a number of classes.
> 1/ locally generated packets, coming from a socket/PCB.
>     Such packets select a FIB from a number associated with the
>     socket/PCB. This in turn is inherited from the process,
>     but can be changed by a socket option. The process in turn
>     inherits it on fork. I have written a utility called setfib
>     that acts a bit like nice..
>
>         setfib -n 3 ping target.example.com # will use fib 3 for ping.
>
> 2/ packets received on an interface for forwarding.
>     By default these packets would use table 0,
>     (or possibly a number settable in a sysctl(not yet)).
>     but prior to routing the firewall can inspect them (see below).
>
> 3/ packets inspected by a packet classifier, which can arbitrarily
>     associate a fib with it on a packet by packet basis.
>     A fib assigned to a packet by a packet classifier
>     (such as ipfw) would over-ride a fib associated by
>     a more default source. (such as cases 1 or 2).

For the 2/ and 3/ cases I added (in some personal work I've been doing
lately) an additional field in struct mbuf, which can be set by a packet
filter or other application upon receipt and which points to the right
table to use for the lookup. This way a simple "marking" can be used
to divide different flows and create policy based routing.

> Routing messages would be associated with their
> process, and thus select one FIB or another.
>
> In addition Netstat has been edited to be able to cope with the
> fact that the array is now 2 dimensional. (It looks in system
> memory using libkvm (!)).
>
> In addition two sysctls are added to give:
> a) the number of FIBs compiled in (active)
> b) the default FIB of the calling process.
>
> Early testing experience:
> -------------------------
>
> Basically our (IronPort's) appliance does this functionality already
> using ipfw fwd but that method has some drawbacks.
>
> For example,
> It can't fully simulate a routing table because it can't influence the
> socket's choice of local address when a connect() is done.
>
>
> Testing during the generating of these changes has been
> remarkably smooth so far. Multiple tables have co-existed
> with no notable side effects, and packets have been routed
> accordingly.
>
> I have not yet added the changes to ipfw.
> pf has some similar changes already but they seem to rely on
> the various FIBs having symbolic names. Which I do not plan to support
> in the first version of these changes.
>
> SCTP has interestingly enough built in support for this, called VRFs
> in Cisco parlance. it will be interesting to see how that handles it
> when it suddenly actually does something.
>
> I have not redone my testing since my last edits, but will be
> retesting with the current code asap.
>
>
> Where to next:
> --------------------
>
> After committing the ABI compatible version and MFCing it, I'd
> like to proceed in a forward direction in -current. this will
> result in some roto-tilling in the routing code.
>
> Firstly: the current code's idea of having a separate tree per
> protocol family, all of the same format, and pointed to by the
> 1 dimensional array is a bit silly. Especially when one considers that
> there
> is code that makes assumptions about every protocol having the same
> internal structures there. Some protocols don't WANT that
> sort of structure. (for example the whole idea of a netmask is foreign
> to appletalk). This needs to be made opaque to the external code.
>
> My suggested first change is to add routing method pointers to the
> 'domain' structure, along with information pointing the data.
> instead of having an array of pointers to uniform structures,
> there would be an array pointing to the 'domain' structures
> for each protocol address domain (protocol family),
> and the methods this reached would be called. The methods would have
> an argument that gives FIB number, but the protocol would be free
> to ignore it.
>
> Interaction with the ARP layer/ LL layer would need to be
> revisited as well. Qing Li has been working on this already.
>
>
> diffs
> for those with p4 access:
> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
> //depot/user/julian/routing/src/sys/...
>
> for those with the makediff perl script:
> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121
> //depot/user/julian/routing/src/sys/...
>
> for those with neither:
>
> http://people.freebsd.org/~julian/mrt2.diff
>
> I just put the userland utility in usr.sbin/setfib/ in p4.
> and changes to netstat in usr.bin/netstat/
>
> see:
> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
>
>
>
>
> I'd like to get comments on this (compat) version, so that I can
> commit it,
> get general testing under way to start the clock for MFC, and then get
> moving on the fuller implementation (that breaks ABIs) and other
> routing issues.
>
>
> Julian

From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 27 15:55:10 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5F7DE16A418
	for <arch@FreeBSD.org>; Thu, 27 Dec 2007 15:55:10 +0000 (UTC)
	(envelope-from rdivacky@vlk.vlakno.cz)
Received: from vlakno.cz (vlk.vlakno.cz [62.168.28.247])
	by mx1.freebsd.org (Postfix) with ESMTP id 0D96F13C45B
	for <arch@FreeBSD.org>; Thu, 27 Dec 2007 15:55:09 +0000 (UTC)
	(envelope-from rdivacky@vlk.vlakno.cz)
Received: from localhost (localhost [127.0.0.1])
	by vlakno.cz (Postfix) with ESMTP id E849F66B003;
	Thu, 27 Dec 2007 16:55:07 +0100 (CET)
X-Virus-Scanned: amavisd-new at vlakno.cz
Received: from vlakno.cz ([127.0.0.1])
	by localhost (vlk.vlakno.cz [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id 566F08Db+cHw; Thu, 27 Dec 2007 16:54:55 +0100 (CET)
Received: from vlk.vlakno.cz (localhost [127.0.0.1])
	by vlakno.cz (Postfix) with ESMTP id C8F6D66AFFF;
	Thu, 27 Dec 2007 16:54:55 +0100 (CET)
Received: (from rdivacky@localhost)
	by vlk.vlakno.cz (8.13.8/8.13.8/Submit) id lBRFstlR023615;
	Thu, 27 Dec 2007 16:54:55 +0100 (CET) (envelope-from rdivacky)
Date: Thu, 27 Dec 2007 16:54:55 +0100
From: Roman Divacky <rdivacky@FreeBSD.org>
To: John Baldwin <jhb@FreeBSD.org>
Message-ID: <20071227155455.GA23604@freebsd.org>
References: <20071218092222.GA9695@freebsd.org>
	<200712201138.56423.jhb@freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200712201138.56423.jhb@freebsd.org>
User-Agent: Mutt/1.4.2.3i
Cc: arch@FreeBSD.org, freebsd-arch@FreeBSD.org
Subject: Re: final decision about *at syscalls
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Dec 2007 15:55:10 -0000

> Considering Robert's paper on security race problems in things like systrace
> stemming from when you copy parameters out of userland and into the kernel
> multiple times, I think #2 is definitely the better choice.  Also, namei() is
> already thread aware AFAICT since 'struct componentname' already contains a
> 'cnp_thread' member (was 'cnp_proc' in 4.x).

two strong voices for #2, I am going that way...

thnx

From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 27 21:19:03 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1ECA716A421
	for <arch@freebsd.org>; Thu, 27 Dec 2007 21:19:03 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outK.internet-mail-service.net (outK.internet-mail-service.net
	[216.240.47.234])
	by mx1.freebsd.org (Postfix) with ESMTP id 930D013C468
	for <arch@freebsd.org>; Thu, 27 Dec 2007 21:19:02 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160)
	by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP;
	Thu, 27 Dec 2007 13:19:01 -0800
Received: from julian-mac.elischer.org (localhost [127.0.0.1])
	by idiom.com (Postfix) with ESMTP id 75B50126D9D;
	Thu, 27 Dec 2007 13:19:00 -0800 (PST)
Message-ID: <477416CC.4090906@elischer.org>
Date: Thu, 27 Dec 2007 13:19:08 -0800
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031)
MIME-Version: 1.0
To: Ivo Vachkov <ivo.vachkov@gmail.com>
References: <4772F123.5030303@elischer.org>
	<f85d6aa70712261728h331eadb8p205d350dc7fb7f4c@mail.gmail.com>
In-Reply-To: <f85d6aa70712261728h331eadb8p205d350dc7fb7f4c@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Robert Watson <rwatson@freebsd.org>,
	Qing Li <qingli@freebsd.org>, arch@freebsd.org
Subject: Re: resend: multiple routing table roadmap (format fix)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Dec 2007 21:19:03 -0000

Ivo Vachkov wrote:
> On Dec 27, 2007 2:26 AM, Julian Elischer <julian@elischer.org> wrote:
>> Resending as my mailer made a dog's breakfast of the first one
>> with all sorts of weird line breaks... hopefully this will be better.
>> (I haven't sent it yet so I'm hoping)..
>>
>>
>> -------------------------------------------
>>
>>
>>
>> One thing where FreeBSD has been falling behind, and which by chance I
>> have some time to work on is "policy based routing", which allows
>> different
>> packet streams to be routed by more than just the destination address.
>>
>> Constraints:
>> ------------
>>
>> I want to make some form of this available in the 6.x tree
>> (and by extension 7.x) , but FreeBSD in general needs it so I might as
>> well
>> do it in -current and back port the portions I need.
>>
>> One of the ways that this can be done is to have the ability to
>> instantiate multiple kernel routing tables (which I will now
>> refer to as "Forwarding Information Bases" or "FIBs" for political
>> correctness reasons. Which FIB a particular packet uses to make
>> the next hop decision can be decided by a number of mechanisms.
>> The policies these mechanisms implement are the "Policies" referred
>> to in "Policy based routing".
>>
>> One of the constraints I have if I try to back port this work to
>> 6.x is that it must be implemented as a EXTENSION to the existing
>> ABIs in 6.x so that third party applications do not need to be
>> recompiled in timespan of the branch.
>>
>> Implementation method, (part 1)
>> -------------------------------
>> For this reason I have implemented a "sufficient subset" of a
>> multiple routing table solution in Perforce, and back-ported it
>> to 6.x. (also in Perforce though not yet caught up with what I
>> have done in -current/P4). The subset allows a number of FIBs
>> to be defined at compile time (sufficient for my purposes in 6.x) and
>> implements the changes needed to allow IPV4 to use them. I have not done
>> the changes for ipv6 simply because I do not need it, and I do not
>> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

By the way, I might add that in the 6.x compat. version I may end up
limiting the feature to 8 tables. This is because I need to store some
stuff in an efficient way in the mbuf, and in a compatible manner this
is easiest done by stealing the top 4 bits in the mbuf flags word
and defining them as:

  #define M_HAVEFIB	0x10000000	/* a FIB number has been set */
  #define M_FIBMASK	0x07		/* mask once shifted down */
  #define M_FIBNUM	0xe0000000	/* the three bits holding the number */
  #define M_FIBSHIFT	29
  #define m_getfib(_m, _default) (((_m)->m_flags & M_HAVEFIB) ? \
    (((_m)->m_flags >> M_FIBSHIFT) & M_FIBMASK) : (_default))
  #define M_SETFIB(_m, _fib) do { \
    (_m)->m_flags &= ~M_FIBNUM; \
    (_m)->m_flags |= (M_HAVEFIB | (((_fib) & M_FIBMASK) << M_FIBSHIFT)); \
  } while (0)

This then becomes very easy to change to use a tag or
whatever is needed in later versions, and the number can
be expanded past 8 predefined FIBs at that time.
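
To make the bit layout concrete, here is a small stand-alone sketch of
the same encoding written as functions over a mock structure. The
names struct ex_mbuf, ex_setfib() and ex_getfib() and the EX_ prefixed
constants are hypothetical, and the flags word is declared unsigned here
purely to keep the shifts well defined in a user-space test:

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the one field that matters; not the real struct mbuf. */
struct ex_mbuf {
	uint32_t m_flags;
};

#define EX_M_HAVEFIB   0x10000000u  /* bit 28: a FIB number is present */
#define EX_M_FIBNUM    0xe0000000u  /* bits 29-31: the FIB number itself */
#define EX_M_FIBMASK   0x07u        /* mask after shifting down */
#define EX_M_FIBSHIFT  29

/* Store a FIB number (0-7) without disturbing the lower 28 flag bits. */
static void
ex_setfib(struct ex_mbuf *m, unsigned fib)
{
	assert(fib <= EX_M_FIBMASK);
	m->m_flags &= ~EX_M_FIBNUM;
	m->m_flags |= EX_M_HAVEFIB | ((fib & EX_M_FIBMASK) << EX_M_FIBSHIFT);
}

/* Return the stored FIB number, or dflt if none was ever set. */
static unsigned
ex_getfib(const struct ex_mbuf *m, unsigned dflt)
{
	if ((m->m_flags & EX_M_HAVEFIB) == 0)
		return (dflt);
	return ((m->m_flags >> EX_M_FIBSHIFT) & EX_M_FIBMASK);
}
```

The separate "have FIB" bit is what lets an explicit FIB 0 be told
apart from a packet that was never marked, which is why four bits are
needed for eight tables.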

>>
>> Other protocol families are left untouched and should there be
>> users with proprietary protocol families, they should continue to work
>> and be oblivious to the existence of the extra FIBs.
>>
>> To understand how this is done, one must know that the current FIB
>> code starts everything off with a single dimensional array of
>> pointers to FIB head structures (One per protocol family), each of
>> which in turn points to the trie of routes available to that family.
>>
>> The basic change in the ABI compatible version of the change is to
>> extend that array to be a 2 dimensional array, so that
>> instead of protocol family X looking at rt_tables[X] for the
>> table it needs, it looks at rt_tables[Y][X] when for all
>> protocol families except ipv4 Y is always 0.
>> Code that is unaware of the change always just sees the first row
>> of the table, which of course looks just like the one dimensional
>> array that existed before.
> 
> Pretty much like the OpenBSD approach :)

Well, I did look at the code briefly, but I didn't base mine on it.

> 
>> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
>> are all maintained, but refer only to the first row of the array,
>> so that existing callers in proprietary protocols can continue to
>> do the "right thing".
>> Some new entry points are added, for the exclusive use of ipv4 code
>> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
>> which have an extra argument which refers the code to the correct row.
>>
>> In addition, there are some new entry points (currently called
>> dom_rtalloc() and friends) that check the Address family being
>> looked up and call either rtalloc() (and friends) if the protocol
>> is not IPv4 forcing the action to row 0 or to the appropriate row
>> if it IS IPv4 (and that info is available). These are for calling
>> from code that is not specific to any particular protocol. The way
>> these are implemented would change in the non ABI preserving code
>> to be added later.
>>
>> One feature of the first version of the code is that for ipv4,
>> the interface routes show up automatically on all the FIBs, so
>> that no matter what FIB you select you always have the basic
>> direct attached hosts available to you. (rtinit() does this
>> automatically).
>> You CAN delete an interface route from one FIB should you want
>> to but by default it's there. ARP information is also available
>> in each FIB. It's assumed that the same machine would have the
>> same MAC address, regardless of which FIB you are using to get
>> to it.
>>
>>
>> This brings us to how the correct FIB is selected for an outgoing
>> IPV4 packet.
>>
>> Packets fall into one of a number of classes.
>> 1/ locally generated packets, coming from a socket/PCB.
>>     Such packets select a FIB from a number associated with the
>>     socket/PCB. This in turn is inherited from the process,
>>     but can be changed by a socket option. The process in turn
>>     inherits it on fork. I have written a utility called setfib
>>     that acts a bit like nice..
>>
>>         setfib -n 3 ping target.example.com # will use fib 3 for ping.
>>
>> 2/ packets received on an interface for forwarding.
>>     By default these packets would use table 0,
>>     (or possibly a number settable in a sysctl(not yet)).
>>     but prior to routing the firewall can inspect them (see below).
>>
>> 3/ packets inspected by a packet classifier, which can arbitrarily
>>     associate a fib with it on a packet by packet basis.
>>     A fib assigned to a packet by a packet classifier
>>     (such as ipfw) would over-ride a fib associated by
>>     a more default source. (such as cases 1 or 2).
> 
> For the 2/ and 3/ cases I added (in a personal work i've been doing
> lately) additional field in struct mbuf which can be set by a packet
> filter or other application upon receiving which points the right
> table to use for the lookup. This way a simple "marking" can be used
> to divide different flows and create policy based routing.

This would be the final way, but I really want to minimise problems
in the compat versions, so I'll avoid doing that for now.

Do you have this work available?
And have you looked at my diffs below?

> 
>> Routing messages would be associated with their
>> process, and thus select one FIB or another.
>>
>> In addition Netstat has been edited to be able to cope with the
>> fact that the array is now 2 dimensional. (It looks in system
>> memory using libkvm (!)).
>>
>> In addition two sysctls are added to give:
>> a) the number of FIBs compiled in (active)
>> b) the default FIB of the calling process.
>>
>> Early testing experience:
>> -------------------------
>>
>> Basically our (IronPort's) appliance does this functionality already
>> using ipfw fwd but that method has some drawbacks.
>>
>> For example,
>> It can't fully simulate a routing table because it can't influence the
>> socket's choice of local address when a connect() is done.
>>
>>
>> Testing during the generating of these changes has been
>> remarkably smooth so far. Multiple tables have co-existed
>> with no notable side effects, and packets have been routed
>> accordingly.
>>
>> I have not yet added the changes to ipfw.
>> pf has some similar changes already but they seem to rely on
>> the various FIBs having symbolic names, which I do not plan to support
>> in the first version of these changes.
>>
>> SCTP, interestingly enough, has built-in support for this, called VRFs
>> in Cisco parlance. It will be interesting to see how that handles it
>> when it suddenly actually does something.
>>
>> I have not redone my testing since my last edits, but will be
>> retesting with the current code asap.
>>
>>
>> Where to next:
>> --------------------
>>
>> After committing the ABI compatible version and MFCing it, I'd
>> like to proceed in a forward direction in -current. This will
>> result in some roto-tilling in the routing code.
>>
>> Firstly: the current code's idea of having a separate tree per
>> protocol family, all of the same format, and pointed to by the
>> 1 dimensional array is a bit silly. Especially when one considers that
>> there is code that makes assumptions about every protocol having the
>> same internal structures there. Some protocols don't WANT that
>> sort of structure. (for example the whole idea of a netmask is foreign
>> to appletalk). This needs to be made opaque to the external code.
>>
>> My suggested first change is to add routing method pointers to the
>> 'domain' structure, along with information pointing to the data.
>> Instead of having an array of pointers to uniform structures,
>> there would be an array pointing to the 'domain' structures
>> for each protocol address domain (protocol family),
>> and the methods reached this way would be called. The methods would have
>> an argument that gives the FIB number, but the protocol would be free
>> to ignore it.
>>
>> Interaction with the ARP layer/ LL layer would need to be
>> revisited as well. Qing Li has been working on this already.
>>
>>
>> diffs
>> for those with p4 access:
>> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
>> //depot/user/julian/routing/src/sys/...
>>
>> for those with the makediff perl script:
>> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121
>> //depot/user/julian/routing/src/sys/...
>>
>> for those with neither:
>>
>> http://people.freebsd.org/~julian/mrt2.diff
>>
>> I just put the userland utility in usr.sbin/setfib/ in p4.
>> and changes to netstat in usr.bin/netstat/
>>
>> see:
>> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
>>
>>
>>
>>
>> I'd like to get comments on this (compat) version, so that I can
>> commit it,
>> get general testing under way to start the clock for MFC, and then get
>> moving on the fuller implementation (that breaks ABIs) and other
>> routing issues.
>>
>>
>> Julian
>>
>>
>>
>>
>> _______________________________________________
>> freebsd-arch@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
>> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
>>


From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 27 23:21:50 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1149F16A419
	for <arch@FreeBSD.org>; Thu, 27 Dec 2007 23:21:50 +0000 (UTC)
	(envelope-from jhb@FreeBSD.org)
Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219])
	by mx1.freebsd.org (Postfix) with ESMTP id C183113C43E
	for <arch@FreeBSD.org>; Thu, 27 Dec 2007 23:21:49 +0000 (UTC)
	(envelope-from jhb@FreeBSD.org)
Received: from server.baldwin.cx (unverified [66.23.211.162]) 
	by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226289009-1834499 
	for <arch@FreeBSD.org>; Thu, 27 Dec 2007 18:24:01 -0500
Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBRNLf8G054109
	for <arch@FreeBSD.org>; Thu, 27 Dec 2007 18:21:43 -0500 (EST)
	(envelope-from jhb@FreeBSD.org)
From: John Baldwin <jhb@FreeBSD.org>
To: arch@FreeBSD.org
Date: Thu, 27 Dec 2007 17:04:44 -0500
User-Agent: KMail/1.9.6
MIME-Version: 1.0
Content-Type: text/plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712271704.44796.jhb@FreeBSD.org>
X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by
	milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]);
	Thu, 27 Dec 2007 18:21:43 -0500 (EST)
X-Virus-Scanned: ClamAV 0.91.2/5270/Thu Dec 27 12:48:18 2007 on
	server.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.3
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx
Cc: 
Subject: kernel features MIB
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Dec 2007 23:21:50 -0000

One of the things we have at work is a kern.features sysctl MIB that contains 
nodes to indicate if a named feature is present.  For example, on i386 we 
have kern.features.pae, and we use that sysctl to auto-enable -DPAE for kernel 
modules if the currently running kernel is using PAE.

One of the patches I want to commit soon is support for handling 
shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the 
long-heralded memfd stuff).  I would like to have the sysctl MIB so that 
libc's for older releases (e.g. libc.so.6) could use the syscalls if they are 
available so that shm segments are shared between compat apps (e.g. 4.x or 
6.x) and up-to-date apps.

At work we don't have a pretty API for this at all, but I'm thinking for 
FreeBSD we can do this:

FEATURE(foo, "description of foo")

which is a macro to create the 'kern.features.foo' node and set it to 1.  Then 
we could have a routine in libc:

int	feature_present(const char *name);

That returns a boolean to indicate if a given feature is present or not by 
invoking sysctlbyname(3), etc.
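
A minimal userland sketch of what such a routine might look like,
assuming the node is an int that reads as non-zero when the feature is
present.  The names feature_mib_name() and feature_present_sketch() are
hypothetical, as is the 128-byte name buffer; only sysctlbyname(3) is an
existing interface.  On systems other than FreeBSD the sketch simply
reports the feature as absent.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif

/* Build "kern.features.<name>" in buf; 0 on success, -1 if it won't fit. */
int
feature_mib_name(char *buf, size_t buflen, const char *name)
{
	int n;

	n = snprintf(buf, buflen, "kern.features.%s", name);
	return ((n > 0 && (size_t)n < buflen) ? 0 : -1);
}

/* Return 1 if the named feature is present, 0 otherwise (or on error). */
int
feature_present_sketch(const char *name)
{
	char mib[128];	/* assumption: feature names are short */

	if (feature_mib_name(mib, sizeof(mib), name) != 0)
		return (0);
#if defined(__FreeBSD__)
	{
		int value = 0;
		size_t len = sizeof(value);

		/* A missing node just means the feature is not there. */
		if (sysctlbyname(mib, &value, &len, NULL, 0) != 0)
			return (0);
		if (len != sizeof(value))
			return (0);
		return (value != 0);
	}
#else
	return (0);	/* no kern.features MIB to ask on this system */
#endif
}
```

An older libc could then test something like
feature_present_sketch("foo") once and fall back to its previous
behaviour when the answer is 0.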

Any objections to the idea?

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 27 23:21:54 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B585D16A420;
	Thu, 27 Dec 2007 23:21:54 +0000 (UTC) (envelope-from jhb@FreeBSD.org)
Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219])
	by mx1.freebsd.org (Postfix) with ESMTP id 2C29D13C45B;
	Thu, 27 Dec 2007 23:21:54 +0000 (UTC) (envelope-from jhb@FreeBSD.org)
Received: from server.baldwin.cx (unverified [66.23.211.162]) 
	by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226289019-1834499 
	for multiple; Thu, 27 Dec 2007 18:24:05 -0500
Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBRNLf8H054109;
	Thu, 27 Dec 2007 18:21:47 -0500 (EST) (envelope-from jhb@FreeBSD.org)
From: John Baldwin <jhb@FreeBSD.org>
To: freebsd-arch@FreeBSD.org
Date: Thu, 27 Dec 2007 18:05:40 -0500
User-Agent: KMail/1.9.6
References: <18378.1196596684@critter.freebsd.dk>
	<4752AABE.6090006@freebsd.org>
In-Reply-To: <4752AABE.6090006@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712271805.40972.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by
	milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]);
	Thu, 27 Dec 2007 18:21:48 -0500 (EST)
X-Virus-Scanned: ClamAV 0.91.2/5270/Thu Dec 27 12:48:18 2007 on
	server.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.3
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx
Cc: Attilio Rao <attilio@FreeBSD.org>, arch@FreeBSD.org,
	Poul-Henning Kamp <phk@phk.freebsd.dk>,
	Robert Watson <rwatson@FreeBSD.org>, Andre Oppermann <andre@FreeBSD.org>
Subject: Re: New "timeout" api, to replace callout
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Dec 2007 23:21:54 -0000

On Sunday 02 December 2007 07:53:18 am Andre Oppermann wrote:
> Poul-Henning Kamp wrote:
> > In message <4752998A.9030007@freebsd.org>, Andre Oppermann writes:
> >>  o TCP puts the timer into an allocated structure and upon close of the
> >>    session it has to be deallocated including stopping of all currently
> >>    running timers.
> >>    [...]
> >>     -> The timer facility should provide an atomic stop/remove call
> >>        that prevents any further callbacks upon return.  It should not
> >>        do a 'drain' where the callback may be run anyway.
> >>        Note: We hold the lock the callback would have to obtain.
> > 
> > It is my intent, that the implementation behind the new API will
> > only ever grab the specified lock when it calls the timeout function.
> 
> This is the same for the current one and pretty much a given.
> 
> > When you do a timeout_disable() or timeout_cleanup() you will be
> > sleeping on a mutex internal to the implementation, if the timeout
> > is currently executing.
> 
> This is the problematic part.  We can't sleep in TCP when cleaning up
> the timer.  We're not always called from userland but from interrupt
> context.  And when calling the cleanup we currently hold the lock the
> callout wants to obtain.  We can't drop it either as the race would
> be back again.  What you describe here is the equivalent of
> callout_drain().  This is unfortunately unworkable in TCP's context.  The
> callout has to go away even if it is already pending and waiting on
> the lock.  Maybe that can only be solved by a flag in the lock saying
> "give up and go away".

The reason you need to do a drain is to allow for safe destroying of the lock.  
Specifically, drivers tend to do this:

	FOO_LOCK(sc);
	...
	callout_stop(...);
	FOO_UNLOCK(sc);
	...
	callout_drain(...);
	...
	mtx_destroy(&sc->foo_mtx);

If softclock is trying to acquire the backing mutex while you have it held 
(before the callout_stop), then Bad Things can happen if you don't do the 
drain.  Having the lock just "give up" doesn't work either, because if the 
memory containing the lock is freed and reinitialized such that it looks 
enough like a valid lock, then softclock (or its equivalent) will still try 
to obtain it.  Also, you need to do a drain so that it is safe to free the 
callout structure.  Otherwise the structure can be recycled and rescheduled 
while the timer code still thinks it has a pending stop for that pointer, and 
so it aborts the wrong instance of the timer, etc.
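
The ordering can be illustrated with a userland analogy using pthreads.
This is not the kernel callout API: struct sketch_timer and the
sketch_*() functions are hypothetical, the timer thread stands in for
softclock, and pthread_join() plays the role of the drain.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

struct sketch_timer {
	pthread_mutex_t lock;	/* plays the role of sc->foo_mtx */
	pthread_t thread;	/* plays the role of softclock */
	bool stopped;		/* protected by lock */
	int fired;		/* protected by lock */
};

/* Timer thread: takes the lock before every "callback", like softclock. */
static void *
sketch_softclock(void *arg)
{
	struct sketch_timer *t = arg;
	struct timespec tick = { 0, 1000000 };	/* 1 ms */

	for (;;) {
		nanosleep(&tick, NULL);
		pthread_mutex_lock(&t->lock);	/* may block on the owner */
		if (t->stopped) {
			pthread_mutex_unlock(&t->lock);
			return (NULL);
		}
		t->fired++;			/* the "callback" body */
		pthread_mutex_unlock(&t->lock);
	}
}

int
sketch_timer_start(struct sketch_timer *t)
{
	pthread_mutex_init(&t->lock, NULL);
	t->stopped = false;
	t->fired = 0;
	return (pthread_create(&t->thread, NULL, sketch_softclock, t));
}

/* Like stop: once this returns, the callback body will not run again. */
void
sketch_timer_stop(struct sketch_timer *t)
{
	pthread_mutex_lock(&t->lock);
	t->stopped = true;
	pthread_mutex_unlock(&t->lock);
}

/*
 * Like drain: the timer thread may still be blocked on, or about to
 * release, the lock.  Wait for it to finish before the lock goes away.
 * Must be called without the lock held.
 */
int
sketch_timer_destroy(struct sketch_timer *t)
{
	sketch_timer_stop(t);
	pthread_join(t->thread, NULL);
	return (pthread_mutex_destroy(&t->lock));
}
```

Dropping the pthread_join() would let the timer thread lock or unlock a
mutex that has already been destroyed, which is the userland version of
the problem described above.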

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 27 23:21:56 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B7D4916A47B
	for <freebsd-arch@FreeBSD.org>; Thu, 27 Dec 2007 23:21:56 +0000 (UTC)
	(envelope-from jhb@FreeBSD.org)
Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219])
	by mx1.freebsd.org (Postfix) with ESMTP id 7056E13C468
	for <freebsd-arch@FreeBSD.org>; Thu, 27 Dec 2007 23:21:56 +0000 (UTC)
	(envelope-from jhb@FreeBSD.org)
Received: from server.baldwin.cx (unverified [66.23.211.162]) 
	by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226289028-1834499 
	for multiple; Thu, 27 Dec 2007 18:24:08 -0500
Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBRNLf8I054109;
	Thu, 27 Dec 2007 18:21:53 -0500 (EST) (envelope-from jhb@FreeBSD.org)
From: John Baldwin <jhb@FreeBSD.org>
To: freebsd-arch@FreeBSD.org
Date: Thu, 27 Dec 2007 18:17:28 -0500
User-Agent: KMail/1.9.6
References: <15391.1196547545@critter.freebsd.dk>
In-Reply-To: <15391.1196547545@critter.freebsd.dk>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712271817.28789.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by
	milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]);
	Thu, 27 Dec 2007 18:21:53 -0500 (EST)
X-Virus-Scanned: ClamAV 0.91.2/5270/Thu Dec 27 12:48:18 2007 on
	server.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.3
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>
Subject: Re: New "timeout" api, to replace callout
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Dec 2007 23:21:56 -0000

On Saturday 01 December 2007 05:19:05 pm Poul-Henning Kamp wrote:
> 
> Here is my proposed new timeout API for 8.x.
> 
> The primary objective is to make it possible to have multiple timeout
> "providers" of possibly different kind, so that we can have per-cpu
> or per-net-stack timeout handling.
> 
> A secondary goal, is to shove the anti-race handling in destruction of
> timeouts back into the implementation, rather than force users to spend
> 20+ lines doing that.

I don't see this anymore.  Perhaps you haven't looked at updated drivers 
recently?  Right now it looks like this:

foo_attach/create()
{
	mtx_init(&foo->lock, ...);
	callout_init_mtx(&foo->callout, &foo->lock, 0);
}

foo_something()
{
	callout_reset(&foo->callout, ..., foo_timer, ...);
}

/* Called with lock held */
foo_timer()
{
	/*
	 * Doesn't have to check 'is detaching' or any other such crap
	 * anymore.
	 */
}

foo_stop()
{
	FOO_LOCK();
	callout_stop(&foo->callout);

	/* foo_timer() will no longer run after this point. */
	FOO_UNLOCK();
}

foo_detach/destroy()
{
	foo_stop();

	/*
	 * This drain ensures softclock() is done frobbing with our mutex
	 * so we can safely destroy it.  Also makes sure it has no references
	 * to our callout structure either.
	 */
	callout_drain(&foo->callout);
	mtx_destroy(&foo->lock);
}

That's not 20 lines.  You have to do the reset/stop anyway and those now work 
intuitively.  The only "extra" code is an init routine (which you will need 
anyway) and a teardown routine (callout_drain()).  From what I can tell, 
you've basically mandated a lock and when you use callout_init_mtx() (or now 
callout_init_rw()), callout_stop() == timeout_safe() and callout_drain() == 
timeout_cleanup().

Thus, as far as the MPSAFEty stuff, I think the timeout changes are just 
reshuffling deck chairs.  The other goals (axeing hz) I agree with, but I 
don't think you've changed anything as far as MPSAFEty is concerned.  Also, 
I'd probably find timeout_stop() more intuitive than timeout_safe() to be 
honest.  Maybe timeout_disarm()?

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 02:26:05 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C27FA16A46E
	for <freebsd-arch@freebsd.org>; Fri, 28 Dec 2007 02:26:05 +0000 (UTC)
	(envelope-from peterjeremy@optushome.com.au)
Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au
	[211.29.132.186])
	by mx1.freebsd.org (Postfix) with ESMTP id 405F913C469
	for <freebsd-arch@freebsd.org>; Fri, 28 Dec 2007 02:26:05 +0000 (UTC)
	(envelope-from peterjeremy@optushome.com.au)
Received: from server.vk2pj.dyndns.org
	(c220-239-20-82.belrs4.nsw.optusnet.com.au [220.239.20.82])
	by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	lBS2Pw0n032333
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Fri, 28 Dec 2007 13:25:58 +1100
Received: from server.vk2pj.dyndns.org (localhost.vk2pj.dyndns.org [127.0.0.1])
	by server.vk2pj.dyndns.org (8.14.2/8.14.1) with ESMTP id lBS2PvwK048587;
	Fri, 28 Dec 2007 13:25:57 +1100 (EST)
	(envelope-from peter@server.vk2pj.dyndns.org)
Received: (from peter@localhost)
	by server.vk2pj.dyndns.org (8.14.2/8.14.2/Submit) id lBS2Pv0k048586;
	Fri, 28 Dec 2007 13:25:57 +1100 (EST) (envelope-from peter)
Date: Fri, 28 Dec 2007 13:25:57 +1100
From: Peter Jeremy <peterjeremy@optushome.com.au>
To: "Aryeh M. Friedman" <aryeh.friedman@gmail.com>
Message-ID: <20071228022557.GT40785@server.vk2pj.dyndns.org>
References: <4772A742.4050106@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="Pk6IbRAofICFmK5e"
Content-Disposition: inline
In-Reply-To: <4772A742.4050106@gmail.com>
X-PGP-Key: http://members.optusnet.com.au/peterjeremy/pubkey.asc
User-Agent: Mutt/1.5.17 (2007-11-01)
Cc: freebsd-arch@freebsd.org
Subject: Re: Adding better database support to the base system
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 02:26:05 -0000


--Pk6IbRAofICFmK5e
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Dec 26, 2007 at 02:10:58PM -0500, Aryeh M. Friedman wrote:
>Currently the only available DB support in the base system is Berkeley
>DB (1.x).  There are several items that would benefit from migrating
>something like minisql into the base system.   The most immediate
>application that comes to mind is enabling some interesting features
>for the ports system.  Therefore I propose migrating some minimal
>RDBMS features into the base system.

Firstly, minisql (mSQL) is a non-starter because of its license.

Secondly, you haven't provided any justification for the inclusion of
an RDBMS in the base system.  In general, the base system only
contains tools necessary to build and manage the base system.  In
order to add this to the base system, you need to justify both why an
RDBMS is needed and why it can't be a port.

Thirdly, this topic has been thrashed out recently and I suggest you
review that thread before continuing.

--=20
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.

--Pk6IbRAofICFmK5e
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)

iD8DBQFHdF61/opHv/APuIcRAuIPAKCUNpZ0iDJpAzb1t6dyrfF+YfrO2QCgvpf4
R5NwBC4IT+QafB4Y5m6nfqE=
=rGAj
-----END PGP SIGNATURE-----

--Pk6IbRAofICFmK5e--

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 05:31:02 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 733A416A418;
	Fri, 28 Dec 2007 05:31:02 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 5F04D13C457;
	Fri, 28 Dec 2007 05:31:02 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id 89C8E1A4D7C; Thu, 27 Dec 2007 21:29:06 -0800 (PST)
Date: Thu, 27 Dec 2007 21:29:06 -0800
From: Alfred Perlstein <alfred@freebsd.org>
To: John Baldwin <jhb@FreeBSD.org>
Message-ID: <20071228052906.GP16982@elvis.mu.org>
References: <200712271704.44796.jhb@FreeBSD.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200712271704.44796.jhb@FreeBSD.org>
User-Agent: Mutt/1.4.2.3i
Cc: arch@FreeBSD.org
Subject: Re: kernel features MIB
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 05:31:02 -0000

Sounds pretty rad.  

* John Baldwin <jhb@FreeBSD.org> [071227 15:20] wrote:
> One of the things we have at work is a kern.features sysctl MIB that contains 
> nodes to indicate if a named feature is present.  For example, on i386 we 
> have kern.features.pae and we auto enable -DPAE for kernel modules if the 
> currently running kernel is using PAE using that sysctl.
> 
> One of the patches I want to commit soon is support for handling 
> shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the 
> long-heralded memfd stuff).  I would like to have the sysctl MIB so that 
> libc's for older releases (e.g. libc.so.6) could use the syscalls if they are 
> available so that shm segments are shared between compat apps (e.g. 4.x or 
> 6.x) and up-to-date apps.
> 
> At work we don't have a pretty API for this at all, but I'm thinking for 
> FreeBSD we can do this:
> 
> FEATURE(foo, "description of foo")
> 
> which is a macro to create the 'kern.features.foo' node and set it to 1.  Then 
> we could have a routine in libc:
> 
> int	feature_present(const char *name);
> 
> That returns a boolean to indicate if a given feature is present or not by 
> invoking sysctlbyname(3), etc.
> 
> Any objections to the idea?
> 
> -- 
> John Baldwin

-- 
- Alfred Perlstein

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 09:03:09 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A598816A418
	for <freebsd-arch@freebsd.org>; Fri, 28 Dec 2007 09:03:09 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
Received: from swip.net (mailfe14.swipnet.se [212.247.155.161])
	by mx1.freebsd.org (Postfix) with ESMTP id E3D9013C448
	for <freebsd-arch@freebsd.org>; Fri, 28 Dec 2007 09:03:07 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
X-Cloudmark-Score: 0.000000 []
Received: from [193.217.102.3] (account mc467741@c2i.net HELO [10.0.0.249])
	by mailfe14.swip.net (CommuniGate Pro SMTP 5.1.13)
	with ESMTPA id 11653783; Fri, 28 Dec 2007 10:03:06 +0100
From: Hans Petter Selasky <hselasky@c2i.net>
To: freebsd-arch@freebsd.org
Date: Fri, 28 Dec 2007 10:03:50 +0100
User-Agent: KMail/1.9.7
References: <18378.1196596684@critter.freebsd.dk>
	<4752AABE.6090006@freebsd.org> <200712271805.40972.jhb@freebsd.org>
In-Reply-To: <200712271805.40972.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712281003.52062.hselasky@c2i.net>
Cc: Andre Oppermann <andre@freebsd.org>, Attilio Rao <attilio@freebsd.org>,
	arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>,
	Robert Watson <rwatson@freebsd.org>
Subject: Re: New "timeout" api, to replace callout
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 09:03:09 -0000

On Friday 28 December 2007, John Baldwin wrote:
> On Sunday 02 December 2007 07:53:18 am Andre Oppermann wrote:
> > Poul-Henning Kamp wrote:
> > > In message <4752998A.9030007@freebsd.org>, Andre Oppermann writes:
> > >>  o TCP puts the timer into an allocated structure and upon close of
> > >> the session it has to be deallocated including stopping of all
> > >> currently running timers.
> > >>    [...]
> > >>     -> The timer facility should provide an atomic stop/remove call
> > >>        that prevent any further callbacks upon return.  It should not
> > >>        do a 'drain' where the callback may be run anyway.
> > >>        Note: We hold the lock the callback would have to obtain.
> > >
> > > It is my intent, that the implementation behind the new API will
> > > only ever grab the specified lock when it calls the timeout function.
> >
> > This is the same for the current one and pretty much a given.
> >
> > > When you do a timeout_disable() or timeout_cleanup() you will be
> > > sleeping on a mutex internal to the implementation, if the timeout
> > > is currently executing.
> >
> > This is the problematic part.  We can't sleep in TCP when cleaning up
> > the timer.  We're not always called from userland but from interrupt
> > context.  And when calling the cleanup we currently hold the lock the
> > callout wants to obtain.  We can't drop it either as the race would
> > be back again.  What you describe here is the equivalent of
> > callout_drain().  This is unfortunately unworkable in TCP's context.  The
> > callout has to go away even if it is already pending and waiting on
> > the lock.  Maybe that can only be solved by a flag in the lock saying
> > "give up and go away".
>
> The reason you need to do a drain is to allow for safe destroying of the
> lock. Specifically, drivers tend to do this:
>
> 	FOO_LOCK(sc);
> 	...
> 	callout_stop(...);
> 	FOO_UNLOCK(sc);
> 	...
> 	callout_drain(...);
> 	...
> 	mtx_destroy(&sc->foo_mtx);
>
> If you don't have the drain and softclock is trying to acquire the backing
> mutex while you have it held (before the callout_stop) then Bad Things can
> happen if you don't do the drain.  Having the lock just "give up" doesn't
> work either because if the memory containing the lock is free'd and
> reinitialized such that it looks enough like a valid lock then softclock
> (or its equivalent) will still try to obtain it.  Also, you need to do a
> drain so it is safe to free the callout structure to prevent it from being
> recycled and having weird races where it gets recycled and rescheduled but
> the timer code thinks it has a pending stop for that pointer and so it
> aborts the wrong instance of the timer, etc.

Hi,

I completely agree with what John Baldwin wrote. You need two
stop functions:

xxx_stop(), which is non-blocking, and
xxx_drain(), which can block, i.e. sleep

BTW: The USB code in P4 uses the same semantics, for the same reasons:

usbd_transfer_stop() and usbd_transfer_drain()

The only difference is that I pass an error code to the callback, which may
still run after usbd_transfer_stop() has been called.

I think that xxx_stop() and xxx_drain() is a generic approach that should be 
applied to all callback systems. Whenever you have a callback you need to be 
able to stop it and drain it.
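
To make the distinction concrete, here is a minimal userland sketch of the two calls, using pthreads. It is only a model under stated assumptions: the xxx_cb structure, xxx_init(), xxx_dispatch() and the xxx_count() example handler are hypothetical names invented for this illustration, and none of this is the kernel callout or USB code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback handle; a userland model, not a kernel API. */
struct xxx_cb {
	pthread_mutex_t	mtx;		/* protects the fields below */
	pthread_cond_t	idle;		/* broadcast when the handler returns */
	bool		stopped;	/* no new invocation may start */
	bool		running;	/* handler is executing right now */
	void		(*fn)(void *);
	void		*arg;
};

/* Example handler used for illustration: bumps an int counter. */
void
xxx_count(void *arg)
{
	++*(int *)arg;
}

void
xxx_init(struct xxx_cb *cb, void (*fn)(void *), void *arg)
{
	assert(fn != NULL);
	pthread_mutex_init(&cb->mtx, NULL);
	pthread_cond_init(&cb->idle, NULL);
	cb->stopped = false;
	cb->running = false;
	cb->fn = fn;
	cb->arg = arg;
}

/* Called by the dispatcher thread; returns true if the handler ran. */
bool
xxx_dispatch(struct xxx_cb *cb)
{
	pthread_mutex_lock(&cb->mtx);
	if (cb->stopped) {
		pthread_mutex_unlock(&cb->mtx);
		return (false);
	}
	cb->running = true;
	pthread_mutex_unlock(&cb->mtx);

	cb->fn(cb->arg);		/* handler runs without the internal lock */

	pthread_mutex_lock(&cb->mtx);
	cb->running = false;
	pthread_cond_broadcast(&cb->idle);
	pthread_mutex_unlock(&cb->mtx);
	return (true);
}

/*
 * Never sleeps waiting for the handler: forbids new invocations, but one
 * may still be in flight.  Returns true if the handler was running.
 */
bool
xxx_stop(struct xxx_cb *cb)
{
	bool was_running;

	pthread_mutex_lock(&cb->mtx);
	cb->stopped = true;
	was_running = cb->running;
	pthread_mutex_unlock(&cb->mtx);
	return (was_running);
}

/* May sleep: returns only once no invocation is in flight. */
void
xxx_drain(struct xxx_cb *cb)
{
	pthread_mutex_lock(&cb->mtx);
	cb->stopped = true;
	while (cb->running)
		pthread_cond_wait(&cb->idle, &cb->mtx);
	pthread_mutex_unlock(&cb->mtx);
}
```

In this model it is only safe to free the structure (or destroy a lock the handler takes) after xxx_drain() has returned, and xxx_drain() must not be called from the handler itself or while holding a lock the handler needs.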

--HPS

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 10:30:15 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8DBCD16A41A;
	Fri, 28 Dec 2007 10:30:15 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 4783813C4E3;
	Fri, 28 Dec 2007 10:30:15 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id E705746EB6;
	Fri, 28 Dec 2007 05:30:14 -0500 (EST)
Date: Fri, 28 Dec 2007 10:30:14 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Hans Petter Selasky <hselasky@c2i.net>
In-Reply-To: <200712281003.52062.hselasky@c2i.net>
Message-ID: <20071228102544.J45653@fledge.watson.org>
References: <18378.1196596684@critter.freebsd.dk>
	<4752AABE.6090006@freebsd.org> <200712271805.40972.jhb@freebsd.org>
	<200712281003.52062.hselasky@c2i.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Andre Oppermann <andre@freebsd.org>, Attilio Rao <attilio@freebsd.org>,
	arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>,
	freebsd-arch@freebsd.org
Subject: Re: New "timeout" api, to replace callout
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 10:30:15 -0000


On Fri, 28 Dec 2007, Hans Petter Selasky wrote:

>> The reason you need to do a drain is to allow for safe destroying of the
>> lock. Specifically, drivers tend to do this:
>>
>> 	FOO_LOCK(sc);
>> 	...
>> 	callout_stop(...);
>> 	FOO_UNLOCK(sc);
>> 	...
>> 	callout_drain(...);
>> 	...
>> 	mtx_destroy(&sc->foo_mtx);
>>
>> If you don't have the drain and softclock is trying to acquire the backing 
>> mutex while you have it held (before the callout_stop) then Bad Things can 
>> happen if you don't do the drain.  Having the lock just "give up" doesn't 
>> work either because if the memory containing the lock is free'd and 
>> reinitialized such that it looks enough like a valid lock then softclock 
>> (or its equivalent) will still try to obtain it.  Also, you need to do a 
>> drain so it is safe to free the callout structure to prevent it from being 
>> recycled and having weird races where it gets recycled and rescheduled but 
>> the timer code thinks it has a pending stop for that pointer and so it 
>> aborts the wrong instance of the timer, etc.
>
> I completely agree to what John Baldwin is writing. You need two 
> stop-functions:
>
> xxx_stop which is non-blocking and xxx_drain which can block i.e. sleep
>
> BTW: The USB code in P4 uses the same semantics, due to the same reasons:
>
> usbd_transfer_stop() and usbd_transfer_drain()
>
> The only difference is that I pass an error code to the callback which might 
> happen after that usbd_transfer_stop is called.
>
> I think that xxx_stop() and xxx_drain() is a generic approach that should be 
> applied to all callback systems. Whenever you have a callback you need to be 
> able to stop it and drain it.

I think the argument that Poul-Henning is making is not that you don't need
something that behaves "like drain", but rather that we'd like the wait for
drain to be a short, mutex-length wait rather than a long, msleep-length wait.
Remember that the bodies of callouts are expected to run in a very short
period of time in order to not stall the timer system; in fact, short enough
that a mutex could be held over the entire timeout call.  Given that this is
the case, one might reasonably expect callout_stop() to perform the drain
rather than having a separate call.  Such a model would be very advantageous
in TCP, where rather than having to defer GC'ing the inpcb/tcpcb to a GC
worker thread, which we don't do but should, we could use the stop call safely
and eliminate a whole class of races from the stack.

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 11:19:16 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D8C8416A469;
	Fri, 28 Dec 2007 11:19:16 +0000 (UTC)
	(envelope-from kris@FreeBSD.org)
Received: from weak.local (freefall.freebsd.org [IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 240AE13C509;
	Fri, 28 Dec 2007 11:19:15 +0000 (UTC)
	(envelope-from kris@FreeBSD.org)
Message-ID: <4774DBB2.5060707@FreeBSD.org>
Date: Fri, 28 Dec 2007 12:19:14 +0100
From: Kris Kennaway <kris@FreeBSD.org>
User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031)
MIME-Version: 1.0
To: John Baldwin <jhb@FreeBSD.org>
References: <200712271704.44796.jhb@FreeBSD.org>
In-Reply-To: <200712271704.44796.jhb@FreeBSD.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@FreeBSD.org
Subject: Re: kernel features MIB
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 11:19:16 -0000

John Baldwin wrote:
> One of the things we have at work is a kern.features sysctl MIB that contains 
> nodes to indicate if a named feature is present.  For example, on i386 we 
> have kern.features.pae and we auto enable -DPAE for kernel modules if the 
> currently running kernel is using PAE using that sysctl.
> 
> One of the patches I want to commit soon is support for handling 
> shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the 
> long-heralded memfd stuff).  I would like to have the sysctl MIB so that 
> libc's for older releases (e.g. libc.so.6) could use the syscalls if they are 
> available so that shm segments are shared between compat apps (e.g. 4.x or 
> 6.x) and up-to-date apps.
> 
> At work we don't have a pretty API for this at all, but I'm thinking for 
> FreeBSD we can do this:
> 
> FEATURE(foo, "description of foo")
> 
> which is a macro to create the 'kern.features.foo' node and set it to 1.  Then 
> we could have a routine in libc:
> 
> int	feature_present(const char *name);
> 
> That returns a boolean to indicate if a given feature is present or not by 
> invoking sysctlbyname(3), etc.
> 
> Any objections to the idea?
> 

I have wanted something like this for a long time.  In ports land they 
often need to know this kind of thing, e.g. is compat4x support enabled 
in the kernel, etc.
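
As a rough illustration of the userland half of the proposal, the routine could look something like the sketch below. The _sketch suffix is deliberate, since no such libc function exists yet; the 128-byte name buffer and the assumption that each node is an int are mine, not part of the proposal. On systems without sysctlbyname(3) the sketch simply reports every feature as absent.

```c
#include <sys/types.h>
#include <stddef.h>
#include <stdio.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif

/*
 * Hypothetical sketch of the proposed feature_present().  Returns 1 if
 * kern.features.<name> exists and is non-zero, 0 otherwise.
 */
int
feature_present_sketch(const char *name)
{
	char mib[128];			/* arbitrary limit for this sketch */
	int value = 0;
	int n;
#if defined(__FreeBSD__)
	size_t len = sizeof(value);
#endif

	if (name == NULL)
		return (0);
	n = snprintf(mib, sizeof(mib), "kern.features.%s", name);
	if (n < 0 || (size_t)n >= sizeof(mib))
		return (0);		/* name too long for the buffer */
#if defined(__FreeBSD__)
	/* A missing node (ENOENT) is treated as "feature absent". */
	if (sysctlbyname(mib, &value, &len, NULL, 0) != 0)
		return (0);
	if (len != sizeof(value))
		return (0);		/* node is not an int as assumed */
#endif
	return (value != 0);
}
```

A port or a compat libc could then test something like feature_present_sketch("pae") at run time instead of guessing from the release version.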

Kris

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 14:56:35 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5E3B516A46E;
	Fri, 28 Dec 2007 14:56:35 +0000 (UTC)
	(envelope-from gnn@neville-neil.com)
Received: from outbound0.mx.meer.net (outbound0.mx.meer.net [209.157.153.23])
	by mx1.freebsd.org (Postfix) with ESMTP id 21F6E13C4DD;
	Fri, 28 Dec 2007 14:56:34 +0000 (UTC)
	(envelope-from gnn@neville-neil.com)
Received: from mail.meer.net (mail.meer.net [209.157.152.14])
	by outbound0.sv.meer.net (8.12.10/8.12.6) with ESMTP id lBSDnHih047757; 
	Fri, 28 Dec 2007 05:49:17 -0800 (PST)
	(envelope-from gnn@neville-neil.com)
Received: from minion.local.neville-neil.com
	(61.204.211.246.customerlink.pwd.ne.jp [61.204.211.246])
	by mail.meer.net (8.13.3/8.13.3/meer) with ESMTP id lBSDnGBQ048390;
	Fri, 28 Dec 2007 05:49:16 -0800 (PST)
	(envelope-from gnn@neville-neil.com)
Date: Fri, 28 Dec 2007 22:49:15 +0900
Message-ID: <m2bq8bvsis.wl%gnn@neville-neil.com>
From: gnn@freebsd.org
To: Julian Elischer <julian@elischer.org>
In-Reply-To: <4772F123.5030303@elischer.org>
References: <4772F123.5030303@elischer.org>
User-Agent: Wanderlust/2.15.5 (Almost Unreal) SEMI/1.14.6 (Maruoka)
	FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.7 Emacs/22.1.50
	(i386-apple-darwin8.10.1) MULE/5.0 (SAKAKI)
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=US-ASCII
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Robert Watson <rwatson@freebsd.org>,
	Qing Li <qingli@freebsd.org>, arch@freebsd.org
Subject: Re: resend: multiple routing table roadmap (format fix)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 14:56:35 -0000

At Wed, 26 Dec 2007 16:26:11 -0800,
julian wrote:
> 
> Resending as my mailer made a dog's breakfast of the first one
> with all sorts of weird line breaks... hopefully this will be better.
> (I haven't sent it yet so I'm hoping)..
> 
> 
> -------------------------------------------
> 
> 
> 
> One thing where FreeBSD has been falling behind, and which by chance
> I have some time to work on is "policy based routing", which allows
> different packet streams to be routed by more than just the
> destination address.
> 
> Constraints:
> ------------
> 
> I want to make some form of this available in the 6.x tree
> (and by extension 7.x), but FreeBSD in general needs it so I might as
> well do it in -current and back port the portions I need.
> 
> One of the ways that this can be done is to have the ability to
> instantiate multiple kernel routing tables (which I will now
> refer to as "Forwarding Information Bases" or "FIBs" for political
> correctness reasons). Which FIB a particular packet uses to make
> the next hop decision can be decided by a number of mechanisms.
> The policies these mechanisms implement are the "Policies" referred
> to in "Policy based routing".
> 
> One of the constraints I have if I try to back port this work to
> 6.x is that it must be implemented as an EXTENSION to the existing
> ABIs in 6.x so that third party applications do not need to be
> recompiled in the timespan of the branch.
> 
> Implementation method, (part 1)
> -------------------------------
> For this reason I have implemented a "sufficient subset" of a
> multiple routing table solution in Perforce, and back-ported it
> to 6.x. (also in Perforce though not yet caught up with what I
> have done in -current/P4). The subset allows a number of FIBs
> to be defined at compile time (sufficient for my purposes in 6.x) and
> implements the changes needed to allow IPV4 to use them. I have not done
> the changes for ipv6 simply because I do not need it, and I do not
> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
> 
> Other protocol families are left untouched and should there be
> users with proprietary protocol families, they should continue to work
> and be oblivious to the existence of the extra FIBs.
> 
> To understand how this is done, one must know that the current FIB
> code starts everything off with a single dimensional array of
> pointers to FIB head structures (One per protocol family), each of
> which in turn points to the trie of routes available to that family.
> 
> The basic change in the ABI compatible version of the change is to
> extend that array to be a 2 dimensional array, so that
> instead of protocol family X looking at rt_tables[X] for the
> table it needs, it looks at rt_tables[Y][X], where for all
> protocol families except ipv4 Y is always 0.
> Code that is unaware of the change always just sees the first row
> of the table, which of course looks just like the one dimensional
> array that existed before.
> 
> 
> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
> are all maintained, but refer only to the first row of the array,
> so that existing callers in proprietary protocols can continue to
> do the "right thing".
> Some new entry points are added, for the exclusive use of ipv4 code
> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
> which have an extra argument which refers the code to the correct row.
> 
> In addition, there are some new entry points (currently called
> dom_rtalloc() and friends) that check the Address family being
> looked up and call either rtalloc() (and friends) if the protocol
> is not IPv4 forcing the action to row 0 or to the appropriate row
> if it IS IPv4 (and that info is available). These are for calling
> from code that is not specific to any particular protocol. The way
> these are implemented would change in the non ABI preserving code
> to be added later.
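
For readers following along, the table layout and the three kinds of entry point described above can be sketched as below. This is an illustration only: every sk_ name, the table sizes and the stand-in sk_rthead type are hypothetical and are not taken from the actual patch, which works on the real routing structures.

```c
#include <assert.h>
#include <stddef.h>

#define	SK_AF_MAX	4	/* hypothetical number of protocol families */
#define	SK_NFIB		3	/* hypothetical compile-time number of FIBs */
#define	SK_AF_INET	2	/* index standing in for IPv4 */

struct sk_rthead {		/* stand-in for a per-family table head */
	int	id;
};

/* Was indexed [af]; now [fib][af].  Row 0 is what old code sees. */
static struct sk_rthead *sk_rt_tables[SK_NFIB][SK_AF_MAX];

void
sk_rt_set(int fib, int af, struct sk_rthead *head)
{
	assert(fib >= 0 && fib < SK_NFIB);
	assert(af >= 0 && af < SK_AF_MAX);
	sk_rt_tables[fib][af] = head;
}

/* Legacy-style entry point: callers unaware of FIBs always use row 0. */
struct sk_rthead *
sk_rt_table(int af)
{
	assert(af >= 0 && af < SK_AF_MAX);
	return (sk_rt_tables[0][af]);
}

/* IPv4-only entry point: the extra argument selects the row. */
struct sk_rthead *
sk_in_rt_table(int fib)
{
	assert(fib >= 0 && fib < SK_NFIB);
	return (sk_rt_tables[fib][SK_AF_INET]);
}

/* Protocol-independent entry point: only IPv4 honours the FIB number. */
struct sk_rthead *
sk_dom_rt_table(int af, int fib)
{
	if (af == SK_AF_INET)
		return (sk_in_rt_table(fib));
	return (sk_rt_table(af));
}
```

The point of the layout is that sk_rt_tables[0] has exactly the shape of the old one-dimensional array, so code that was never recompiled keeps working against row 0.
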
> 
> One feature of the first version of the code is that for ipv4,
> the interface routes show up automatically on all the FIBs, so
> that no matter what FIB you select you always have the basic
> direct attached hosts available to you. (rtinit() does this
> automatically).
> You CAN delete an interface route from one FIB should you want
> to but by default it's there. ARP information is also available
> in each FIB. It's assumed that the same machine would have the
> same MAC address, regardless of which FIB you are using to get
> to it.
> 
> 
> This brings us as to how the correct FIB is selected for an outgoing
> IPV4 packet.
> 
> Packets fall into one of a number of classes.
> 1/ locally generated packets, coming from a socket/PCB.
>     Such packets select a FIB from a number associated with the
>     socket/PCB. This in turn is inherited from the process,
>     but can be changed by a socket option. The process in turn
>     inherits it on fork. I have written a utility called setfib
>     that acts a bit like nice.
> 
>         setfib -n 3 ping target.example.com # will use fib 3 for ping.
> 
> 2/ packets received on an interface for forwarding.
>     By default these packets would use table 0,
>     (or possibly a number settable in a sysctl(not yet)).
>     but prior to routing the firewall can inspect them (see below).
> 
> 3/ packets inspected by a packet classifier, which can arbitrarily
>     associate a fib with it on a packet by packet basis.
>     A fib assigned to a packet by a packet classifier
>     (such as ipfw) would over-ride a fib associated by
>     a more default source. (such as cases 1 or 2).
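
The precedence between the three cases above can be written down in a few lines. This is a hedged sketch, not code from the patch: sk_pkt, SK_FIB_UNSET and sk_select_fib() are hypothetical names, and the real code would carry the FIB number on the socket/PCB and on the packet rather than in a struct like this.

```c
#include <assert.h>
#include <stddef.h>

#define	SK_FIB_UNSET	(-1)	/* hypothetical "no FIB assigned" marker */

struct sk_pkt {
	int	classifier_fib;	/* set by a classifier such as ipfw, else unset */
	int	socket_fib;	/* from the sending socket/PCB, else unset */
};

/*
 * Pick the FIB for an outgoing IPv4 packet: a classifier's choice wins,
 * then the socket's FIB, then the default used for forwarded packets.
 */
int
sk_select_fib(const struct sk_pkt *p, int forward_default)
{
	assert(p != NULL);
	if (p->classifier_fib != SK_FIB_UNSET)
		return (p->classifier_fib);
	if (p->socket_fib != SK_FIB_UNSET)
		return (p->socket_fib);
	return (forward_default);
}
```

With forward_default set to 0 this matches the description: forwarded packets use table 0 unless the firewall says otherwise.
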
> 
> Routing messages would be associated with their
> process, and thus select one FIB or another.
> 
> In addition Netstat has been edited to be able to cope with the
> fact that the array is now 2 dimensional. (It looks in system
> memory using libkvm (!)).
> 
> In addition two sysctls are added to give:
> a) the number of FIBs compiled in (active)
> b) the default FIB of the calling process.
> 
> Early testing experience:
> -------------------------
> 
> Basically our (IronPort's) appliance does this functionality already
> using ipfw fwd but that method has some drawbacks.
> 
> For example,
> It can't fully simulate a routing table because it can't influence the
> socket's choice of local address when a connect() is done.
> 
> 
> Testing during the generating of these changes has been
> remarkably smooth so far. Multiple tables have co-existed
> with no notable side effects, and packets have been routed
> accordingly.
> 
> I have not yet added the changes to ipfw.
> pf has some similar changes already but they seem to rely on
> the various FIBs having symbolic names. Which I do not plan to support
> in the first version of these changes.
> 
> SCTP has, interestingly enough, built-in support for this, called VRFs
> in Cisco parlance. It will be interesting to see how that handles it
> when it suddenly actually does something.
> 
> I have not redone my testing since my last edits, but will be
> retesting with the current code asap.
> 
> 
> Where to next:
> --------------------
> 
> After committing the ABI compatible version and MFCing it, I'd
> like to proceed in a forward direction in -current. this will
> result in some roto-tilling in the routing code.
> 
> Firstly: the current code's idea of having a separate tree per
> protocol family, all of the same format, and pointed to by the
> 1 dimensional array is a bit silly. Especially when one considers that
> there is code that makes assumptions about every protocol having the same
> internal structures there. Some protocols don't WANT that
> sort of structure. (for example the whole idea of a netmask is foreign
> to appletalk). This needs to be made opaque to the external code.
> 
> My suggested first change is to add routing method pointers to the
> 'domain' structure, along with information pointing to the data.
> Instead of having an array of pointers to uniform structures,
> there would be an array pointing to the 'domain' structures
> for each protocol address domain (protocol family),
> and the methods thus reached would be called. The methods would have
> an argument that gives the FIB number, but the protocol would be free
> to ignore it.
> 
> Interaction with the ARP layer/ LL layer would need to be
> revisited as well. Qing Li has been working on this already.
> 
> 
> diffs
> for those with p4 access:
> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
> //depot/user/julian/routing/src/sys/...
> 
> for those with the makediff perl script:
> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 
> //depot/user/julian/routing/src/sys/...
> 
> for those with neither:
> 
> http://people.freebsd.org/~julian/mrt2.diff
> 
> I just put the userland utility in usr.sbin/setfib/ in p4.
> and changes to netstat in usr.bin/netstat/
> 
> see:
> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
> 
> 
> 
> 
> I'd like to get comments on this (compat) version, so that I can
> commit it, get general testing under way to start the clock for MFC,
> and then get moving on the fuller implementation (that breaks ABIs)
> and other routing issues.
> 

How does this work with Marko Zec's virtual stack system?

Best,
George

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 15:15:05 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 50FC216A419
	for <arch@freebsd.org>; Fri, 28 Dec 2007 15:15:05 +0000 (UTC)
	(envelope-from ivo.vachkov@gmail.com)
Received: from wx-out-0506.google.com (wx-out-0506.google.com [66.249.82.236])
	by mx1.freebsd.org (Postfix) with ESMTP id BAD5913C44B
	for <arch@freebsd.org>; Fri, 28 Dec 2007 15:15:04 +0000 (UTC)
	(envelope-from ivo.vachkov@gmail.com)
Received: by wx-out-0506.google.com with SMTP id i29so990613wxd.7
	for <arch@freebsd.org>; Fri, 28 Dec 2007 07:15:04 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	bh=Xgce59oOh55fxeTG2J6YoIG8nu/AgUUjXBJ/AWMokiQ=;
	b=fzgYaAoedMzuUszJkE+GNfe/ea4DqL7/0IEoBi5AP0yY3yaII2mMzfLKQU5nYMg36cPsoN4mmbrrPfFAL24SH2o0PfDd4ZRdjLR9Ro1YHMlLm4ooQQKe8sBU9shqfzVISxuZ7QkSZYosPB5cWncR+GKj1iONsVSiFTLHjvKB7Zw=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	b=h+hJojK1RXsDr/WZwnFxT9GJjnvQLycMaqF82K6zCN/ip4cM35z2JSdhkaVJnnZe1cB+7Xs5Ea5MwZVvyiB1RWwedWe8TTQbY8+wRUCNK9hqVY+i0poM+MKuM5KszSeKGeBgDOaUow4eVcuAiCPfiepyUaNhmAnFp+5+1jB5mu0=
Received: by 10.150.197.8 with SMTP id u8mr2605151ybf.131.1198854903741;
	Fri, 28 Dec 2007 07:15:03 -0800 (PST)
Received: by 10.150.219.5 with HTTP; Fri, 28 Dec 2007 07:15:03 -0800 (PST)
Message-ID: <f85d6aa70712280715q6adbfc2bj595584153ca6cadf@mail.gmail.com>
Date: Fri, 28 Dec 2007 17:15:03 +0200
From: "Ivo Vachkov" <ivo.vachkov@gmail.com>
To: "Julian Elischer" <julian@elischer.org>
In-Reply-To: <477416CC.4090906@elischer.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <4772F123.5030303@elischer.org>
	<f85d6aa70712261728h331eadb8p205d350dc7fb7f4c@mail.gmail.com>
	<477416CC.4090906@elischer.org>
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Robert Watson <rwatson@freebsd.org>,
	Qing Li <qingli@freebsd.org>, arch@freebsd.org
Subject: Re: resend: multiple routing table roadmap (format fix)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 15:15:05 -0000

On Dec 27, 2007 11:19 PM, Julian Elischer <julian@elischer.org> wrote:
>
> Ivo Vachkov wrote:
> > On Dec 27, 2007 2:26 AM, Julian Elischer <julian@elischer.org> wrote:
> >> Resending as my mailer made a dog's breakfast of the first one
> >> with all sorts of weird line breaks... hopefully this will be better.
> >> (I haven't sent it yet so I'm hoping)..
> >>
> >>
> >> -------------------------------------------
> >>
> >>
> >>
> >> One thing where FreeBSD has been falling behind, and which by chance I
> >> have some time to work on, is "policy based routing", which allows
> >> different packet streams to be routed by more than just the destination
> >> address.
> >>
> >> Constraints:
> >> ------------
> >>
> >> I want to make some form of this available in the 6.x tree
> >> (and by extension 7.x), but FreeBSD in general needs it so I might as
> >> well do it in -current and back port the portions I need.
> >>
> >> One of the ways that this can be done is to have the ability to
> >> instantiate multiple kernel routing tables (which I will now
> >> refer to as "Forwarding Information Bases" or "FIBs" for political
> >> correctness reasons). Which FIB a particular packet uses to make
> >> the next hop decision can be decided by a number of mechanisms.
> >> The policies these mechanisms implement are the "Policies" referred
> >> to in "Policy based routing".
> >>
> >> One of the constraints I have if I try to back port this work to
> >> 6.x is that it must be implemented as an EXTENSION to the existing
> >> ABIs in 6.x so that third party applications do not need to be
> >> recompiled in the timespan of the branch.
> >>
> >> Implementation method, (part 1)
> >> -------------------------------
> >> For this reason I have implemented a "sufficient subset" of a
> >> multiple routing table solution in Perforce, and back-ported it
> >> to 6.x. (also in Perforce though not yet caught up with what I
> >> have done in -current/P4). The subset allows a number of FIBs
> >> to be defined at compile time (sufficient for my purposes in 6.x) and
> >> implements the changes needed to allow IPV4 to use them. I have not done
> >> the changes for ipv6 simply because I do not need it, and I do not
> >> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
>
> By the way, I might add that in the 6.x compat. version I may end up
> limiting the feature to 8 tables. This is because I need to store some
> stuff in an efficient way in the mbuf, and in a compatible manner this
> is easiest done by stealing the top 4 bits in the mbuf flags word
> and defining them as:
>
>   #define M_HAVEFIB     0x10000000   /* bit 28: a FIB number is stored */
>   #define M_FIBMASK     0x07
>   #define M_FIBNUM      0xe0000000   /* bits 29-31: the FIB number */
>   #define M_FIBSHIFT    29
>   #define m_getfib(_m, _default) (((_m)->m_flags & M_HAVEFIB) ? \
>     (((_m)->m_flags >> M_FIBSHIFT) & M_FIBMASK) : (_default))
>   #define M_SETFIB(_m, _fib) do { \
>     (_m)->m_flags &= ~M_FIBNUM; \
>     (_m)->m_flags |= (M_HAVEFIB | (((_fib) & M_FIBMASK) << M_FIBSHIFT)); \
>   } while (0)
>
> This then becomes very easy to change to use a tag or
> whatever is needed in later versions , and the number can
> be expanded past 8 predefined  FIBs at that time..
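
For anyone who wants to poke at the bit layout outside the kernel, here is a
minimal user-space sketch of the flag-packing scheme quoted above. It is not
Julian's code: struct mbuf_sketch is a hypothetical stand-in for struct mbuf,
the macros are turned into inline functions, and it assumes the flag tested
in m_getfib is meant to be M_HAVEFIB.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct mbuf; only the flags word matters here. */
struct mbuf_sketch {
	uint32_t m_flags;
};

#define M_HAVEFIB   0x10000000u  /* bit 28: a FIB number has been stored */
#define M_FIBNUM    0xe0000000u  /* bits 29-31: the stored FIB number */
#define M_FIBMASK   0x07u        /* three bits, so FIBs 0..7 */
#define M_FIBSHIFT  29

/* Return the FIB stored in the flags word, or dflt if none was stored. */
static inline unsigned
m_getfib(const struct mbuf_sketch *m, unsigned dflt)
{
	if ((m->m_flags & M_HAVEFIB) == 0)
		return (dflt);
	return ((m->m_flags >> M_FIBSHIFT) & M_FIBMASK);
}

/* Store fib (truncated to three bits) without touching the other flags. */
static inline void
m_setfib(struct mbuf_sketch *m, unsigned fib)
{
	m->m_flags &= ~M_FIBNUM;
	m->m_flags |= M_HAVEFIB | ((fib & M_FIBMASK) << M_FIBSHIFT);
}
```

Note that a FIB number larger than 7 is silently truncated by the mask, which
is where the limit of 8 tables comes from.
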
>
> >>
> >> Other protocol families are left untouched and should there be
> >> users with proprietary protocol families, they should continue to work
> >> and be oblivious to the existence of the extra FIBs.
> >>
> >> To understand how this is done, one must know that the current FIB
> >> code starts everything off with a single dimensional array of
> >> pointers to FIB head structures (One per protocol family), each of
> >> which in turn points to the trie of routes available to that family.
> >>
> >> The basic change in the ABI compatible version of the change is to
> >> extend that array to be a 2 dimensional array, so that
> >> instead of protocol family X looking at rt_tables[X] for the
> >> table it needs, it looks at rt_tables[Y][X] where, for all
> >> protocol families except ipv4, Y is always 0.
> >> Code that is unaware of the change always just sees the first row
> >> of the table, which of course looks just like the one dimensional
> >> array that existed before.
> >
> > Pretty much like the OpenBSD approach :)
>
> well, I did look at the code briefly, but I didn't base it on it..
>
>
> >
> >> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
> >> are all maintained, but refer only to the first row of the array,
> >> so that existing callers in proprietary protocols can continue to
> >> do the "right thing".
> >> Some new entry points are added, for the exclusive use of ipv4 code
> >> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
> >> which have an extra argument which refers the code to the correct row.
> >>
> >> In addition, there are some new entry points (currently called
> >> dom_rtalloc() and friends) that check the Address family being
> >> looked up and call either rtalloc() (and friends) if the protocol
> >> is not IPv4 forcing the action to row 0 or to the appropriate row
> >> if it IS IPv4 (and that info is available). These are for calling
> >> from code that is not specific to any particular protocol. The way
> >> these are implemented would change in the non ABI preserving code
> >> to be added later.
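
The two-dimensional table and the family-checking entry points described
above can be sketched roughly as follows. This is only an illustration: the
_SKETCH constants, struct rib_head_sketch and both function names are
hypothetical, and falling back to row 0 for an out-of-range FIB number is an
assumption, not something the description specifies.

```c
#include <stddef.h>

#define AF_MAX_SKETCH   4   /* hypothetical: number of protocol families */
#define AF_INET_SKETCH  2   /* hypothetical: index of the IPv4 family */
#define NUMFIBS_SKETCH  3   /* hypothetical: FIB count fixed at compile time */

/* Stands in for the per-family head of the route tree. */
struct rib_head_sketch {
	int id;
};

/* rt_tables[fib][family]; row 0 is laid out like the old 1-D array. */
static struct rib_head_sketch *rt_tables[NUMFIBS_SKETCH][AF_MAX_SKETCH];

/* What FIB-unaware callers effectively see: always the first row. */
static struct rib_head_sketch *
rt_table_legacy(int family)
{
	return (rt_tables[0][family]);
}

/* Family-checking lookup: only IPv4 may select a row other than 0. */
static struct rib_head_sketch *
rt_table_get(int fib, int family)
{
	if (family != AF_INET_SKETCH || fib < 0 || fib >= NUMFIBS_SKETCH)
		fib = 0;	/* assumption: bad FIB numbers fall back to row 0 */
	return (rt_tables[fib][family]);
}
```
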
> >>
> >> One feature of the first version of the code is that for ipv4,
> >> the interface routes show up automatically on all the FIBs, so
> >> that no matter what FIB you select you always have the basic
> >> direct attached hosts available to you. (rtinit() does this
> >> automatically).
> >> You CAN delete an interface route from one FIB should you want
> >> to but by default it's there. ARP information is also available
> >> in each FIB. It's assumed that the same machine would have the
> >> same MAC address, regardless of which FIB you are using to get
> >> to it.
> >>
> >>
> >> This brings us as to how the correct FIB is selected for an outgoing
> >> IPV4 packet.
> >>
> >> Packets fall into one of a number of classes.
> >> 1/ locally generated packets, coming from a socket/PCB.
> >>     Such packets select a FIB from a number associated with the
> >>     socket/PCB. This in turn is inherited from the process,
> >>     but can be changed by a socket option. The process in turn
> >>     inherits it on fork. I have written a utility called setfib
> >>     that acts a bit like nice..
> >>
> >>         setfib -n 3 ping target.example.com # will use fib 3 for ping.
> >>
> >> 2/ packets received on an interface for forwarding.
> >>     By default these packets would use table 0,
> >>     (or possibly a number settable in a sysctl(not yet)).
> >>     but prior to routing the firewall can inspect them (see below).
> >>
> >> 3/ packets inspected by a packet classifier, which can arbitrarily
> >>     associate a fib with it on a packet by packet basis.
> >>     A fib assigned to a packet by a packet classifier
> >>     (such as ipfw) would over-ride a fib associated by
> >>     a more default source. (such as cases 1 or 2).
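
The precedence among the three classes above could be expressed as a small
selection function. This is a sketch only: struct pkt_sketch and its fields
are hypothetical, and the forwarding default of 0 is hard-coded here because
the sysctl mentioned in 2/ does not exist yet.

```c
#include <stdbool.h>

/* Hypothetical per-packet summary of where a FIB number could come from. */
struct pkt_sketch {
	bool from_socket;     /* class 1: locally generated via a socket/PCB */
	int  socket_fib;      /* inherited from the process or set by sockopt */
	bool classifier_set;  /* class 3: a classifier tagged this packet */
	int  classifier_fib;
};

static int
select_fib(const struct pkt_sketch *p)
{
	/* A classifier's choice overrides every default source. */
	if (p->classifier_set)
		return (p->classifier_fib);
	/* Locally generated packets use the socket's FIB. */
	if (p->from_socket)
		return (p->socket_fib);
	/* Class 2: forwarded packets default to table 0. */
	return (0);
}
```
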
> >
> > For the 2/ and 3/ cases I added (in some personal work I've been doing
> > lately) an additional field in struct mbuf, which a packet filter or
> > other application can set on receipt to point at the right table to use
> > for the lookup. This way a simple "marking" can be used to divide
> > different flows and create policy based routing.
>
> This would be the final way but I want to really minimise problems
> in the compat versions, so I'll avoid doing that for now.
>
> Do you have this work available?

I have it. However, I'd break an NDA if I 'opened' it.

> And have you looked at my diffs below?

I plan to look at your code asap.

>
> >
> >> Routing messages would be associated with their
> >> process, and thus select one FIB or another.
> >>
> >> In addition Netstat has been edited to be able to cope with the
> >> fact that the array is now 2 dimensional. (It looks in system
> >> memory using libkvm (!)).
> >>
> >> In addition two sysctls are added to give:
> >> a) the number of FIBs compiled in (active)
> >> b) the default FIB of the calling process.
> >>
> >> Early testing experience:
> >> -------------------------
> >>
> >> Basically our (IronPort's) appliance does this functionality already
> >> using ipfw fwd but that method has some drawbacks.
> >>
> >> For example,
> >> It can't fully simulate a routing table because it can't influence the
> >> socket's choice of local address when a connect() is done.
> >>
> >>
> >> Testing during the generating of these changes has been
> >> remarkably smooth so far. Multiple tables have co-existed
> >> with no notable side effects, and packets have been routed
> >> accordingly.
> >>
> >> I have not yet added the changes to ipfw.
> >> pf has some similar changes already but they seem to rely on
> >> the various FIBs having symbolic names. Which I do not plan to support
> >> in the first version of these changes.
> >>
> >> SCTP has interestingly enough built in support for this, called VRFs
> >> in Cisco parlance. it will be interesting to see how that handles it
> >> when it suddenly actually does something.
> >>
> >> I have not redone my testing since my last edits, but will be
> >> retesting with the current code asap.
> >>
> >>
> >> Where to next:
> >> --------------------
> >>
> >> After committing the ABI compatible version and MFCing it, I'd
> >> like to proceed in a forward direction in -current. this will
> >> result in some roto-tilling in the routing code.
> >>
> >> Firstly: the current code's idea of having a separate tree per
> >> protocol family, all of the same format, and pointed to by the
> >> 1 dimensional array is a bit silly. Especially when one considers that
> >> there is code that makes assumptions about every protocol having the
> >> same
> >> internal structures there. Some protocols don't WANT that
> >> sort of structure. (for example the whole idea of a netmask is foreign
> >> to appletalk). This needs to be made opaque to the external code.
> >>
> >> My suggested first change is to add routing method pointers to the
> >> 'domain' structure, along with information pointing to the data.
> >> Instead of having an array of pointers to uniform structures,
> >> there would be an array pointing to the 'domain' structures
> >> for each protocol address domain (protocol family),
> >> and the methods reached through them would be called. The methods would
> >> have an argument that gives FIB number, but the protocol would be free
> >> to ignore it.
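
One possible shape for the per-domain routing methods proposed above, as a
rough sketch. Every name here (domain_sketch, rt_methods_sketch, rtm_lookup,
the two toy protocols) is hypothetical, and a real lookup would return a
route rather than a bare interface index. The point is only that callers go
through the domain's methods and never see the protocol's own data layout.

```c
#include <stddef.h>

#define AF_MAX_SKETCH    3   /* hypothetical number of address families */
#define AF_INET_SKETCH   1
#define AF_ATALK_SKETCH  2
#define NUMFIBS_SKETCH   2

struct domain_sketch;

/* Routing methods a protocol hangs off its domain structure. */
struct rt_methods_sketch {
	/* Returns an outgoing interface index, or -1 if there is no route. */
	int (*rtm_lookup)(struct domain_sketch *dom, int fib, unsigned dst);
};

struct domain_sketch {
	int   dom_family;
	const struct rt_methods_sketch *dom_rtmethods;
	void *dom_rtdata;	/* opaque to everything but the protocol */
};

/* IPv4-like protocol: one default interface per FIB. */
static int inet_ifidx[NUMFIBS_SKETCH] = { 10, 20 };

static int
inet_lookup(struct domain_sketch *dom, int fib, unsigned dst)
{
	int *per_fib = dom->dom_rtdata;

	(void)dst;
	if (fib < 0 || fib >= NUMFIBS_SKETCH)
		return (-1);
	return (per_fib[fib]);
}

/* AppleTalk-like protocol: a single table, so the FIB number is ignored. */
static int atalk_ifidx = 30;

static int
atalk_lookup(struct domain_sketch *dom, int fib, unsigned dst)
{
	(void)fib;
	(void)dst;
	return (*(int *)dom->dom_rtdata);
}

static const struct rt_methods_sketch inet_methods = { inet_lookup };
static const struct rt_methods_sketch atalk_methods = { atalk_lookup };

static struct domain_sketch inet_domain =
    { AF_INET_SKETCH, &inet_methods, inet_ifidx };
static struct domain_sketch atalk_domain =
    { AF_ATALK_SKETCH, &atalk_methods, &atalk_ifidx };

/* The array of domains that replaces the array of uniform tree heads. */
static struct domain_sketch *domains[AF_MAX_SKETCH] =
    { NULL, &inet_domain, &atalk_domain };

/* Protocol-independent entry point: dispatch through the domain. */
static int
route_lookup(int family, int fib, unsigned dst)
{
	struct domain_sketch *dom;

	if (family < 0 || family >= AF_MAX_SKETCH)
		return (-1);
	dom = domains[family];
	if (dom == NULL || dom->dom_rtmethods == NULL)
		return (-1);
	return (dom->dom_rtmethods->rtm_lookup(dom, fib, dst));
}
```
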
> >>
> >> Interaction with the ARP layer/ LL layer would need to be
> >> revisited as well. Qing Li has been working on this already.
> >>
> >>
> >> diffs
> >> for those with p4 access:
> >> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
> >> //depot/user/julian/routing/src/sys/...
> >>
> >> for those with the makediff perl script:
> >> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121
> >> //depot/user/julian/routing/src/sys/...
> >>
> >> for those with neither:
> >>
> >> http://people.freebsd.org/~julian/mrt2.diff
> >>
> >> I just put the userland utility in usr.sbin/setfib/ in p4.
> >> and changes to netstat in usr.bin/netstat/
> >>
> >> see:
> >> http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
> >>
> >>
> >>
> >>
> >> I'd like to get comments on this (compat) version, so that I can
> >> commit it,
> >> get general testing under way to start the clock for MFC, and then get
> >> moving on the fuller implementation (that breaks ABIs) and other
> >> routing issues.
> >>
> >>
> >> Julian
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> freebsd-arch@freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> >> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
> >>
>
>


-- 
"UNIX is basically a simple operating system, but you have to be a
genius to understand the simplicity." Dennis Ritchie

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 17:17:04 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6523E16A420
	for <arch@freebsd.org>; Fri, 28 Dec 2007 17:17:04 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outD.internet-mail-service.net (outD.internet-mail-service.net
	[216.240.47.227])
	by mx1.freebsd.org (Postfix) with ESMTP id 2F74C13C45A
	for <arch@freebsd.org>; Fri, 28 Dec 2007 17:17:04 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160)
	by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP;
	Fri, 28 Dec 2007 09:17:03 -0800
Received: from julian-mac.elischer.org (localhost [127.0.0.1])
	by idiom.com (Postfix) with ESMTP id 01D3E126DA3;
	Fri, 28 Dec 2007 09:17:02 -0800 (PST)
Message-ID: <47752F98.6050209@elischer.org>
Date: Fri, 28 Dec 2007 09:17:12 -0800
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031)
MIME-Version: 1.0
To: gnn@freebsd.org
References: <4772F123.5030303@elischer.org>
	<m2bq8bvsis.wl%gnn@neville-neil.com>
In-Reply-To: <m2bq8bvsis.wl%gnn@neville-neil.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Robert Watson <rwatson@freebsd.org>,
	Qing Li <qingli@freebsd.org>, arch@freebsd.org
Subject: Re: resend: multiple routing table roadmap (format fix)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 17:17:04 -0000

gnn@freebsd.org wrote:
> At Wed, 26 Dec 2007 16:26:11 -0800,
> julian wrote:
[...]
> 
> How does this work with Marko Zec's virtual stack system?
> 
> Best,
> George

orthogonal

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 17:30:13 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CBAE816A46E
	for <arch@freebsd.org>; Fri, 28 Dec 2007 17:30:13 +0000 (UTC)
	(envelope-from SRS0=mkQ4=RT=tm.uka.de=max.laier@srs.kundenserver.de)
Received: from moutng.kundenserver.de (moutng.kundenserver.de
	[212.227.126.174])
	by mx1.freebsd.org (Postfix) with ESMTP id 44BF713C46E
	for <arch@freebsd.org>; Fri, 28 Dec 2007 17:30:13 +0000 (UTC)
	(envelope-from SRS0=mkQ4=RT=tm.uka.de=max.laier@srs.kundenserver.de)
Received: from vampire.homelinux.org (dslb-088-066-001-237.pools.arcor-ip.net
	[88.66.1.237])
	by mrelayeu.kundenserver.de (node=mrelayeu7) with ESMTP (Nemesis)
	id 0ML2xA-1J8Iq03Cp7-0005oJ; Fri, 28 Dec 2007 18:17:37 +0100
Received: (qmail 25901 invoked by uid 80); 28 Dec 2007 17:17:00 -0000
Received: from 2001:6f8:12c8:1:21d:60ff:fe0c:1771
	(SquirrelMail authenticated user mlaier)
	by router.laiers.local with HTTP;
	Fri, 28 Dec 2007 18:17:00 +0100 (CET)
Message-ID: <43684.2001:6f8:12c8:1:21d:60ff:fe0c:1771.1198862220.squirrel@router.laiers.local>
In-Reply-To: <200712271704.44796.jhb@FreeBSD.org>
References: <200712271704.44796.jhb@FreeBSD.org>
Date: Fri, 28 Dec 2007 18:17:00 +0100 (CET)
From: "Max Laier" <max.laier@tm.uka.de>
To: "John Baldwin" <jhb@FreeBSD.org>
User-Agent: SquirrelMail/1.4.13
MIME-Version: 1.0
Content-Type: text/plain;charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Priority: 3 (Normal)
Importance: Normal
X-Provags-ID: V01U2FsdGVkX1+8Z4peOOieo9iKYjZ2mV9zj0elWboszuGnoKX
	tmQ0f9HGEdFuGDTbwm7VR75LJHy5UNzbFR3jIpLyOqduCKuqKe
	2iZQzY/TRSwxHB789Q1Sg==
Cc: arch@freebsd.org
Subject: Re: kernel features MIB
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 17:30:13 -0000


Am Do, 27.12.2007, 23:04, schrieb John Baldwin:
> One of the things we have at work is a kern.features sysctl MIB that
> contains nodes to indicate if a named feature is present.  For example, on
> i386 we have kern.features.pae and we auto enable -DPAE for kernel modules
> if the currently running kernel is using PAE using that sysctl.
>
> One of the patches I want to commit soon is support for handling
> shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the
> long-heralded memfd stuff).  I would like to have the sysctl MIB so that
> libc's for older releases (e.g. libc.so.6) could use the syscalls if they
> are available so that shm segments are shared between compat apps (e.g.
> 4.x or 6.x) and up-to-date apps.
>
> At work we don't have a pretty API for this at all, but I'm thinking for
> FreeBSD we can do this:
>
> FEATURE(foo, "description of foo")
>
> which is a macro to create the 'kern.features.foo' node and set it to 1.
> Then we could have a routine in libc:
>
> int	feature_present(const char *name);
>
> That returns a boolean to indicate if a given feature is present or not by
> invoking sysctlbyname(3), etc.
>
> Any objections to the idea?

Sounds like a good idea indeed.  What about modules, though?  Would it
make sense to have something ident/strings parseable in the .kld to
identify features provided by that module?  feature_present (or
_available) could search the default module paths and return which module
needs to be loaded.  This could depend on FEATURE(kld, ...) and maybe
kern.securelevel.
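
For reference, the basic feature_present() idea from the quoted proposal
might look roughly like the sketch below. To keep it portable the actual
lookup is passed in as a callback. On FreeBSD that callback would presumably
wrap sysctlbyname(3) as John suggests. The names feature_present_with,
feature_lookup_fn and demo_lookup are made up for illustration, and the
module-path search is not covered.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup hook: returns 0 and fills *value if the node exists. */
typedef int (*feature_lookup_fn)(const char *mib, int *value);

/* Build "kern.features.<name>" and report whether it reads as non-zero. */
static int
feature_present_with(feature_lookup_fn lookup, const char *name)
{
	char mib[128];
	int n, value = 0;

	n = snprintf(mib, sizeof(mib), "kern.features.%s", name);
	if (n < 0 || (size_t)n >= sizeof(mib))
		return (0);	/* name too long: treat as absent */
	if (lookup(mib, &value) != 0)
		return (0);	/* node missing: feature not present */
	return (value != 0);
}

/* Stand-in lookup for demonstration: only kern.features.pae exists. */
static int
demo_lookup(const char *mib, int *value)
{
	if (strcmp(mib, "kern.features.pae") == 0) {
		*value = 1;
		return (0);
	}
	return (-1);
}
```

A missing node and a node set to 0 both report "not present", which matches
the boolean return described in the proposal.
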

-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 19:39:51 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 90EEE16A418;
	Fri, 28 Dec 2007 19:39:51 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219])
	by mx1.freebsd.org (Postfix) with ESMTP id D785613C458;
	Fri, 28 Dec 2007 19:39:50 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from server.baldwin.cx (unverified [66.23.211.162]) 
	by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226422892-1834499 
	for multiple; Fri, 28 Dec 2007 14:41:57 -0500
Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBSJdX4B064130;
	Fri, 28 Dec 2007 14:39:35 -0500 (EST) (envelope-from jhb@freebsd.org)
From: John Baldwin <jhb@freebsd.org>
To: Robert Watson <rwatson@freebsd.org>
Date: Fri, 28 Dec 2007 12:25:13 -0500
User-Agent: KMail/1.9.6
References: <18378.1196596684@critter.freebsd.dk>
	<200712281003.52062.hselasky@c2i.net>
	<20071228102544.J45653@fledge.watson.org>
In-Reply-To: <20071228102544.J45653@fledge.watson.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712281225.14954.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by
	milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]);
	Fri, 28 Dec 2007 14:39:36 -0500 (EST)
X-Virus-Scanned: ClamAV 0.91.2/5278/Fri Dec 28 11:55:36 2007 on
	server.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.3
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx
Cc: Andre Oppermann <andre@freebsd.org>, Hans Petter Selasky <hselasky@c2i.net>,
	Attilio Rao <attilio@freebsd.org>, arch@freebsd.org,
	Poul-Henning Kamp <phk@phk.freebsd.dk>, freebsd-arch@freebsd.org
Subject: Re: New "timeout" api, to replace callout
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 19:39:51 -0000

On Friday 28 December 2007 05:30:14 am Robert Watson wrote:
> 
> On Fri, 28 Dec 2007, Hans Petter Selasky wrote:
> 
> >> The reason you need to do a drain is to allow for safe destroying of the
> >> lock. Specifically, drivers tend to do this:
> >>
> >> 	FOO_LOCK(sc);
> >> 	...
> >> 	callout_stop(...);
> >> 	FOO_UNLOCK(sc);
> >> 	...
> >> 	callout_drain(...);
> >> 	...
> >> 	mtx_destroy(&sc->foo_mtx);
> >>
> >> If you don't have the drain and softclock is trying to acquire the
> >> backing mutex while you have it held (before the callout_stop) then Bad
> >> Things can happen if you don't do the drain.  Having the lock just "give
> >> up" doesn't work either because if the memory containing the lock is
> >> free'd and reinitialized such that it looks enough like a valid lock
> >> then softclock (or its equivalent) will still try to obtain it.  Also,
> >> you need to do a drain so it is safe to free the callout structure to
> >> prevent it from being recycled and having weird races where it gets
> >> recycled and rescheduled but the timer code thinks it has a pending stop
> >> for that pointer and so it aborts the wrong instance of the timer, etc.
> >
> > I completely agree to what John Baldwin is writing. You need two 
> > stop-functions:
> >
> > xxx_stop which is non-blocking and xxx_drain which can block i.e. sleep
> >
> > BTW: The USB code in P4 uses the same semantics, due to the same reasons:
> >
> > usbd_transfer_stop() and usbd_transfer_drain()
> >
> > The only difference is that I pass an error code to the callback which
> > might happen after that usbd_transfer_stop is called.
> >
> > I think that xxx_stop() and xxx_drain() is a generic approach that should
> > be applied to all callback systems. Whenever you have a callback you need
> > to be able to stop it and drain it.
> 
> I think the argument that Poul-Henning is making is not that you don't need
> something that behaves "like drain", but rather, we'd like the wait for
> drain to be a short, mutex-length wait rather than a long, msleep-length
> wait.  Remember that the bodies of callouts are expected to run in a very
> short period of time in order to not stall the timer system, in fact, in
> such a way that a mutex could be held over the entire timeout call.  Given
> that this is the case, one might reasonably expect callout_stop() to
> perform the drain rather than having a separate call.  Such a model would
> be very advantageous in TCP, where rather than having to defer GC'ing the
> inpcb/tcpcb to a GC worker thread, which we don't do but should, we could
> use the stop call safely and eliminate a whole class of races from the
> stack.

The problem is if softclock() (or similar replacement in future) was preempted 
and hasn't run yet but has already gotten to the point that callout_drain() 
has to block, then sitting in a spin loop waiting for softclock() to 
acknowledge the stop isn't very optimal.  The amount of time you are asleep 
in this case is actually very small, and you probably won't even block the 
vast majority of the time if you follow the 'lock / stop / unlock / drain' 
model.  You only sleep when you lose the race and softclock() has chosen to 
run your callout and is waiting for the driver/client/whatever's lock.
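
To make the stop/drain distinction concrete, here is a small user-space
analogue using pthreads. It is not the kernel callout code. All the co_*
names are hypothetical, and a condition variable stands in for the sleep in
callout_drain(). co_stop() never blocks. co_drain() blocks only in the rare
case where the handler is already running.

```c
#include <pthread.h>
#include <stdbool.h>

struct callout_sketch {
	pthread_mutex_t lock;     /* protects the fields below */
	pthread_cond_t  idle;     /* signalled when the handler finishes */
	bool            pending;  /* scheduled but not yet started */
	bool            running;  /* handler is executing right now */
};

static void
co_init(struct callout_sketch *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->idle, NULL);
	c->pending = false;
	c->running = false;
}

static void
co_schedule(struct callout_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	c->pending = true;
	pthread_mutex_unlock(&c->lock);
}

/* Non-blocking stop: true if a pending invocation was cancelled. */
static bool
co_stop(struct callout_sketch *c)
{
	bool was_pending;

	pthread_mutex_lock(&c->lock);
	was_pending = c->pending;
	c->pending = false;
	pthread_mutex_unlock(&c->lock);
	return (was_pending);
}

/* Blocking drain: cancel, then wait out a handler that already started. */
static void
co_drain(struct callout_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	c->pending = false;
	while (c->running)
		pthread_cond_wait(&c->idle, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Timer side: run the handler only if it is still pending. */
static bool
co_fire(struct callout_sketch *c, void (*fn)(void *), void *arg)
{
	pthread_mutex_lock(&c->lock);
	if (!c->pending) {
		pthread_mutex_unlock(&c->lock);
		return (false);
	}
	c->pending = false;
	c->running = true;
	pthread_mutex_unlock(&c->lock);

	fn(arg);	/* the handler runs without the callout lock held */

	pthread_mutex_lock(&c->lock);
	c->running = false;
	pthread_cond_broadcast(&c->idle);
	pthread_mutex_unlock(&c->lock);
	return (true);
}

/* Example handler: counts how many times it ran. */
static void
co_count_handler(void *arg)
{
	(*(int *)arg)++;
}
```

Once co_drain() returns, the handler is neither pending nor running, so it is
safe to destroy the lock and free the structure, which is the property the
'lock / stop / unlock / drain' sequence relies on.
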

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 19:39:51 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 90EEE16A418;
	Fri, 28 Dec 2007 19:39:51 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219])
	by mx1.freebsd.org (Postfix) with ESMTP id D785613C458;
	Fri, 28 Dec 2007 19:39:50 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from server.baldwin.cx (unverified [66.23.211.162]) 
	by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226422892-1834499 
	for multiple; Fri, 28 Dec 2007 14:41:57 -0500
Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBSJdX4B064130;
	Fri, 28 Dec 2007 14:39:35 -0500 (EST) (envelope-from jhb@freebsd.org)
From: John Baldwin <jhb@freebsd.org>
To: Robert Watson <rwatson@freebsd.org>
Date: Fri, 28 Dec 2007 12:25:13 -0500
User-Agent: KMail/1.9.6
References: <18378.1196596684@critter.freebsd.dk>
	<200712281003.52062.hselasky@c2i.net>
	<20071228102544.J45653@fledge.watson.org>
In-Reply-To: <20071228102544.J45653@fledge.watson.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712281225.14954.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by
	milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]);
	Fri, 28 Dec 2007 14:39:36 -0500 (EST)
X-Virus-Scanned: ClamAV 0.91.2/5278/Fri Dec 28 11:55:36 2007 on
	server.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.3
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx
Cc: Andre Oppermann <andre@freebsd.org>, Hans Petter Selasky <hselasky@c2i.net>,
	Attilio Rao <attilio@freebsd.org>, arch@freebsd.org,
	Poul-Henning Kamp <phk@phk.freebsd.dk>, freebsd-arch@freebsd.org
Subject: Re: New "timeout" api, to replace callout
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 19:39:51 -0000

On Friday 28 December 2007 05:30:14 am Robert Watson wrote:
> 
> On Fri, 28 Dec 2007, Hans Petter Selasky wrote:
> 
> >> The reason you need to do a drain is to allow for safe destroying of the
> >> lock. Specifically, drivers tend to do this:
> >>
> >> 	FOO_LOCK(sc);
> >> 	...
> >> 	callout_stop(...);
> >> 	FOO_UNLOCK(sc);
> >> 	...
> >> 	callout_drain(...);
> >> 	...
> >> 	mtx_destroy(&sc->foo_mtx);
> >>
> >> If you don't have the drain and softclock is trying to acquire the backing
> >> mutex while you have it held (before the callout_stop) then Bad Things can
> >> happen if you don't do the drain.  Having the lock just "give up" doesn't
> >> work either because if the memory containing the lock is free'd and
> >> reinitialized such that it looks enough like a valid lock then softclock
> >> (or its equivalent) will still try to obtain it.  Also, you need to do a
> >> drain so it is safe to free the callout structure to prevent it from being
> >> recycled and having weird races where it gets recycled and rescheduled but
> >> the timer code thinks it has a pending stop for that pointer and so it
> >> aborts the wrong instance of the timer, etc.
> >
> > I completely agree with what John Baldwin is writing. You need two 
> > stop-functions:
> >
> > xxx_stop which is non-blocking and xxx_drain which can block i.e. sleep
> >
> > BTW: The USB code in P4 uses the same semantics, due to the same reasons:
> >
> > usbd_transfer_stop() and usbd_transfer_drain()
> >
> > The only difference is that I pass an error code to the callback which might
> > happen after that usbd_transfer_stop is called.
> >
> > I think that xxx_stop() and xxx_drain() is a generic approach that should be
> > applied to all callback systems. Whenever you have a callback you need to be
> > able to stop it and drain it.
> 
> I think the argument that Poul-Henning is making is not that you don't need
> something that behaves "like drain", but rather, we'd like the wait for drain
> to be a short, mutex-length wait rather than a long, msleep-length wait.
> Remember that the bodies of callouts are expected to run in a very short
> period of time in order to not stall the timer system, in fact, in such a way
> that a mutex could be held over the entire timeout call.  Given that this is
> the case, one might reasonably expect callout_stop() to perform the drain
> rather than having a separate call.  Such a model would be very advantageous
> in TCP, where rather than having to defer GC'ing the inpcb/tcpcb to a GC
> worker thread, which we don't do but should, we could use the stop call safely
> and eliminate a whole class of races from the stack.

The problem is if softclock() (or similar replacement in future) was preempted 
and hasn't run yet but has already gotten to the point that callout_drain() 
has to block, then sitting in a spin loop waiting for softclock() to 
acknowledge the stop isn't very optimal.  The amount of time you are asleep 
in this case is actually very small, and you probably won't even block the 
vast majority of the time if you follow the 'lock / stop / unlock / drain' 
model.  You only sleep when you lose the race and softclock() has chosen to 
run your callout and is waiting for the driver/client/whatever's lock.
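
To make the 'lock / stop / unlock / drain' ordering concrete, here is a
userland sketch that models it with pthreads.  It is only an illustration
under stated assumptions: struct softc, struct timer, softclock_model(),
timer_stop(), timer_drain() and teardown_demo() are hypothetical names, not
the kernel's callout(9) code, and the real softclock() is more involved.
Build with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct softc {
	pthread_mutex_t mtx;	/* stands in for sc->foo_mtx */
	int ticks;		/* state the handler updates under mtx */
};

struct timer {
	pthread_mutex_t lock;	/* timer-internal lock */
	pthread_cond_t idle;	/* signalled when a running handler finishes */
	bool pending;		/* scheduled, not yet picked up */
	bool running;		/* picked up; handler may be waiting on sc->mtx */
	struct softc *sc;
};

/* Model of softclock(): pick up the timer, then take the client lock. */
static void *
softclock_model(void *arg)
{
	struct timer *t = arg;

	pthread_mutex_lock(&t->lock);
	if (t->pending) {
		t->pending = false;
		t->running = true;
		pthread_mutex_unlock(&t->lock);

		pthread_mutex_lock(&t->sc->mtx);	/* may wait for the client */
		t->sc->ticks++;
		pthread_mutex_unlock(&t->sc->mtx);

		pthread_mutex_lock(&t->lock);
		t->running = false;
		pthread_cond_broadcast(&t->idle);
	}
	pthread_mutex_unlock(&t->lock);
	return (NULL);
}

/* Non-blocking: cancels only a timer that has not been picked up yet. */
static bool
timer_stop(struct timer *t)
{
	bool cancelled;

	pthread_mutex_lock(&t->lock);
	cancelled = t->pending;
	t->pending = false;
	pthread_mutex_unlock(&t->lock);
	return (cancelled);
}

/* Blocking: sleeps only while a picked-up handler is still in flight. */
static void
timer_drain(struct timer *t)
{
	pthread_mutex_lock(&t->lock);
	t->pending = false;
	while (t->running)
		pthread_cond_wait(&t->idle, &t->lock);
	assert(!t->running);
	pthread_mutex_unlock(&t->lock);
}

/* Returns 1 if the teardown left the state consistent. */
static int
teardown_demo(void)
{
	struct softc sc;
	struct timer t;
	pthread_t thr;
	bool cancelled;
	int ok;

	pthread_mutex_init(&sc.mtx, NULL);
	sc.ticks = 0;
	pthread_mutex_init(&t.lock, NULL);
	pthread_cond_init(&t.idle, NULL);
	t.pending = true;
	t.running = false;
	t.sc = &sc;
	if (pthread_create(&thr, NULL, softclock_model, &t) != 0)
		return (0);

	pthread_mutex_lock(&sc.mtx);		/* FOO_LOCK(sc) */
	cancelled = timer_stop(&t);		/* callout_stop() */
	pthread_mutex_unlock(&sc.mtx);		/* FOO_UNLOCK(sc) */
	timer_drain(&t);			/* callout_drain() */

	/* The handler either never ran or has fully finished. */
	ok = cancelled ? (sc.ticks == 0) : (sc.ticks == 1);
	pthread_mutex_destroy(&sc.mtx);		/* mtx_destroy(&sc->foo_mtx) */

	pthread_join(thr, NULL);
	pthread_cond_destroy(&t.idle);
	pthread_mutex_destroy(&t.lock);
	return (ok);
}
```

In this model timer_drain() only reaches pthread_cond_wait() when the timer
thread picked the timer up before timer_stop() ran; in every other
interleaving it returns without sleeping, and the client mutex is never
touched after the drain returns.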

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 19:40:06 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9F3A116A418
	for <arch@freebsd.org>; Fri, 28 Dec 2007 19:40:06 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219])
	by mx1.freebsd.org (Postfix) with ESMTP id 357F613C474
	for <arch@freebsd.org>; Fri, 28 Dec 2007 19:40:06 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from server.baldwin.cx (unverified [66.23.211.162]) 
	by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226422916-1834499 
	for multiple; Fri, 28 Dec 2007 14:42:04 -0500
Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBSJdg7G064138;
	Fri, 28 Dec 2007 14:39:43 -0500 (EST) (envelope-from jhb@freebsd.org)
From: John Baldwin <jhb@freebsd.org>
To: "Max Laier" <max.laier@tm.uka.de>
Date: Fri, 28 Dec 2007 13:00:28 -0500
User-Agent: KMail/1.9.6
References: <200712271704.44796.jhb@FreeBSD.org>
	<43684.2001:6f8:12c8:1:21d:60ff:fe0c:1771.1198862220.squirrel@router.laiers.local>
In-Reply-To: <43684.2001:6f8:12c8:1:21d:60ff:fe0c:1771.1198862220.squirrel@router.laiers.local>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712281300.28899.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by
	milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]);
	Fri, 28 Dec 2007 14:39:43 -0500 (EST)
X-Virus-Scanned: ClamAV 0.91.2/5278/Fri Dec 28 11:55:36 2007 on
	server.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.3
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx
Cc: arch@freebsd.org
Subject: Re: kernel features MIB
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 19:40:06 -0000

On Friday 28 December 2007 12:17:00 pm Max Laier wrote:
> 
> Am Do, 27.12.2007, 23:04, schrieb John Baldwin:
> > One of the things we have at work is a kern.features sysctl MIB that
> > contains
> > nodes to indicate if a named feature is present.  For example, on i386 we
> > have kern.features.pae and we auto enable -DPAE for kernel modules if the
> > currently running kernel is using PAE using that sysctl.
> >
> > One of the patches I want to commit soon is support for handling
> > shm_open/shm_unlink directly in the kernel via swap-backed VM objects (the
> > long-heralded memfd stuff).  I would like to have the sysctl MIB so that
> > libc's for older releases (e.g. libc.so.6) could use the syscalls if they
> > are
> > available so that shm segments are shared between compat apps (e.g. 4.x or
> > 6.x) and up-to-date apps.
> >
> > At work we don't have a pretty API for this at all, but I'm thinking for
> > FreeBSD we can do this:
> >
> > FEATURE(foo, "description of foo")
> >
> > which is a macro to create the 'kern.features.foo' node and set it to 1.
> > Then
> > we could have a routine in libc:
> >
> > int	feature_present(const char *name);
> >
> > That returns a boolean to indicate if a given feature is present or not by
> > invoking sysctlbyname(3), etc.
> >
> > Any objections to the idea?
> 
> Sounds like a good idea indeed.  What about modules, though?  Would it
> make sense to have something ident/strings parseable in the .kld to
> identify features provided by that module?  feature_present (or
> _available) could search the default module paths and return which module
> needs to be loaded.  This could depend on FEATURE(kld, ...) and maybe
> kern.securelevel.

You could have a userland tool that parses the linker set for sysctl's and 
uses the name of the symbol to figure this out if that was desired.  Modules 
already have the MODULE_DEPEND stuff available that could be used, but I'm 
thinking about things that aren't in modules.
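
As a rough illustration of the registration side, here is a userland model of
what a FEATURE() macro could do.  It is a sketch under assumptions, not the
proposed kernel macro: in the kernel the node would be a sysctl placed in a
linker set, whereas this model uses a compiler constructor attribute (GCC and
clang) as a stand-in, and struct feature, feature_list and
feature_present_model() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct feature {
	const char *name;	/* would become kern.features.<name> */
	const char *desc;	/* human-readable description */
	struct feature *next;
};

static struct feature *feature_list;	/* stand-in for the sysctl linker set */

/* Define one feature record and register it before main() runs. */
#define	FEATURE(n, d)							\
	static struct feature feature_##n = { #n, d, NULL };		\
	__attribute__((constructor)) static void			\
	feature_register_##n(void)					\
	{								\
		feature_##n.next = feature_list;			\
		feature_list = &feature_##n;				\
	}

FEATURE(pae, "Physical Address Extensions")
FEATURE(posix_shm, "Kernel-backed shm_open/shm_unlink")

/* Returns 1 if a feature with this name was registered, else 0. */
static int
feature_present_model(const char *name)
{
	assert(name != NULL);
	for (struct feature *f = feature_list; f != NULL; f = f->next)
		if (strcmp(f->name, name) == 0)
			return (1);
	return (0);
}
```

A tool that walks the real linker set would recover the same name list that
feature_list holds here, which is what makes the module lookup idea workable.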

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 20:01:50 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AAD2516A469;
	Fri, 28 Dec 2007 20:01:50 +0000 (UTC) (envelope-from zec@tel.fer.hr)
Received: from xaqua.tel.fer.hr (xaqua.tel.fer.hr [161.53.19.25])
	by mx1.freebsd.org (Postfix) with ESMTP id E7FEF13C45D;
	Fri, 28 Dec 2007 20:01:49 +0000 (UTC) (envelope-from zec@tel.fer.hr)
Received: by xaqua.tel.fer.hr (Postfix, from userid 20006)
	id 9E5D99B742; Fri, 28 Dec 2007 20:42:43 +0100 (CET)
X-Spam-Checker-Version: SpamAssassin 3.1.7 (2006-10-05) on xaqua.tel.fer.hr
X-Spam-Level: 
X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.7
Received: from [192.168.200.112] (zec2.tel.fer.hr [161.53.19.79])
	by xaqua.tel.fer.hr (Postfix) with ESMTP id ABBC89B6C9;
	Fri, 28 Dec 2007 20:42:40 +0100 (CET)
From: Marko Zec <zec@tel.fer.hr>
To: freebsd-arch@freebsd.org,
 FreeBSD Net <freebsd-net@freebsd.org>
Date: Fri, 28 Dec 2007 20:40:30 +0100
User-Agent: KMail/1.9.7
References: <4772F123.5030303@elischer.org>
	<m2bq8bvsis.wl%gnn@neville-neil.com>
In-Reply-To: <m2bq8bvsis.wl%gnn@neville-neil.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712282040.30745.zec@tel.fer.hr>
Cc: gnn@freebsd.org, Robert Watson <rwatson@freebsd.org>,
	Julian Elischer <julian@elischer.org>, Qing Li <qingli@freebsd.org>
Subject: Re: resend: multiple routing table roadmap (format fix)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 20:01:50 -0000

On Friday 28 December 2007 14:49:15 gnn@freebsd.org wrote:
> At Wed, 26 Dec 2007 16:26:11 -0800,
>
> julian wrote:
> >
> > One thing where FreeBSD has been falling behind, and which by chance
> > I have some time to work on is "policy based routing", which allows
> > different packet streams to be routed by more than just the
> > destination address.
> >
> > Constraints:
> > ------------
> >
> > I want to make some form of this available in the 6.x tree
> > (and by extension 7.x), but FreeBSD in general needs it so I might
> > as well do it in -current and back port the portions I need.
> >
> > One of the ways that this can be done is to have the ability to
> > instantiate multiple kernel routing tables (which I will now
> > refer to as "Forwarding Information Bases" or "FIBs" for political
> > correctness reasons). Which FIB a particular packet uses to make
> > the next hop decision can be decided by a number of mechanisms.
> > The policies these mechanisms implement are the "Policies" referred
> > to in "Policy based routing".
> >
> > One of the constraints I have if I try to back port this work to
> > 6.x is that it must be implemented as an EXTENSION to the existing
> > ABIs in 6.x so that third party applications do not need to be
> > recompiled during the lifetime of the branch.
> >
> > Implementation method, (part 1)
> > -------------------------------
> > For this reason I have implemented a "sufficient subset" of a
> > multiple routing table solution in Perforce, and back-ported it
> > to 6.x. (also in Perforce though not yet caught up with what I
> > have done in -current/P4). The subset allows a number of FIBs
> > to be defined at compile time (sufficient for my purposes in 6.x)
> > and implements the changes needed to allow IPV4 to use them. I have
> > not done the changes for ipv6 simply because I do not need it, and
> > I do not have enough knowledge of ipv6 (e.g. neighbor discovery)
> > needed to do it.
> >
> > Other protocol families are left untouched and should there be
> > users with proprietary protocol families, they should continue to
> > work and be oblivious to the existence of the extra FIBs.
> >
> > To understand how this is done, one must know that the current FIB
> > code starts everything off with a single dimensional array of
> > pointers to FIB head structures (One per protocol family), each of
> > which in turn points to the trie of routes available to that
> > family.
> >
> > The basic change in the ABI compatible version of the change is to
> > extend that array to be a 2 dimensional array, so that
> > instead of protocol family X looking at rt_tables[X] for the
> > table it needs, it looks at rt_tables[Y][X], where for all
> > protocol families except ipv4 Y is always 0.
> > Code that is unaware of the change always just sees the first row
> > of the table, which of course looks just like the one dimensional
> > array that existed before.
> >
> >
> > The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
> > are all maintained, but refer only to the first row of the array,
> > so that existing callers in proprietary protocols can continue to
> > do the "right thing".
> > Some new entry points are added, for the exclusive use of ipv4 code
> > called in_rtrequest(), in_rtalloc(), in_rtalloc1() and
> > in_rtalloc_ign(), which have an extra argument which refers the
> > code to the correct row.
> >
> > In addition, there are some new entry points (currently called
> > dom_rtalloc() and friends) that check the Address family being
> > looked up and call either rtalloc() (and friends) if the protocol
> > is not IPv4 forcing the action to row 0 or to the appropriate row
> > if it IS IPv4 (and that info is available). These are for calling
> > from code that is not specific to any particular protocol. The way
> > these are implemented would change in the non ABI preserving code
> > to be added later.
> >
> > One feature of the first version of the code is that for ipv4,
> > the interface routes show up automatically on all the FIBs, so
> > that no matter what FIB you select you always have the basic
> > direct attached hosts available to you. (rtinit() does this
> > automatically).
> > You CAN delete an interface route from one FIB should you want
> > to but by default it's there. ARP information is also available
> > in each FIB. It's assumed that the same machine would have the
> > same MAC address, regardless of which FIB you are using to get
> > to it.
> >
> >
> > This brings us as to how the correct FIB is selected for an
> > outgoing IPV4 packet.
> >
> > Packets fall into one of a number of classes.
> > 1/ locally generated packets, coming from a socket/PCB.
> >     Such packets select a FIB from a number associated with the
> >     socket/PCB. This in turn is inherited from the process,
> >     but can be changed by a socket option. The process in turn
> >     inherits it on fork. I have written a utility called setfib
> >     that acts a bit like nice:
> >
> >         setfib -n 3 ping target.example.com # will use fib 3 for ping.
> >
> > 2/ packets received on an interface for forwarding.
> >     By default these packets would use table 0,
> >     (or possibly a number settable in a sysctl(not yet)).
> >     but prior to routing the firewall can inspect them (see below).
> >
> > 3/ packets inspected by a packet classifier, which can arbitrarily
> >     associate a fib with it on a packet by packet basis.
> >     A fib assigned to a packet by a packet classifier
> >     (such as ipfw) would over-ride a fib associated by
> >     a more default source. (such as cases 1 or 2).
> >
> > Routing messages would be associated with their
> > process, and thus select one FIB or another.
> >
> > In addition Netstat has been edited to be able to cope with the
> > fact that the array is now 2 dimensional. (It looks in system
> > memory using libkvm (!)).
> >
> > In addition two sysctls are added to give:
> > a) the number of FIBs compiled in (active)
> > b) the default FIB of the calling process.
> >
> > Early testing experience:
> > -------------------------
> >
> > Basically our (IronPort's) appliance does this functionality
> > already using ipfw fwd but that method has some drawbacks.
> >
> > For example,
> > It can't fully simulate a routing table because it can't influence
> > the socket's choice of local address when a connect() is done.
> >
> >
> > Testing during the generating of these changes has been
> > remarkably smooth so far. Multiple tables have co-existed
> > with no notable side effects, and packets have been routed
> > accordingly.
> >
> > I have not yet added the changes to ipfw.
> > pf has some similar changes already but they seem to rely on
> > the various FIBs having symbolic names. Which I do not plan to
> > support in the first version of these changes.
> >
> > SCTP has, interestingly enough, built-in support for this, called
> > VRFs in Cisco parlance. It will be interesting to see how that
> > handles it when it suddenly actually does something.
> >
> > I have not redone my testing since my last edits, but will be
> > retesting with the current code asap.
> >
> >
> > Where to next:
> > --------------------
> >
> > After committing the ABI compatible version and MFCing it, I'd
> > like to proceed in a forward direction in -current. This will
> > result in some roto-tilling in the routing code.
> >
> > Firstly: the current code's idea of having a separate tree per
> > protocol family, all of the same format, and pointed to by the
> > 1 dimensional array is a bit silly, especially when one considers
> > that there is code that makes assumptions about every protocol
> > having the same internal structures there. Some protocols don't
> > WANT that sort of structure (for example, the whole idea of a
> > netmask is foreign to appletalk). This needs to be made opaque to
> > the external code.
> >
> > My suggested first change is to add routing method pointers to the
> > 'domain' structure, along with information pointing to the data.
> > Instead of having an array of pointers to uniform structures,
> > there would be an array pointing to the 'domain' structures
> > for each protocol address domain (protocol family),
> > and the methods reached this way would be called. The methods would
> > have an argument that gives the FIB number, but the protocol would be
> > free to ignore it.
> >
> > Interaction with the ARP layer/ LL layer would need to be
> > revisited as well. Qing Li has been working on this already.
> >
> >
> > diffs
> > for those with p4 access:
> > p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121 //depot/user/julian/routing/src/sys/...
> >
> > for those with the makediff perl script:
> > perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121 //depot/user/julian/routing/src/sys/...
> >
> > for those with neither:
> >
> > http://people.freebsd.org/~julian/mrt2.diff
> >
> > I just put the userland utility in usr.sbin/setfib/ in p4.
> > and changes to netstat in usr.bin/netstat/
> >
> > see:
> > http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/julian/routing/src&HIDEDEL=NO
> >
> > I'd like to get comments on this (compat) version, so that I can
> > commit it, get general testing under way to start the clock for
> > MFC, and then get moving on the fuller implementation (that breaks
> > ABIs) and other routing issues.
>
> How does this work with Marko Zec's virtual stack system?

The thrust behind Julian's work seems to be providing multiple 
forwarding tables for purposes of traffic engineering / policy 
based routing, with a single firewall instance used as a classifier.  
vimage-style network stack virtualization provides for more strict 
isolation on both port and IP address space, independent firewall 
instances, IPSEC config / state etc., and as such might be better 
suited for providing enhanced jail-style virtual hosting environments, 
as well as for providing virtual router "slices".

So once we get Julian's multi-FIB stuff in the base system, I see no 
reason why we couldn't have this functionality replicated in 
each "vimage" instance, i.e. have multiple independent virtual 
networking environments, each with multiple FIBs.
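
For readers following the quoted roadmap, the two-dimensional table it
describes can be sketched roughly as below.  This is a standalone model, not
the code in Perforce: NUM_FIBS, AF_SLOTS, AF_IPV4_MODEL, struct fib_head,
fib_init() and fib_select() are all hypothetical names, and returning NULL
for an out-of-range FIB number is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define	NUM_FIBS	4	/* hypothetical compile-time FIB count */
#define	AF_SLOTS	8	/* hypothetical stand-in for AF_MAX + 1 */
#define	AF_IPV4_MODEL	2	/* hypothetical slot used for IPv4 */

struct fib_head {		/* placeholder for the per-family trie head */
	int fibnum;
	int af;
};

static struct fib_head heads[NUM_FIBS][AF_SLOTS];
static struct fib_head *rt_tables_model[NUM_FIBS][AF_SLOTS];

/* Row 0 is populated for every family; rows 1..N-1 only for IPv4. */
static void
fib_init(void)
{
	for (int fib = 0; fib < NUM_FIBS; fib++) {
		for (int af = 0; af < AF_SLOTS; af++) {
			heads[fib][af].fibnum = fib;
			heads[fib][af].af = af;
			rt_tables_model[fib][af] =
			    (fib == 0 || af == AF_IPV4_MODEL) ?
			    &heads[fib][af] : NULL;
		}
	}
}

/* Only IPv4 honors fibnum; every other family is forced to row 0. */
static struct fib_head *
fib_select(int fibnum, int af)
{
	assert(af >= 0 && af < AF_SLOTS);
	if (af != AF_IPV4_MODEL)
		fibnum = 0;
	if (fibnum < 0 || fibnum >= NUM_FIBS)
		return (NULL);
	return (rt_tables_model[fibnum][af]);
}
```

Code that only ever indexes row 0 sees the same layout as the old
one-dimensional array.  A per-vimage version would presumably carry one such
table per instance.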

Implementation-wise, my hacks currently rely on macros for conditional 
virtualization of global variables / structs.  As long as Julian's 
changes continue to be unconditional, i.e. without playing a similar 
macroization game, I think integrating this code (once it hits HEAD) 
into p4/projects/vimage should be more or less a straightforward job.
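
To give an idea of what macro-based conditional virtualization of a global
can look like, here is a minimal sketch.  The names VIMAGE_MODEL,
struct vnet_model, curvnet_model and V_ipforwarding are made up for the
example and are not taken from p4/projects/vimage.

```c
#include <assert.h>
#include <stddef.h>

#define	VIMAGE_MODEL	1	/* remove to get the plain-global build */

#ifdef VIMAGE_MODEL
struct vnet_model {
	int ipforwarding;	/* one copy per virtual stack */
};
static struct vnet_model *curvnet_model;	/* currently selected instance */
#define	V_ipforwarding	(curvnet_model->ipforwarding)
#else
static int ipforwarding;			/* single global copy */
#define	V_ipforwarding	ipforwarding
#endif

/* Callers use the V_ name and never care which build they are in. */
static void
set_forwarding(int on)
{
#ifdef VIMAGE_MODEL
	assert(curvnet_model != NULL);
#endif
	V_ipforwarding = on;
}

static int
get_forwarding(void)
{
	return (V_ipforwarding);
}
```

The point of the sketch is that code written without such macros compiles the
same way in both configurations, which is why unconditional changes are
easier to merge.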

Marko

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 22:56:35 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5184016A419
	for <freebsd-arch@freebsd.org>; Fri, 28 Dec 2007 22:56:35 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219])
	by mx1.freebsd.org (Postfix) with ESMTP id 1BF2413C44B
	for <freebsd-arch@freebsd.org>; Fri, 28 Dec 2007 22:56:34 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from server.baldwin.cx (unverified [66.23.211.162]) 
	by speedfactory.net (SurgeMail 3.8q) with ESMTP id 226448658-1834499 
	for <freebsd-arch@freebsd.org>; Fri, 28 Dec 2007 17:58:50 -0500
Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id lBSMuWMw065495
	for <freebsd-arch@freebsd.org>; Fri, 28 Dec 2007 17:56:32 -0500 (EST)
	(envelope-from jhb@freebsd.org)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Date: Fri, 28 Dec 2007 17:45:07 -0500
User-Agent: KMail/1.9.6
References: <200712271704.44796.jhb@FreeBSD.org>
In-Reply-To: <200712271704.44796.jhb@FreeBSD.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200712281745.08144.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by
	milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]);
	Fri, 28 Dec 2007 17:56:32 -0500 (EST)
X-Virus-Scanned: ClamAV 0.91.2/5278/Fri Dec 28 11:55:36 2007 on
	server.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.3
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx
Subject: Re: kernel features MIB
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 22:56:35 -0000

On Thursday 27 December 2007 05:04:44 pm John Baldwin wrote:
> At work we don't have a pretty API for this at all, but I'm thinking for 
> FreeBSD we can do this:
> 
> FEATURE(foo, "description of foo")
> 
> which is a macro to create the 'kern.features.foo' node and set it to 1.  Then 
> we could have a routine in libc:
> 
> int	feature_present(const char *name);
> 
> That returns a boolean to indicate if a given feature is present or not by 
> invoking sysctlbyname(3), etc.
> 
> Any objections to the idea?

So here's a bikeshed question I have no idea for.  Which header should
feature_present()'s prototype go in?  I anticipate this routine being
used in libc itself, so I don't think it can go into libutil.
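
For reference while picking a header, the routine could look roughly like the
sketch below.  It is written against a pluggable lookup hook so it builds
anywhere; mib_read_fn, feature_present_with() and fake_read_mib() are
hypothetical names for the example.  On FreeBSD the hook would be a thin
wrapper that calls sysctlbyname(mib, oldp, oldlenp, NULL, 0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical lookup hook shaped like the read half of sysctlbyname(3):
 * fill *oldp, update *oldlenp, return 0 on success and -1 on failure.
 */
typedef int (*mib_read_fn)(const char *mib, void *oldp, size_t *oldlenp);

/* Returns 1 if kern.features.<name> exists and is non-zero, else 0. */
static int
feature_present_with(mib_read_fn read_mib, const char *name)
{
	char mib[128];
	int value = 0;
	size_t len = sizeof(value);
	int n;

	assert(read_mib != NULL && name != NULL);
	n = snprintf(mib, sizeof(mib), "kern.features.%s", name);
	if (n < 0 || (size_t)n >= sizeof(mib))
		return (0);		/* name too long to be a valid node */
	if (read_mib(mib, &value, &len) != 0)
		return (0);		/* node absent: treat as not present */
	if (len != sizeof(value))
		return (0);		/* unexpected type */
	return (value != 0);
}

/* Stand-in for the kernel: pretends only kern.features.pae exists. */
static int
fake_read_mib(const char *mib, void *oldp, size_t *oldlenp)
{
	int one = 1;

	if (strcmp(mib, "kern.features.pae") != 0 || *oldlenp < sizeof(one))
		return (-1);
	memcpy(oldp, &one, sizeof(one));
	*oldlenp = sizeof(one);
	return (0);
}
```

Treating a failed lookup as "not present" is what lets an older libc run on a
kernel that lacks the node entirely.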

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 22:57:25 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 68DEE16A46B;
	Fri, 28 Dec 2007 22:57:25 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 7646A13C45A;
	Fri, 28 Dec 2007 22:57:25 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id 1C8441A4D80; Fri, 28 Dec 2007 14:55:26 -0800 (PST)
Date: Fri, 28 Dec 2007 14:55:26 -0800
From: Alfred Perlstein <alfred@freebsd.org>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <20071228225526.GJ76698@elvis.mu.org>
References: <200712271704.44796.jhb@FreeBSD.org>
	<200712281745.08144.jhb@freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200712281745.08144.jhb@freebsd.org>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-arch@freebsd.org
Subject: Re: kernel features MIB
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Dec 2007 22:57:25 -0000

* John Baldwin <jhb@freebsd.org> [071228 14:54] wrote:
> On Thursday 27 December 2007 05:04:44 pm John Baldwin wrote:
> > At work we don't have a pretty API for this at all, but I'm thinking for 
> > FreeBSD we can do this:
> > 
> > FEATURE(foo, "description of foo")
> > 
> > which is a macro to create the 'kern.features.foo' node and set it to 1.  Then 
> > we could have a routine in libc:
> > 
> > int	feature_present(const char *name);
> > 
> > That returns a boolean to indicate if a given feature is present or not by 
> > invoking sysctlbyname(3), etc.
> > 
> > Any objections to the idea?
> 
> So here's a bikeshed question I have no idea for.  Which header should
> feature_present()'s prototype go in?  I anticipate this routine being
> used in libc itself, so I don't think it can go into libutil.

Wherever sysconf/pathconf stuff is.

-- 
- Alfred Perlstein

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 29 00:31:05 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1A84616A417;
	Sat, 29 Dec 2007 00:31:05 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id EE59113C447;
	Sat, 29 Dec 2007 00:31:04 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 41C2F47CAC;
	Fri, 28 Dec 2007 19:31:04 -0500 (EST)
Date: Sat, 29 Dec 2007 00:31:04 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: John Baldwin <jhb@freebsd.org>
In-Reply-To: <200712281745.08144.jhb@freebsd.org>
Message-ID: <20071229002903.M45653@fledge.watson.org>
References: <200712271704.44796.jhb@FreeBSD.org>
	<200712281745.08144.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-arch@freebsd.org
Subject: Re: kernel features MIB
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Dec 2007 00:31:05 -0000


On Fri, 28 Dec 2007, John Baldwin wrote:

> On Thursday 27 December 2007 05:04:44 pm John Baldwin wrote:
>
>> At work we don't have a pretty API for this at all, but I'm thinking for 
>> FreeBSD we can do this:
>>
>> FEATURE(foo, "description of foo")
>>
>> which is a macro to create the 'kern.features.foo' node and set it to 1. 
>> Then we could have a routine in libc:
>>
>> int feature_present(const char *name);
>>
>> That returns a boolean to indicate if a given feature is present or not by 
>> invoking sysctlbyname(3), etc.
>>
>> Any objections to the idea?
>
> So here's a bikeshed question I have no idea for.  Which header should 
> feature_present()'s prototype go in?  I anticipate this routine being used 
> in libc itself, so I don't think it can go into libutil.

#include <sys/feature.h>

feature_check(2)?

Does POSIX talk about the namespace for non-portable names being passed to 
sysconf(3)?

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 29 05:09:42 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 65C4716A418
	for <freebsd-arch@freebsd.org>; Sat, 29 Dec 2007 05:09:42 +0000 (UTC)
	(envelope-from gnn@neville-neil.com)
Received: from outbound0.mx.meer.net (outbound0.mx.meer.net [209.157.153.23])
	by mx1.freebsd.org (Postfix) with ESMTP id 5385013C469
	for <freebsd-arch@freebsd.org>; Sat, 29 Dec 2007 05:09:42 +0000 (UTC)
	(envelope-from gnn@neville-neil.com)
Received: from mail.meer.net (mail.meer.net [209.157.152.14])
	by outbound0.sv.meer.net (8.12.10/8.12.6) with ESMTP id lBT32Pih000379; 
	Fri, 28 Dec 2007 19:02:25 -0800 (PST)
	(envelope-from gnn@neville-neil.com)
Received: from minion.local.neville-neil.com
	(61.204.211.246.customerlink.pwd.ne.jp [61.204.211.246])
	by mail.meer.net (8.13.3/8.13.3/meer) with ESMTP id lBT32OIN095874;
	Fri, 28 Dec 2007 19:02:24 -0800 (PST)
	(envelope-from gnn@neville-neil.com)
Date: Sat, 29 Dec 2007 12:02:22 +0900
Message-ID: <m2hci2ursx.wl%gnn@neville-neil.com>
From: gnn@freebsd.org
To: Marko Zec <zec@tel.fer.hr>
In-Reply-To: <200712282040.30745.zec@tel.fer.hr>
References: <4772F123.5030303@elischer.org>
	<m2bq8bvsis.wl%gnn@neville-neil.com>
	<200712282040.30745.zec@tel.fer.hr>
User-Agent: Wanderlust/2.15.5 (Almost Unreal) SEMI/1.14.6 (Maruoka)
	FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.7 Emacs/22.1.50
	(i386-apple-darwin8.10.1) MULE/5.0 (SAKAKI)
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=US-ASCII
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Qing Li <qingli@freebsd.org>,
	Robert Watson <rwatson@freebsd.org>,
	Julian Elischer <julian@elischer.org>, freebsd-arch@freebsd.org
Subject: Re: resend: multiple routing table roadmap (format fix)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Dec 2007 05:09:42 -0000

At Fri, 28 Dec 2007 20:40:30 +0100,
Marko Zec wrote:
> The thrust behind Julian's work seems to be providing multiple 
> forwarding tables for purposes of traffic engineering / policy 
> based routing, with a single firewall instance used as a classifier.  
> vimage-style network stack virtualization provides for more strict 
> isolation on both port and IP address space, independent firewall 
> instances, IPSEC config / state etc., and as such might be better 
> suited for providing enhanced jail-style virtual hosting environments, 
> as well as for providing virtual router "slices".
> 
> So once we get Julian's multi-FIB stuff in the base system, I see no 
> reason why we couldn't have this functionality replicated in 
> each "vimage" instance, i.e. have multiple independent virtual 
> networking environments, each with multiple FIBs.
> 
> Implementationwise, my hacks currently rely on macros for conditional 
> virtualization of global variables / structs.  As long as Julian's 
> changes continue to be unconditional, i.e. without playing a similar 
> macroization game, I think integrating this code (once it hits HEAD) 
> into p4/projects/vimage should be more or less a straightforward job.

Cool, that's what I wanted to hear.

Best,
George

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 29 23:43:58 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1527A16A417
	for <arch@freebsd.org>; Sat, 29 Dec 2007 23:43:58 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com
	[216.240.101.25])
	by mx1.freebsd.org (Postfix) with ESMTP id D3D6913C4D3
	for <arch@freebsd.org>; Sat, 29 Dec 2007 23:43:57 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com
	[24.94.75.93]) (authenticated bits=0)
	by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id
	lBTNhrx4067553
	for <arch@freebsd.org>; Sat, 29 Dec 2007 18:43:54 -0500 (EST)
	(envelope-from jroberson@chesapeake.net)
Date: Sat, 29 Dec 2007 13:44:50 -1000 (HST)
From: Jeff Roberson <jroberson@chesapeake.net>
X-X-Sender: jroberson@desktop
To: arch@freebsd.org
Message-ID: <20071229133256.D957@desktop>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: 
Subject: kvm_getfiles is badly broken
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Dec 2007 23:43:58 -0000

>From kvm_getfiles(3):

      The number of files found is returned in the reference parameter cnt.
      The files are returned as a contiguous array of file structures,
      preceded by the address of the first file entry in the kernel.

sysctl kern.file is used if the kernel is live.  The kvm code assumes the 
kernel copies out a struct filelist before any files.  It does not.  I 
cannot find any consumers of this interface, however.  I also don't 
understand why it supplies the address of the first file or what that 
would be used for.
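
To make the mismatch concrete, here is a minimal sketch.  It is not the
libkvm code: struct fake_xfile and the three helpers are hypothetical
stand-ins, chosen only to show what happens when a reader expects a
leading kernel address in a buffer that holds records and nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one exported record; NOT the real struct xfile. */
struct fake_xfile {
	size_t	xf_size;	/* size of this record in bytes */
	int	xf_fd;		/* descriptor number */
};

/*
 * Record count if the buffer is assumed to start with a kernel address,
 * which is the layout kvm_getfiles(3) documents.
 */
static size_t
count_with_header(size_t buflen)
{
	if (buflen < sizeof(void *))
		return (0);
	return ((buflen - sizeof(void *)) / sizeof(struct fake_xfile));
}

/* Record count if the buffer holds records only, with nothing prepended. */
static size_t
count_without_header(size_t buflen)
{
	return (buflen / sizeof(struct fake_xfile));
}

/* Copy out record i, skipping `skip` bytes of assumed header first. */
static struct fake_xfile
read_record(const unsigned char *buf, size_t skip, size_t i)
{
	struct fake_xfile xf;

	assert(buf != NULL);
	memcpy(&xf, buf + skip + i * sizeof(xf), sizeof(xf));
	return (xf);
}
```

On common ABIs, a headerless buffer of three records yields three from
count_without_header() but only two from count_with_header(), and
read_record() with skip set to sizeof(void *) returns bytes that straddle
two adjacent records.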

There are other users of sysctl kern.file which assume it does not prepend 
this address, so it would be wrong to change that.  Would it also be wrong 
to change kvm to supply NULL as the first address?

Other inconsistencies include live kernels returning struct xfile and dead 
kernels returning struct file.  The interface in kvm_getfiles() claims to 
return struct file.  I can't imagine any code actually relies on this 
routine.

Any opinions on what we should do with this?  It has been broken since 
2002 at least.  I'm committing changes for my lockless struct file work. 
As part of that I'll commit a broken but compiling implementation that 
matches current bugs but causes the code to fail whenever it is called.

Cheers,
Jeff