From: Andre Oppermann
To: Julian Elischer
Cc: FreeBSD Net
Date: Mon, 27 Sep 2010 20:33:38 +0200
Subject: Re: mbuf changes
Message-ID: <4CA0E382.90101@freebsd.org>
In-Reply-To: <4CA0C2A3.7000508@freebsd.org>

On 27.09.2010 18:13, Julian Elischer wrote:
> On 9/27/10 6:09 AM, Andre Oppermann wrote:
>> On 26.09.2010 08:32, Julian Elischer wrote:
>>> On 9/25/10 1:20 AM, Andre Oppermann wrote:
>>>> On 25.09.2010 09:19, Julian Elischer wrote:
>>>>> * dynamically working out what the front padding size should be,
>>>>> per session, i.e. when a packet is sent out and needs to be
>>>>> adjusted to add more headers, the originating socket should be
>>>>> notified, or maybe the route should have this information, so
>>>>> that future packets can start out with enough head room.
>>>>> (This is not strictly to do with mbufs, but might need an added
>>>>> field to point to the structure that needs to be updated.)
>>>>
>>>> We already have "max_linkhdr", which specifies how much space is
>>>> left for prepends at the start of each packet.  The link protocols
>>>> set it, and IPSec also adds itself in there if enabled.  If you
>>>> have other encapsulations you should make them add in there as
>>>> well.
>>>
>>> this doesn't take into account tunneling and encapsulation.
>>
>> It should/could, but the tunneling and encapsulation protocols have
>> to add themselves to it when active.  IPSec does this.
>
> yes, but the trouble is that every packet is then given a worst-case
> reserved area at the front

Yes, but so what?  We've got the space in the mbuf anyway.  Right now
it lies unused at the end.  See below for a more detailed explanation.

  <----------mbuf---------->
  ppdddddddddd............      now
  pppppppppdddddddddd.....      with large prepend area

  p = prepend
  d = data
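To illustrate, here is a rough sketch (untested) of the idiom; it is
roughly what tcp_output() already does with max_linkhdr.  The function
names proto_alloc_pkthdr() and encap_prepend() are made up for the
example; only max_linkhdr, MGETHDR() and M_PREPEND() are the real
interfaces:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Allocate a packet header mbuf and leave the worst-case prepend
 * area unused at the front, so that later encapsulation layers can
 * prepend their headers without a new allocation.
 */
struct mbuf *
proto_alloc_pkthdr(void)
{
	struct mbuf *m;

	MGETHDR(m, M_DONTWAIT, MT_DATA);
	if (m == NULL)
		return (NULL);
	m->m_data += max_linkhdr;
	return (m);
}

/*
 * An encapsulation layer then simply prepends; with the leading
 * space already reserved this normally adjusts m_data and m_len
 * without allocating.  M_PREPEND() sets m to NULL on failure.
 */
struct mbuf *
encap_prepend(struct mbuf *m, int hdrlen)
{

	M_PREPEND(m, hdrlen, M_DONTWAIT);
	return (m);
}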
>>> we could do a lot better than this.
>>> especially on a per-route basis.
>>> if the first mbuf in a session had a pointer to the relevant
>>> rtentry, then as it is processed that could be updated..
>>
>> Please please please don't add a rtentry pointer to the mbuf.
>> Besides that, the routing table is a very poor place to do this.
>> We don't have host routes anymore, and the locking and refcounting
>> are rather expensive.
>
> yes, but we do have a route cache
> (and we probably should still have some form of host routes, but
> that's a different issue, not to be argued here.)

We have the hostcache (which needs some revisiting).

>> max_linkhdr should be sufficient (with small fixes to some protocol
>> mbuf allocators) even for excessive cases of encapsulation:
>
> max_linkhdr is way too big for 99% of all packets.

That doesn't matter in practice.  We have a very binary distribution
for the packets, and the space in the mbuf is there anyway; today it
is simply not used.  We tend to have small packets (TCP ACKs, for
example) and large packets at around MTU (bulk data transfers).  For
normal mbufs (256 bytes) the header and lots of encapsulation fit.
For mbuf clusters (2 Kbytes) there is plenty of space too.  For
packets in between, which currently may have fit into a normal mbuf,
we may have to switch to allocating a cluster earlier.  That's no
biggie though: it doesn't happen often, is not much overhead, and
occurs only with excessive encapsulation.  Unless you can demonstrate
a realistic case where the encapsulation overhead with a large
max_linkhdr is actually causing a measurable pessimization, I'd say
the complexity of the mechanism you propose is not justified.

>> TCP over IPv4 over IPSec(AH+ESP) over UDP over IPv6 over PPPoE over
>> Ethernet = 60 + 20 + (8+24) + 8 + 40 + 8 + 14 = 182 bytes total, of
>> which 102 are prepends.

I forgot MPLS; add another 4 bytes. ;-)

For 32-bit machines (60-byte mbuf headers) this fits just fine.  For
64-bit machines (84-byte mbuf headers) it fits a TCP ACK just fine.

>> Maybe we need an API for the tunneling and encapsulation protocols
>> to add their overhead to max_linkhdr.
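Something like this, perhaps (sketch only; the function name is
invented for illustration, and the derived values are recomputed the
same way domaininit() already computes them):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Hypothetical registration hook: a tunneling or encapsulation
 * protocol announces its per-packet prepend overhead once, instead
 * of each one poking max_linkhdr directly.
 */
void
encap_add_linkhdr_overhead(int bytes)
{

	max_linkhdr += bytes;
	/* Keep the derived values consistent, as domaininit() does. */
	max_hdr = max_linkhdr + max_protohdr;
	max_datalen = MHLEN - max_hdr;
}

A PPPoE or MPLS attach routine would then call, for example,
encap_add_linkhdr_overhead(8) or encap_add_linkhdr_overhead(4) when
it comes up, and every subsequently allocated packet starts out with
enough head room.

-- 
Andre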