From owner-freebsd-net@FreeBSD.ORG Tue Nov 13 08:06:30 2012
Message-ID: <50A1FF80.3040900@networx.ch>
Date: Tue, 13 Nov 2012 09:06:24 +0100
From: Andre Oppermann <oppermann@networx.ch>
To: Alfred Perlstein
Subject: Re: auto tuning tcp
In-Reply-To: <50A1EC92.9000507@mu.org>
Cc: "freebsd-net@freebsd.org", Adrian Chadd, Peter Wemm
List-Id: Networking and TCP/IP with FreeBSD

On 13.11.2012 07:45, Alfred Perlstein wrote:
> On 11/12/12 10:23 PM, Peter Wemm wrote:
>> On Mon, Nov 12, 2012 at 10:11 PM, Alfred Perlstein wrote:
>>> On 11/12/12 10:04 PM, Alfred Perlstein wrote:
>>>> On 11/12/12 10:48 AM, Alfred Perlstein wrote:
>>>>> On 11/12/12 10:01 AM, Andre Oppermann wrote:
>>>>>>
>>>>>> I've already
>>>>>> added the tunable "kern.maxmbufmem" which is in pages.
>>>>>> That's probably not very convenient to work with. I can change it
>>>>>> to a percentage of phymem/kva. Would that make you happy?
>>>>>>
>>>>> It really makes sense to have the hash table be some relation to sockets
>>>>> rather than buffers.
>>>>>
>>>>> If you are hashing "foo-objects" you want the hash to be some relation to
>>>>> the max amount of "foo-objects" you'll see, not backwards derived from the
>>>>> number of "bar-objects" that "foo-objects" contain, right?
>>>>>
>>>>> Because we are hashing the sockets, right? not clusters.
>>>>>
>>>>> Maybe I'm wrong? I'm open to ideas.
>>>>
>>>> Hey Andre, the following patch is what I was thinking
>>>> (uncompiled/untested), it basically rounds up the maxsockets to a
>>>> power of 2 and replaces the default 512 tcb hashsize.
>>>>
>>>> It might make sense to make the auto-tuning default to a minimum of 512.
>>>>
>>>> There are a number of other hashes with static sizes that could make
>>>> use of this logic provided it's not upside-down.
>>>>
>>>> Any thoughts on this?
>>>>
>>>> Tune the tcp pcb hash based on maxsockets.
>>>> Be more forgiving of poorly chosen tunables by finding a closer power
>>>> of two rather than clamping down to 512.
>>>> Index: tcp_subr.c
>>>> ===================================================================
>>>
>>> Sorry, GUI mangled the patch... attaching a plain text version.
>>>
>> Wait, you want to replace a hash with a flat array? Why even bother
>> to call it a hash at that point?
>>
> If you are concerned about the space/time tradeoff I'm pretty happy
> with making it 1/2, 1/4th, 1/8th the size of maxsockets. (smaller?)
>
> Would that work better?

I'd go for 1/8 or even 1/16 with a lower bound of 512. More than that
is excessive.
> The reason I chose to make it equal to max sockets was a space/time
> tradeoff, ideally a hash should have zero collisions and if a user has
> enough memory for 250,000 sockets, then surely they have enough memory
> for 256,000 pointers.

I agree in general. Though not all large-memory servers serve a large
number of connections. We have to find a tradeoff here. Having a
perfect hash would certainly be laudable. As long as the average hash
chain doesn't go beyond a few entries it's not a problem.

> If you strongly disagree then I am fine with a more conservative
> setting, just note that effectively the hash table will require 1/2 the
> factor that we go smaller in additional traversals when we max out the
> number of sockets. Meaning if the table is 1/4 the size of max sockets,
> when we hit that many tcp connections I think we'll see an order of
> average 2 linked list traversals to find a node. At 1/8, then that
> number becomes 4.

I'm fine with that, and I'd claim that if you expect N sockets you
would also increase maxfiles/maxsockets to N*2 to have some headroom.

> I recall back in 2001 on a PII400 with a custom webserver I wrote
> having a huge benefit by upping this to 2^14 or maybe even 2^16, I
> forget, but suddenly my CPU went down a huge amount and I didn't have
> to worry about a load balancer or other tricks.

I can certainly believe that. A hash size of 512 is no good if you
have more than 4K connections.

PS: Please note that my patch for mbuf and maxfiles tuning is not yet
in HEAD; it's still sitting in my tcp_workqueue branch. I still have
to search for derived values that may get totally out of whack with
the new scaling scheme.

-- 
Andre