From owner-freebsd-net@FreeBSD.ORG Tue Nov 13 08:06:30 2012
Message-ID: <50A1FF80.3040900@networx.ch>
Date: Tue, 13 Nov 2012 09:06:24 +0100
From: Andre Oppermann <oppermann@networx.ch>
To: Alfred Perlstein
Subject: Re: auto tuning tcp
In-Reply-To: <50A1EC92.9000507@mu.org>
Cc: "freebsd-net@freebsd.org", Adrian Chadd, Peter Wemm
List-Id: Networking and TCP/IP with FreeBSD

On 13.11.2012 07:45, Alfred Perlstein wrote:
> On 11/12/12 10:23 PM, Peter Wemm wrote:
>> On Mon, Nov 12, 2012 at 10:11 PM, Alfred Perlstein wrote:
>>> On 11/12/12 10:04 PM, Alfred Perlstein wrote:
>>>> On 11/12/12 10:48 AM, Alfred Perlstein wrote:
>>>>> On 11/12/12 10:01 AM, Andre Oppermann wrote:
>>>>>>
>>>>>> I've already
>>>>>> added the tunable "kern.maxmbufmem" which is in pages.
>>>>>> That's probably not very convenient to work with. I can change it
>>>>>> to a percentage of phymem/kva. Would that make you happy?
>>>>>>
>>>>> It really makes sense to have the hash table be some relation to sockets
>>>>> rather than buffers.
>>>>>
>>>>> If you are hashing "foo-objects" you want the hash to be some relation to
>>>>> the max amount of "foo-objects" you'll see, not backwards derived from the
>>>>> number of "bar-objects" that "foo-objects" contain, right?
>>>>>
>>>>> Because we are hashing the sockets, right? not clusters.
>>>>>
>>>>> Maybe I'm wrong? I'm open to ideas.
>>>>
>>>> Hey Andre, the following patch is what I was thinking
>>>> (uncompiled/untested), it basically rounds up the maxsockets to a
>>>> power of 2 and replaces the default 512 tcb hashsize.
>>>>
>>>> It might make sense to make the auto-tuning default to a minimum of 512.
>>>>
>>>> There are a number of other hashes with static sizes that could make
>>>> use of this logic provided it's not upside-down.
>>>>
>>>> Any thoughts on this?
>>>>
>>>> Tune the tcp pcb hash based on maxsockets.
>>>> Be more forgiving of poorly chosen tunables by finding a closer power
>>>> of two rather than clamping down to 512.
>>>> Index: tcp_subr.c
>>>> ===================================================================
>>>
>>> Sorry, GUI mangled the patch... attaching a plain text version.
>>>
>> Wait, you want to replace a hash with a flat array? Why even bother
>> to call it a hash at that point?
>>
> If you are concerned about the space/time tradeoff I'm pretty happy
> with making it 1/2, 1/4th, 1/8th the size of maxsockets. (smaller?)
>
> Would that work better?

I'd go for 1/8 or even 1/16 with a lower bound of 512. More than that
is excessive.
> The reason I chose to make it equal to max sockets was a space/time
> tradeoff, ideally a hash should have zero collisions and if a user has
> enough memory for 250,000 sockets, then surely they have enough memory
> for 256,000 pointers.

I agree in general. Though not all large-memory servers serve a large
number of connections. We have to find a tradeoff here. Having a
perfect hash would certainly be laudable. As long as the average hash
chain doesn't go beyond a few entries it's not a problem.

> If you strongly disagree then I am fine with a more conservative
> setting, just note that effectively the hash table will require 1/2 the
> factor that we go smaller in additional traversals when we max out the
> number of sockets. Meaning if the table is 1/4 the size of max sockets,
> when we hit that many tcp connections I think we'll see an order of
> average 2 linked list traversals to find a node. At 1/8, then that
> number becomes 4.

I'm fine with that, and I'd claim that if you expect N sockets you
would also increase maxfiles/maxsockets to N*2 to have some headroom.

> I recall back in 2001 on a PII400 with a custom webserver I wrote
> having a huge benefit by upping this to 2^14 or maybe even 2^16, I
> forget, but suddenly my CPU went down a huge amount and I didn't have
> to worry about a load balancer or other tricks.

I can certainly believe that. A hash size of 512 is no good if you
have more than 4K connections.

PS: Please note that my patch for mbuf and maxfiles tuning is not yet
in HEAD; it's still sitting in my tcp_workqueue branch. I still have
to search for derived values that may get totally out of whack with
the new scaling scheme.

-- 
Andre