Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 07 Dec 2012 19:54:22 +0100
From:      Andre Oppermann <andre@freebsd.org>
To:        Alfred Perlstein <bright@mu.org>
Cc:        "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, Alfred Perlstein <alfred@FreeBSD.org>, "src-committers@freebsd.org" <src-committers@freebsd.org>
Subject:   Re: svn commit: r243594 - head/sys/netinet
Message-ID:  <50C23B5E.3020509@freebsd.org>
In-Reply-To: <AC499CCF-3812-4CA9-929C-05AA4C39F223@mu.org>
References:  <201211270304.qAR34PfD056691@svn.freebsd.org> <50B53ABC.1010304@freebsd.org> <50B57F46.1060207@mu.org> <50C205DC.1040701@freebsd.org> <AC499CCF-3812-4CA9-929C-05AA4C39F223@mu.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On 07.12.2012 18:31, Alfred Perlstein wrote:
> It's good to put the power or two thing in its own function.
>
> That said, about the comment changes,  hash size changes and auto tuning changes respectively....
>
> Comments: Dude... It's not like removing the helpful comments is going to speed up either compiling this code or how fast it can process packets.  Please leave those in.

The comment sentence starting with "previously" is not useful anymore.
Nobody cares what is was before, it's no longer relevant.  That's why
such comments belong only into commit messages.

> The hash: Your comment about it not being perfect is incorrect. By making it 1/8 you guarantee that it can never be perfect when fully loaded. If it is so important for you to guarantee that the minimum hash traversal on this hash is at LEAST FOUR when fully loaded then by all means make the change.  I think it's unbelievably wrong, penny wise/dollar foolish or maybe bit wise/CPU foolish but for some reason others agree with you for reasons that still do not make sense to me. So again by all means pessimize the hash table to save a few bytes.

It's a bit more than a few bytes.  On my modest dev box the TCBHASH at 1/8 would
be 256kB times two for TCBHASH and PORTHASH.  The hash will never be perfect
as there isn't a perfect hash function for this because the input is unbounded.

If you have more than 32k concurrent connections on a modest server then there
may be a few hash collisions.  That's fine.  So far we've managed to survive
with a hash table of only 512 slots.  Having a small chain on a hash slot
isn't bad.  Those with > 32k concurrent connections tend to have a lot more
RAM.  At 64GB the hash table will be 128k entries (2MB times two).  If you
have more than 128k concurrent connections you either specifically tune your
kernel anyway, or it is slightly less efficient than it possibly could be.

The hash table size doesn't limit the number of concurrent connections.  Only
maxsockets does that.  With my other changes the maxsockets limit is massively
increased (1/8 of physpages) compared to when it was dependent on maxusers.

My point is that for the vast majority a smaller hash table (maxsockets / 8)
is entirely sufficient and certainly better than the previous default of 512.
Anything more than that is IMHO excessive for the vast majority of users.
While a megabyte isn't that much we don't want to waste it too easily either.
There's no need to swing from one extreme to another.

> The tuning: Additionally you've removed the informational output that shows what the input was and what it was changed to. It's as if you want to make it harder for my techs to figure out what had gone wrong.

It still says "WARNING: TCB hash size not a power of 2, rounded down".
If the admin can't figure out why his value isn't a power of 2 then all
bets are off.  You wouldn't tell him either with your message.  He gets
a warning message and told what happened.  From there he has to look it
up in /boot/loader.conf anyway.

> Tuning 2: Additionally it appears you've removed the safety net in this code for clipping it down to min 512.

Nope, it's there: powerof2rdfloor(tcp_tcbhashsize, TCBMINHASHSIZE)

> I don't get why this is such an issue, you do realize this code is run only once and is not in the critical path?  We should be optimizing those code for utility, user-friendliness and general readability not as if it was part of TCP's fast path.

User friendliness would be to either not have to worry about this at
all or to update the related and relevant man pages like tuning(7).
Users are unlikely to even find the right place to look for in the
source code.

-- 
Andre

> Sent from my iPhone
>
> On Dec 7, 2012, at 7:06 AM, Andre Oppermann <andre@freebsd.org> wrote:
>
>> On 28.11.2012 04:04, Alfred Perlstein wrote:
>>> On 11/27/12 2:12 PM, Andre Oppermann wrote:
>>>> On 27.11.2012 04:04, Alfred Perlstein wrote:
>>>>> Author: alfred
>>>>> Date: Tue Nov 27 03:04:24 2012
>>>>> New Revision: 243594
>>>>> URL: http://svnweb.freebsd.org/changeset/base/243594
>>>>>
>>>>> Log:
>>>>>    Auto size the tcbhashsize structure based on max sockets.
>>>>>
>>>>>    While here, also make the code that enforces power-of-two more
>>>>>    forgiving, instead of just resetting to 512, graciously round-down
>>>>>    to the next lower power of two.
>>>>
>>>> The porthash (also a parameter to in_pcbinfo_init()) should not be
>>>> scaled the same as tcbhashsize.  It's unlikely that anyone is going
>>>> to have more than a thousand listen ports.
>>>>
>>>> The default tcbhashsize scaling factor of maxsockets / 4 is IMHO
>>>> excessive.  If someone is running at the limit he's going to up
>>>> maxsockets to double the target load.  Dividing by 8 is a better
>>>> balance for normal and well used large memory machines.
>>>> The hashsize calculation logic is done a bit in a roundabout way.
>>>>
>>>> What do think of the attached patch to fix the porthash and to simplify
>>>> the hashsize logic?
>>> I like the porthash fix.
>>
>> After tracking down how the porthash works in in_pcb.c it became
>> apparent that every connection ends up there too.  So in the end
>> the port hash should be the same size as the main hash for now.
>> Having every connection on the port hash doesn't make a lot of
>> sense, at least for TCP.  Only ports from listen sockets should
>> end up there.  Changing that seems to be non-trivial though.
>>
>>> However your change:
>>> 1) removes explanation of what the code is doing. (comments/clarity)
>>
>> Added your comments back in, in a more condensed form.  The comment
>> about the perfect hash is nice but applies to every hash usage and
>> is in every book about data structures.  The problem with the inpcb
>> hash is that it can never be perfect because the input in not known.
>> Even with an 1:1 mapping of maxsockets to the inpcb hash there will
>> be collisions based on pure chance since the hash is derived from
>> srcip:dstip:srcport:dstport.
>>
>>> 2) removes the more forgiving code that rounds-down the tcp_tcbhashsize.
>>>
>>> Can you please maintain the comments (probably keeping the logic of maketcp_hashsize() in a separate
>>> function).
>>
>> We have a quite a number of places where this power-of-2 requirement
>> applies.  I think it's better to add a generic macro for this rounding
>> instead of re-inventing the wheel every time we have the same situation.
>>
>> The comment about being more forgiving is sufficient in a commit
>> message as the reader of the code wouldn't know that it previously
>> was clamped down to TCBMINHASHSIZE (after printing a warning though).
>>
>>> Can you please maintain the more forgiving code for dealing with non-power of 2 tunable?  I like
>>> this, it makes it easier for admins and less error prone.
>>>
>>> -       if (!powerof2(hashsize)) {
>>> -               int oldhashsize = hashsize;
>>>
>>> -               hashsize = maketcp_hashsize(hashsize);
>>> -               /* prevent absurdly low value */
>>> -               if (hashsize < 16)
>>> -                       hashsize = 16;
>>> -               printf("%s: WARNING: TCB hash size not a power of 2, "
>>> -                   "clipped from %d to %d.\n", __func__, oldhashsize,
>>> -                   hashsize);
>>
>> Please see revised patch below.
>>
>> --
>> Andre
>>
>> Index: netinet/tcp_subr.c
>> ===================================================================
>> --- netinet/tcp_subr.c    (revision 243985)
>> +++ netinet/tcp_subr.c    (working copy)
>> @@ -229,13 +229,13 @@
>>             void *ip4hdr, const void *ip6hdr);
>>
>> /*
>> - * Target size of TCP PCB hash tables. Must be a power of two.
>> + * Minimal size of TCP PCB hash tables. Must be a power of two.
>>   *
>>   * Note that this can be overridden by the kernel environment
>>   * variable net.inet.tcp.tcbhashsize
>>   */
>> -#ifndef TCBHASHSIZE
>> -#define TCBHASHSIZE    0
>> +#ifndef TCBMINHASHSIZE
>> +#define TCBMINHASHSIZE    512
>> #endif
>>
>> /*
>> @@ -282,35 +282,10 @@
>>     return (0);
>> }
>>
>> -/*
>> - * Take a value and get the next power of 2 that doesn't overflow.
>> - * Used to size the tcp_inpcb hash buckets.
>> - */
>> -static int
>> -maketcp_hashsize(int size)
>> -{
>> -    int hashsize;
>> -
>> -    /*
>> -     * auto tune.
>> -     * get the next power of 2 higher than maxsockets.
>> -     */
>> -    hashsize = 1 << fls(size);
>> -    /* catch overflow, and just go one power of 2 smaller */
>> -    if (hashsize < size) {
>> -        hashsize = 1 << (fls(size) - 1);
>> -    }
>> -    return (hashsize);
>> -}
>> -
>> void
>> tcp_init(void)
>> {
>> -    const char *tcbhash_tuneable;
>> -    int hashsize;
>>
>> -    tcbhash_tuneable = "net.inet.tcp.tcbhashsize";
>> -
>>     if (hhook_head_register(HHOOK_TYPE_TCP, HHOOK_TCP_EST_IN,
>>         &V_tcp_hhh[HHOOK_TCP_EST_IN], HHOOK_NOWAIT|HHOOK_HEADISINVNET) != 0)
>>         printf("%s: WARNING: unable to register helper hook\n", __func__);
>> @@ -318,48 +293,21 @@
>>         &V_tcp_hhh[HHOOK_TCP_EST_OUT], HHOOK_NOWAIT|HHOOK_HEADISINVNET) != 0)
>>         printf("%s: WARNING: unable to register helper hook\n", __func__);
>>
>> -    hashsize = TCBHASHSIZE;
>> -    TUNABLE_INT_FETCH(tcbhash_tuneable, &hashsize);
>> -    if (hashsize == 0) {
>> -        /*
>> -         * Auto tune the hash size based on maxsockets.
>> -         * A perfect hash would have a 1:1 mapping
>> -         * (hashsize = maxsockets) however it's been
>> -         * suggested that O(2) average is better.
>> -         */
>> -        hashsize = maketcp_hashsize(maxsockets / 4);
>> -        /*
>> -         * Our historical default is 512,
>> -         * do not autotune lower than this.
>> -         */
>> -        if (hashsize < 512)
>> -            hashsize = 512;
>> -        if (bootverbose)
>> -            printf("%s: %s auto tuned to %d\n", __func__,
>> -                tcbhash_tuneable, hashsize);
>> -    }
>>     /*
>> -     * We require a hashsize to be a power of two.
>> -     * Previously if it was not a power of two we would just reset it
>> -     * back to 512, which could be a nasty surprise if you did not notice
>> -     * the error message.
>> -     * Instead what we do is clip it to the closest power of two lower
>> -     * than the specified hash value.
>> +     * Auto tune the hash size based on maxsockets.
>> +     * It is set to 1/8 of maxsockets and rounded down
>> +     * to the next power of two with a lower bound of
>> +     * TCBMINHASHSIZE.
>>      */
>> -    if (!powerof2(hashsize)) {
>> -        int oldhashsize = hashsize;
>> -
>> -        hashsize = maketcp_hashsize(hashsize);
>> -        /* prevent absurdly low value */
>> -        if (hashsize < 16)
>> -            hashsize = 16;
>> -        printf("%s: WARNING: TCB hash size not a power of 2, "
>> -            "clipped from %d to %d.\n", __func__, oldhashsize,
>> -            hashsize);
>> +    tcp_tcbhashsize = powerof2rdown(maxsockets / 8);
>> +    TUNABLE_INT_FETCH("net.inet.tcp.tcbhashsize", &tcp_tcbhashsize);
>> +    if (!powerof2(tcp_tcbhashsize)) {
>> +        tcp_tcbhashsize = powerof2rdfloor(tcp_tcbhashsize, TCBMINHASHSIZE);
>> +        printf("WARNING: TCB hash size not a power of 2, rounded down\n");
>>     }
>> -    in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, hashsize, hashsize,
>> -        "tcp_inpcb", tcp_inpcb_init, NULL, UMA_ZONE_NOFREE,
>> -        IPI_HASHFIELDS_4TUPLE);
>> +    in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, tcp_tcbhashsize,
>> +        tcp_tcbhashsize, "tcp_inpcb", tcp_inpcb_init, NULL,
>> +        UMA_ZONE_NOFREE, IPI_HASHFIELDS_4TUPLE);
>>
>>     /*
>>      * These have to be type stable for the benefit of the timers.
>> @@ -393,7 +341,6 @@
>>         tcp_rexmit_min = 1;
>>     tcp_rexmit_slop = TCPTV_CPU_VAR;
>>     tcp_finwait2_timeout = TCPTV_FINWAIT2_TIMEOUT;
>> -    tcp_tcbhashsize = hashsize;
>>
>>     TUNABLE_INT_FETCH("net.inet.tcp.soreceive_stream", &tcp_soreceive_stream);
>>     if (tcp_soreceive_stream) {
>> Index: sys/param.h
>> ===================================================================
>> --- sys/param.h    (revision 243985)
>> +++ sys/param.h    (working copy)
>> @@ -284,6 +284,9 @@
>> #define    MIN(a,b) (((a)<(b))?(a):(b))
>> #define    MAX(a,b) (((a)>(b))?(a):(b))
>>
>> +#define    powerof2rdown(x)    (powerof2(x) ? (x) : (0x1 << fls((x) - 1) - 1))
>> +#define    powerof2rdfloor(x, y)    (MAX(powerof2rdown(x), (y)))
>> +
>> #ifdef _KERNEL
>> /*
>>   * Basic byte order function prototypes for non-inline functions.
>>
>
>




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?50C23B5E.3020509>