Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 7 Dec 2012 09:31:34 -0800
From:      Alfred Perlstein <bright@mu.org>
To:        Andre Oppermann <andre@freebsd.org>
Cc:        "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, Alfred Perlstein <alfred@FreeBSD.org>, "src-committers@freebsd.org" <src-committers@freebsd.org>
Subject:   Re: svn commit: r243594 - head/sys/netinet
Message-ID:  <AC499CCF-3812-4CA9-929C-05AA4C39F223@mu.org>
In-Reply-To: <50C205DC.1040701@freebsd.org>
References:  <201211270304.qAR34PfD056691@svn.freebsd.org> <50B53ABC.1010304@freebsd.org> <50B57F46.1060207@mu.org> <50C205DC.1040701@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
It's good to put the power or two thing in its own function.=20

That said, about the comment changes,  hash size changes and auto tuning cha=
nges respectively....

Comments: Dude... It's not like removing the helpful comments is going to sp=
eed up either compiling this code or how fast it can process packets.  Pleas=
e leave those in.=20

The hash: Your comment about it not being perfect is incorrect. By making it=
 1/8 you guarantee that it can never be perfect when fully loaded. If it is s=
o important for you to guarantee that the minimum hash traversal on this has=
h is at LEAST FOUR when fully loaded then by all means make the change.  I t=
hink it's unbelievably wrong, penny wise/dollar foolish or maybe bit wise/CP=
U foolish but for some reason others agree with you for reasons that still d=
o not make sense to me. So again by all means pessimize the hash table to sa=
ve a few bytes.=20

The tuning: Additionally you've removed the informational output that shows w=
hat the input was and what it was changed to. It's as if you want to make it=
 harder for my techs to figure out what had gone wrong.=20

Tuning 2: Additionally it appears you've removed the safety net in this code=
 for clipping it down to min 512.

I don't get why this is such an issue, you do realize this code is run only o=
nce and is not in the critical path?  We should be optimizing those code for=
 utility, user-friendliness and general readability not as if it was part of=
 TCP's fast path.=20


Sent from my iPhone

On Dec 7, 2012, at 7:06 AM, Andre Oppermann <andre@freebsd.org> wrote:

> On 28.11.2012 04:04, Alfred Perlstein wrote:
>> On 11/27/12 2:12 PM, Andre Oppermann wrote:
>>> On 27.11.2012 04:04, Alfred Perlstein wrote:
>>>> Author: alfred
>>>> Date: Tue Nov 27 03:04:24 2012
>>>> New Revision: 243594
>>>> URL: http://svnweb.freebsd.org/changeset/base/243594
>>>>=20
>>>> Log:
>>>>   Auto size the tcbhashsize structure based on max sockets.
>>>>=20
>>>>   While here, also make the code that enforces power-of-two more
>>>>   forgiving, instead of just resetting to 512, graciously round-down
>>>>   to the next lower power of two.
>>>=20
>>> The porthash (also a parameter to in_pcbinfo_init()) should not be
>>> scaled the same as tcbhashsize.  It's unlikely that anyone is going
>>> to have more than a thousand listen ports.
>>>=20
>>> The default tcbhashsize scaling factor of maxsockets / 4 is IMHO
>>> excessive.  If someone is running at the limit he's going to up
>>> maxsockets to double the target load.  Dividing by 8 is a better
>>> balance for normal and well used large memory machines.
>>> The hashsize calculation logic is done a bit in a roundabout way.
>>>=20
>>> What do think of the attached patch to fix the porthash and to simplify
>>> the hashsize logic?
>> I like the porthash fix.
>=20
> After tracking down how the porthash works in in_pcb.c it became
> apparent that every connection ends up there too.  So in the end
> the port hash should be the same size as the main hash for now.
> Having every connection on the port hash doesn't make a lot of
> sense, at least for TCP.  Only ports from listen sockets should
> end up there.  Changing that seems to be non-trivial though.
>=20
>> However your change:
>> 1) removes explanation of what the code is doing. (comments/clarity)
>=20
> Added your comments back in, in a more condensed form.  The comment
> about the perfect hash is nice but applies to every hash usage and
> is in every book about data structures.  The problem with the inpcb
> hash is that it can never be perfect because the input in not known.
> Even with an 1:1 mapping of maxsockets to the inpcb hash there will
> be collisions based on pure chance since the hash is derived from
> srcip:dstip:srcport:dstport.
>=20
>> 2) removes the more forgiving code that rounds-down the tcp_tcbhashsize.
> >
>> Can you please maintain the comments (probably keeping the logic of maket=
cp_hashsize() in a separate
>> function).
>=20
> We have a quite a number of places where this power-of-2 requirement
> applies.  I think it's better to add a generic macro for this rounding
> instead of re-inventing the wheel every time we have the same situation.
>=20
> The comment about being more forgiving is sufficient in a commit
> message as the reader of the code wouldn't know that it previously
> was clamped down to TCBMINHASHSIZE (after printing a warning though).
>=20
>> Can you please maintain the more forgiving code for dealing with non-powe=
r of 2 tunable?  I like
>> this, it makes it easier for admins and less error prone.
>>=20
>> -       if (!powerof2(hashsize)) {
>> -               int oldhashsize =3D hashsize;
>>=20
>> -               hashsize =3D maketcp_hashsize(hashsize);
>> -               /* prevent absurdly low value */
>> -               if (hashsize < 16)
>> -                       hashsize =3D 16;
>> -               printf("%s: WARNING: TCB hash size not a power of 2, "
>> -                   "clipped from %d to %d.\n", __func__, oldhashsize,
>> -                   hashsize);
>=20
> Please see revised patch below.
>=20
> --=20
> Andre
>=20
> Index: netinet/tcp_subr.c
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- netinet/tcp_subr.c    (revision 243985)
> +++ netinet/tcp_subr.c    (working copy)
> @@ -229,13 +229,13 @@
>            void *ip4hdr, const void *ip6hdr);
>=20
> /*
> - * Target size of TCP PCB hash tables. Must be a power of two.
> + * Minimal size of TCP PCB hash tables. Must be a power of two.
>  *
>  * Note that this can be overridden by the kernel environment
>  * variable net.inet.tcp.tcbhashsize
>  */
> -#ifndef TCBHASHSIZE
> -#define TCBHASHSIZE    0
> +#ifndef TCBMINHASHSIZE
> +#define TCBMINHASHSIZE    512
> #endif
>=20
> /*
> @@ -282,35 +282,10 @@
>    return (0);
> }
>=20
> -/*
> - * Take a value and get the next power of 2 that doesn't overflow.
> - * Used to size the tcp_inpcb hash buckets.
> - */
> -static int
> -maketcp_hashsize(int size)
> -{
> -    int hashsize;
> -
> -    /*
> -     * auto tune.
> -     * get the next power of 2 higher than maxsockets.
> -     */
> -    hashsize =3D 1 << fls(size);
> -    /* catch overflow, and just go one power of 2 smaller */
> -    if (hashsize < size) {
> -        hashsize =3D 1 << (fls(size) - 1);
> -    }
> -    return (hashsize);
> -}
> -
> void
> tcp_init(void)
> {
> -    const char *tcbhash_tuneable;
> -    int hashsize;
>=20
> -    tcbhash_tuneable =3D "net.inet.tcp.tcbhashsize";
> -
>    if (hhook_head_register(HHOOK_TYPE_TCP, HHOOK_TCP_EST_IN,
>        &V_tcp_hhh[HHOOK_TCP_EST_IN], HHOOK_NOWAIT|HHOOK_HEADISINVNET) !=3D=
 0)
>        printf("%s: WARNING: unable to register helper hook\n", __func__);
> @@ -318,48 +293,21 @@
>        &V_tcp_hhh[HHOOK_TCP_EST_OUT], HHOOK_NOWAIT|HHOOK_HEADISINVNET) !=3D=
 0)
>        printf("%s: WARNING: unable to register helper hook\n", __func__);
>=20
> -    hashsize =3D TCBHASHSIZE;
> -    TUNABLE_INT_FETCH(tcbhash_tuneable, &hashsize);
> -    if (hashsize =3D=3D 0) {
> -        /*
> -         * Auto tune the hash size based on maxsockets.
> -         * A perfect hash would have a 1:1 mapping
> -         * (hashsize =3D maxsockets) however it's been
> -         * suggested that O(2) average is better.
> -         */
> -        hashsize =3D maketcp_hashsize(maxsockets / 4);
> -        /*
> -         * Our historical default is 512,
> -         * do not autotune lower than this.
> -         */
> -        if (hashsize < 512)
> -            hashsize =3D 512;
> -        if (bootverbose)
> -            printf("%s: %s auto tuned to %d\n", __func__,
> -                tcbhash_tuneable, hashsize);
> -    }
>    /*
> -     * We require a hashsize to be a power of two.
> -     * Previously if it was not a power of two we would just reset it
> -     * back to 512, which could be a nasty surprise if you did not notice=

> -     * the error message.
> -     * Instead what we do is clip it to the closest power of two lower
> -     * than the specified hash value.
> +     * Auto tune the hash size based on maxsockets.
> +     * It is set to 1/8 of maxsockets and rounded down
> +     * to the next power of two with a lower bound of
> +     * TCBMINHASHSIZE.
>     */
> -    if (!powerof2(hashsize)) {
> -        int oldhashsize =3D hashsize;
> -
> -        hashsize =3D maketcp_hashsize(hashsize);
> -        /* prevent absurdly low value */
> -        if (hashsize < 16)
> -            hashsize =3D 16;
> -        printf("%s: WARNING: TCB hash size not a power of 2, "
> -            "clipped from %d to %d.\n", __func__, oldhashsize,
> -            hashsize);
> +    tcp_tcbhashsize =3D powerof2rdown(maxsockets / 8);
> +    TUNABLE_INT_FETCH("net.inet.tcp.tcbhashsize", &tcp_tcbhashsize);
> +    if (!powerof2(tcp_tcbhashsize)) {
> +        tcp_tcbhashsize =3D powerof2rdfloor(tcp_tcbhashsize, TCBMINHASHSI=
ZE);
> +        printf("WARNING: TCB hash size not a power of 2, rounded down\n")=
;
>    }
> -    in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, hashsize, hashsize,
> -        "tcp_inpcb", tcp_inpcb_init, NULL, UMA_ZONE_NOFREE,
> -        IPI_HASHFIELDS_4TUPLE);
> +    in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, tcp_tcbhashsize,
> +        tcp_tcbhashsize, "tcp_inpcb", tcp_inpcb_init, NULL,
> +        UMA_ZONE_NOFREE, IPI_HASHFIELDS_4TUPLE);
>=20
>    /*
>     * These have to be type stable for the benefit of the timers.
> @@ -393,7 +341,6 @@
>        tcp_rexmit_min =3D 1;
>    tcp_rexmit_slop =3D TCPTV_CPU_VAR;
>    tcp_finwait2_timeout =3D TCPTV_FINWAIT2_TIMEOUT;
> -    tcp_tcbhashsize =3D hashsize;
>=20
>    TUNABLE_INT_FETCH("net.inet.tcp.soreceive_stream", &tcp_soreceive_strea=
m);
>    if (tcp_soreceive_stream) {
> Index: sys/param.h
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- sys/param.h    (revision 243985)
> +++ sys/param.h    (working copy)
> @@ -284,6 +284,9 @@
> #define    MIN(a,b) (((a)<(b))?(a):(b))
> #define    MAX(a,b) (((a)>(b))?(a):(b))
>=20
> +#define    powerof2rdown(x)    (powerof2(x) ? (x) : (0x1 << fls((x) - 1) -=
 1))
> +#define    powerof2rdfloor(x, y)    (MAX(powerof2rdown(x), (y)))
> +
> #ifdef _KERNEL
> /*
>  * Basic byte order function prototypes for non-inline functions.
>=20



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?AC499CCF-3812-4CA9-929C-05AA4C39F223>