Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 29 Nov 2012 08:45:29 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Alan Cox <alc@rice.edu>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Andre Oppermann <andre@freebsd.org>
Subject:   Re: svn commit: r243631 - in head/sys: kern sys
Message-ID:  <20121129072617.T1123@besplex.bde.org>
In-Reply-To: <50B64BE8.3040708@rice.edu>
References:  <201211272119.qARLJxXV061083@svn.freebsd.org> <50B64BE8.3040708@rice.edu>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 28 Nov 2012, Alan Cox wrote:

> I'm pretty sure that the "realmem" calculation is going to overflow on
> i386/PAE, where the number of bytes of physical memory is greater than
> the type long can represent.

It overflows on i386 even without PAE, where the number of bytes of
physical memory is greater than the type long can represent (2GB). 
This is the usual case for new hardware.

I think it changes the defaults for machines with small amounts of
memory (say 32MB) in dangerous ways.

Reserving half of kva for mbufs is network-centric.  I reserve more
than half of kva for the buffer cache in some configurations
(VM_BCACHE_SIZE_MAX defaults too 200 MB, but I use 512 MB), and in the
old scheme where the default for mbufs was under-sized, this may even
have fitted without separate tuning for mbufs.

The code has lots of style bugs.

> On 11/27/2012 15:19, Andre Oppermann wrote:
>> ...
>>   NB: Every cluster also has an mbuf associated with it.
>>
>>   Two examples on the revised mbuf sizing limits:
>>
>>   1GB KVM:
>>    512MB limit for mbufs
>>    419,430 mbufs
>>     65,536 2K mbuf clusters
>>     32,768 4K mbuf clusters
>>      9,709 9K mbuf clusters
>>      5,461 16K mbuf clusters
>>
>>   16GB RAM:
>>    8GB limit for mbufs
>>    33,554,432 mbufs
>>     1,048,576 2K mbuf clusters
>>       524,288 4K mbuf clusters
>>       155,344 9K mbuf clusters
>>        87,381 16K mbuf clusters
>>
>>   These defaults should be sufficient for even the most demanding
>>   network loads.

I noticed in earlier mail that there were no examples with small amounts
of physical memory.  This case mixes unusually with the relatively large
amount of kva.  PAE gives the opposite extreme.

>> Modified: head/sys/kern/kern_mbuf.c
>> ==============================================================================
>> --- head/sys/kern/kern_mbuf.c	Tue Nov 27 20:22:36 2012	(r243630)
>> +++ head/sys/kern/kern_mbuf.c	Tue Nov 27 21:19:58 2012	(r243631)
>> @@ -147,9 +148,11 @@ sysctl_nmbclusters(SYSCTL_HANDLER_ARGS)
>>  	newnmbclusters = nmbclusters;
>>  	error = sysctl_handle_int(oidp, &newnmbclusters, 0, req);
>>  	if (error == 0 && req->newptr) {
>> -		if (newnmbclusters > nmbclusters) {
>> +		if (newnmbclusters > nmbclusters &&
>> +		    nmbufs >= nmbclusters + nmbjumbop + nmbjumbo9 + nmbjumbo16) {

Style bug: long line.  This bogotifies early splitting in the previous line.

>>  			nmbclusters = newnmbclusters;
>>  			uma_zone_set_max(zone_clust, nmbclusters);
>> +			nmbclusters = uma_zone_get_max(zone_clust);
>>  			EVENTHANDLER_INVOKE(nmbclusters_change);
>>  		} else
>>  			error = EINVAL;
>> @@ -168,9 +171,11 @@ sysctl_nmbjumbop(SYSCTL_HANDLER_ARGS)
>>  	newnmbjumbop = nmbjumbop;
>>  	error = sysctl_handle_int(oidp, &newnmbjumbop, 0, req);
>>  	if (error == 0 && req->newptr) {
>> -		if (newnmbjumbop> nmbjumbop) {
>> +		if (newnmbjumbop > nmbjumbop &&
>> +		    nmbufs >= nmbclusters + nmbjumbop + nmbjumbo9 + nmbjumbo16) {

Style bug, as above.

>>  			nmbjumbop = newnmbjumbop;
>>  			uma_zone_set_max(zone_jumbop, nmbjumbop);
>> +			nmbjumbop = uma_zone_get_max(zone_jumbop);
>>  		} else
>>  			error = EINVAL;
>>  	}
>> @@ -189,9 +194,11 @@ sysctl_nmbjumbo9(SYSCTL_HANDLER_ARGS)
>>  	newnmbjumbo9 = nmbjumbo9;
>>  	error = sysctl_handle_int(oidp, &newnmbjumbo9, 0, req);
>>  	if (error == 0 && req->newptr) {
>> -		if (newnmbjumbo9> nmbjumbo9) {
>> +		if (newnmbjumbo9 > nmbjumbo9&&

Style bug, not as above (no space before '&&').

>> +		    nmbufs >= nmbclusters + nmbjumbop + nmbjumbo9 + nmbjumbo16) {

Style bug, as above.

>>  			nmbjumbo9 = newnmbjumbo9;
>>  			uma_zone_set_max(zone_jumbo9, nmbjumbo9);
>> +			nmbjumbo9 = uma_zone_get_max(zone_jumbo9);
>>  		} else
>>  			error = EINVAL;
>>  	}
>> @@ -209,9 +216,11 @@ sysctl_nmbjumbo16(SYSCTL_HANDLER_ARGS)
>>  	newnmbjumbo16 = nmbjumbo16;
>>  	error = sysctl_handle_int(oidp, &newnmbjumbo16, 0, req);
>>  	if (error == 0 && req->newptr) {
>> -		if (newnmbjumbo16> nmbjumbo16) {
>> +		if (newnmbjumbo16 > nmbjumbo16 &&
>> +		    nmbufs >= nmbclusters + nmbjumbop + nmbjumbo9 + nmbjumbo16) {

Style bug, as above.

>>  			nmbjumbo16 = newnmbjumbo16;
>>  			uma_zone_set_max(zone_jumbo16, nmbjumbo16);
>> +			nmbjumbo16 = uma_zone_get_max(zone_jumbo16);
>>  		} else
>>  			error = EINVAL;
>>  	}

The code surrounding these style bugs uses the technique of nested elses
to make the indentation march to the right, when it could simply return
after an error and after the usual case where the sysctl only does a
read.  For one of the sysctls, the old code is:

@ static int
@ sysctl_nmbclusters(SYSCTL_HANDLER_ARGS)
@ {
@ 	int error, newnmbclusters;
@ 
@ 	newnmbclusters = nmbclusters;
@ 	error = sysctl_handle_int(oidp, &newnmbclusters, 0, req);

Style bug: hard-coding of the variable type (int) hard-coding
sysctl_handle_int() here.

@ 	if (error == 0 && req->newptr) {

Style bug: boolean check for null pointer.

Technique for marching to the right starts here.

@ 		if (newnmbclusters > nmbclusters) {

Use another nested else to get 2 indentation levels deeper than necessary
altogether.

@ 			nmbclusters = newnmbclusters;
@ 			uma_zone_set_max(zone_clust, nmbclusters);
@ 			EVENTHANDLER_INVOKE(nmbclusters_change);
@ 		} else
@ 			error = EINVAL;

Only increasing nmbclusters is supported.  Even null changes of it give
this error.

@ 	}
@ 	return (error);
@ }

After fixing some style bugs and reducing the indentation.

@@ 	if (error != 0 || req->newptr == NULL)
@@		return (error);
@@ 	if (newnmbclusters <= nmbclusters)
@@		return (EINVAL);	/* XXX preserve bogusness when equal */
@@ 	nmbclusters = newnmbclusters;
@@ 	uma_zone_set_max(zone_clust, nmbclusters);
@@ 	EVENTHANDLER_INVOKE(nmbclusters_change);
@@ 	return (0);
@@ }

@ SYSCTL_PROC(_kern_ipc, OID_AUTO, nmbclusters, CTLTYPE_INT|CTLFLAG_RW,

Missing spaces around binary operation '|'.

@ &nmbclusters, 0, sysctl_nmbclusters, "IU",

Missing indentation.

@     "Maximum number of mbuf clusters allowed");
@

Back to the commit.

>> @@ -221,6 +230,27 @@ SYSCTL_PROC(_kern_ipc, OID_AUTO, nmbjumb
>>  &nmbjumbo16, 0, sysctl_nmbjumbo16, "IU",
>>      "Maximum number of mbuf 16k jumbo clusters allowed");
>>
>> +static int
>> +sysctl_nmbufs(SYSCTL_HANDLER_ARGS)
>> +{
>> +	int error, newnmbufs;
>> +
>> +	newnmbufs = nmbufs;
>> +	error = sysctl_handle_int(oidp, &newnmbufs, 0, req);
>> +	if (error == 0 && req->newptr) {
>> +		if (newnmbufs > nmbufs) {
>> +			nmbufs = newnmbufs;
>> +			uma_zone_set_max(zone_mbuf, nmbufs);
>> +			nmbclusters = uma_zone_get_max(zone_mbuf);
>> +			EVENTHANDLER_INVOKE(nmbufs_change);
>> +		} else
>> +			error = EINVAL;
>> +	}
>> +	return (error);
>> +}
>> +SYSCTL_PROC(_kern_ipc, OID_AUTO, nmbuf, CTLTYPE_INT|CTLFLAG_RW,
>> +&nmbufs, 0, sysctl_nmbufs, "IU",
>> +    "Maximum number of mbufs allowed");

All the bugs in the old sysctl are cloned here.

>>
>>

This is not bug for bug compatible.  There used to be 3 blank lines here
(2 extra).  Now there are only 2 (1 extra).

>> Modified: head/sys/kern/subr_param.c
>> ==============================================================================
>> --- head/sys/kern/subr_param.c	Tue Nov 27 20:22:36 2012	(r243630)
>> +++ head/sys/kern/subr_param.c	Tue Nov 27 21:19:58 2012	(r243631)
>> @@ -93,6 +93,7 @@ int	ncallout;			/* maximum # of timer ev
>>  int	nbuf;
>>  int	ngroups_max;			/* max # groups per process */
>>  int	nswbuf;
>> +long	maxmbufmem;			/* max mbuf memory */

This can overflow too, since once the other overflows are fixed, it is
only limited to 3/4 of physical memory so the limit on i386 is 3G, which
overflows to -1G when stored here.

>>  pid_t	pid_max = PID_MAX;
>>  long	maxswzone;			/* max swmeta KVA storage */
>>  long	maxbcache;			/* max buffer cache KVA storage */
>> @@ -270,6 +271,7 @@ init_param1(void)
>>  void
>>  init_param2(long physpages)

vm mostly uses u_int for page counts (see <sys/vmmeter.h>.  v_page_count
is only u_int).  This avoids most overflow problems.  I think alc wants
to change to longs.  But long is just bogus here.  It can't even handle
the range of u_int on 32-bit systems.  But since 32-bit systems can't
have >= 2**31 pages, this type error is only cosmetic.

>>  {
>> +	long realmem;

u_long would work up to 4GB.

>>
>>  	/* Base parameters */
>>  	maxusers = MAXUSERS;
>> @@ -293,19 +295,25 @@ init_param2(long physpages)
>>  	/*
>>  	 * The following can be overridden after boot via sysctl.  Note:
>>  	 * unless overriden, these macros are ultimately based on maxusers.
>> -	 */
>> -	maxproc = NPROC;
>> -	TUNABLE_INT_FETCH("kern.maxproc", &maxproc);
>> -	/*
>>  	 * Limit maxproc so that kmap entries cannot be exhausted by
>>  	 * processes.
>>  	 */
>> +	maxproc = NPROC;
>> +	TUNABLE_INT_FETCH("kern.maxproc", &maxproc);

The comments are misplaced in different ways than before.

>>  	if (maxproc > (physpages / 12))

Old style bug: excessive parentheses.

>>  		maxproc = physpages / 12;

I dodn't like limiting the values that can be set by tunables.  This
prevents the administrator overring bad limits, including ones caused
by overflows.

>> -	maxfiles = MAXFILES;
>> -	TUNABLE_INT_FETCH("kern.maxfiles", &maxfiles);
>>  	maxprocperuid = (maxproc * 9) / 10;
>> -	maxfilesperproc = (maxfiles * 9) / 10;
>> +
>> +	/*
>> +	 * The default limit for maxfiles is 1/12 of the number of
>> +	 * physical page but not less than 16 times maxusers.
>> +	 * At most it can be 1/6 the number of physical pages.
>> +	 */

Very wrong comment:
- 1/12 is unrelated to maxfiles.  It is for maxproc, and is used for
   that without comment above.  The code uses 1/8.
- 16 is quite wrong too, but the actual limit is hard to describe.  The
   code uses a limit of MAXFILES, where MAXFILES defaults to (maxproc * 2)
   but can be #defined; maxproc defaults to NPROC but can be clamped.
   NPROC is (20 + 16 * maxusers), and cannot be #defined.  The 16 in it
   seems to be the relevant magic numbner.  But since there is a factor
   of 2 in (maxproc * 2), 16 is just wrong and 32 is closer to being
   correct.
- 1/6 seems to be wrong too.  The code uses 1/4.  I don't like limiting
   tunable at all.

This code has too many comments, and most of them are not useful.  Most
of them just echo the magic numbers in the code (here the code's value
of 1/8 is not even echoed).  Few or none describe when these values are
what the are.  Of course, this is hard to describe since the values are
either heuristics or the result of very complicated calculations to
ensure that everything can be packed into the available physical/virtual
memory, or more likely some fuzzy combination of these.


>> +	maxfiles = imax(MAXFILES, physpages / 8);
>> +	TUNABLE_INT_FETCH("kern.maxfiles", &maxfiles);
>> +	if (maxfiles > (physpages / 4))

Style bug (excessive parentheses).

>> +		maxfiles = physpages / 4;
>> +	maxfilesperproc = (maxfiles / 10) * 9;

Style bug (excessive parentheses).

>>
>>  	/*
>>  	 * Cannot be changed after boot.
>> @@ -313,20 +321,35 @@ init_param2(long physpages)
>>  	nbuf = NBUF;
>>  	TUNABLE_INT_FETCH("kern.nbuf", &nbuf);
>>
>> +	/*
>> +	 * XXX: Does the callout wheel have to be so big?
>> +	 */
>>  	ncallout = 16 + maxproc + maxfiles;
>>  	TUNABLE_INT_FETCH("kern.ncallout", &ncallout);
>>
>>  	/*
>> +	 * The default limit for all mbuf related memory is 1/2 of all
>> +	 * available kernel memory (physical or kmem).
>> +	 * At most it can be 3/4 of available kernel memory.
>> +	 */
>> +	realmem = lmin(physpages * PAGE_SIZE,
>> +			VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS);

As well as overflowing, this has a weird continuation indentation of 2
tabs to not line up with anything, instead of the KNF 4 spaces to line
up with that level of indentation.

>> +	maxmbufmem = realmem / 2;
>> +	TUNABLE_LONG_FETCH("kern.maxmbufmem", &maxmbufmem);
>> +	if (maxmbufmem > (realmem / 4) * 3)
>> +		maxmbufmem = (realmem / 4) * 3;

Style bugs (excessive indentation; 2 instances).

You added a lot of upper clamps on tunables.  I don't like them, as above.
There are no lower clamps, so foot shooting is still easy by setting
tunables to preposterous values like 0 or negative, or to large values
that overflow to negative.  And if some other subsystem uses a tunable
or option like VM_BCACHE_SIZE_MAX to reserve lots of kva for itself,
none of the upper clams here is right.

>> +
>> +	/*
>>  	 * The default for maxpipekva is min(1/64 of the kernel address space,
>>  	 * max(1/64 of main memory, 512KB)).  See sys_pipe.c for more details.
>>  	 */
>>  	maxpipekva = (physpages / 64) * PAGE_SIZE;

Old style bug (excessive parentheses).

This divides by 64 first, so its multiplication doesn't overflow so easily. 
It works up to 64 * 2GB.  I don't know the limit for PAE, but this much
physical memory is easy to reach now.

>> +	TUNABLE_LONG_FETCH("kern.ipc.maxpipekva", &maxpipekva);
>>  	if (maxpipekva < 512 * 1024)
>>  		maxpipekva = 512 * 1024;
>>  	if (maxpipekva > (VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS) / 64)
>>  		maxpipekva = (VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS) /
>>  		    64;
>> -	TUNABLE_LONG_FETCH("kern.ipc.maxpipekva", &maxpipekva);

This used to be correctly placed, so that the administrator can set any
value.

>>  }
>>
>>  /*
>>
>> Modified: head/sys/kern/uipc_socket.c
>> ==============================================================================
>> --- head/sys/kern/uipc_socket.c	Tue Nov 27 20:22:36 2012	(r243630)
>> +++ head/sys/kern/uipc_socket.c	Tue Nov 27 21:19:58 2012	(r243631)
>> @@ -306,12 +306,9 @@ sysctl_maxsockets(SYSCTL_HANDLER_ARGS)
>>  	newmaxsockets = maxsockets;
>>  	error = sysctl_handle_int(oidp, &newmaxsockets, 0, req);
>>  	if (error == 0 && req->newptr) {
>> -		if (newmaxsockets > maxsockets) {
>> +		if (newmaxsockets > maxsockets &&
>> +		    newmaxsockets <= maxfiles) {

No need to split this line.

>>  			maxsockets = newmaxsockets;
>> -			if (maxsockets > ((maxfiles / 4) * 3)) {
>> -				maxfiles = (maxsockets * 5) / 4;
>> -				maxfilesperproc = (maxfiles * 9) / 10;
>> -			}

Bad examples with exessive parnetheses in old code.

>>  			EVENTHANDLER_INVOKE(maxsockets_change);
>>  		} else
>>  			error = EINVAL;
>>

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20121129072617.T1123>