From owner-svn-src-head@FreeBSD.ORG Wed Nov 28 21:45:49 2012 Return-Path: Delivered-To: svn-src-head@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 77472F62; Wed, 28 Nov 2012 21:45:49 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au [211.29.132.187]) by mx1.freebsd.org (Postfix) with ESMTP id C06278FC13; Wed, 28 Nov 2012 21:45:48 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qASLjYRB008859 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 29 Nov 2012 08:45:39 +1100 Date: Thu, 29 Nov 2012 08:45:29 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Alan Cox Subject: Re: svn commit: r243631 - in head/sys: kern sys In-Reply-To: <50B64BE8.3040708@rice.edu> Message-ID: <20121129072617.T1123@besplex.bde.org> References: <201211272119.qARLJxXV061083@svn.freebsd.org> <50B64BE8.3040708@rice.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-Cloudmark-Score: 0 X-Optus-Cloudmark-Analysis: v=2.0 cv=Zoq1sKHG c=1 sm=1 a=PcN1GEuqyV8A:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=Y5AXMyR2jB0A:10 a=ZpmuPUM5UnzFerMvZvEA:9 a=CjuIK1q_8ugA:10 a=qNwCHDfXISvSPmd_:21 a=yMgbbjGJPBlhYQA9:21 a=LyIlnvU96IiO9QpA:21 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Andre Oppermann X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 28 Nov 2012 21:45:49 -0000 On Wed, 28 Nov 2012, Alan Cox wrote: > I'm pretty sure that the "realmem" calculation is going to overflow on > i386/PAE, where the number of bytes of physical memory is greater than > the type long can represent. It overflows on i386 even without PAE, where the number of bytes of physical memory is greater than the type long can represent (2GB). This is the usual case for new hardware. I think it changes the defaults for machines with small amounts of memory (say 32MB) in dangerous ways. Reserving half of kva for mbufs is network-centric. I reserve more than half of kva for the buffer cache in some configurations (VM_BCACHE_SIZE_MAX defaults too 200 MB, but I use 512 MB), and in the old scheme where the default for mbufs was under-sized, this may even have fitted without separate tuning for mbufs. The code has lots of style bugs. > On 11/27/2012 15:19, Andre Oppermann wrote: >> ... >> NB: Every cluster also has an mbuf associated with it. >> >> Two examples on the revised mbuf sizing limits: >> >> 1GB KVM: >> 512MB limit for mbufs >> 419,430 mbufs >> 65,536 2K mbuf clusters >> 32,768 4K mbuf clusters >> 9,709 9K mbuf clusters >> 5,461 16K mbuf clusters >> >> 16GB RAM: >> 8GB limit for mbufs >> 33,554,432 mbufs >> 1,048,576 2K mbuf clusters >> 524,288 4K mbuf clusters >> 155,344 9K mbuf clusters >> 87,381 16K mbuf clusters >> >> These defaults should be sufficient for even the most demanding >> network loads. I noticed in earlier mail that there were no examples with small amounts of physical memory. This case mixes unusually with the relatively large amount of kva. PAE gives the opposite extreme. >> Modified: head/sys/kern/kern_mbuf.c >> ============================================================================== >> --- head/sys/kern/kern_mbuf.c Tue Nov 27 20:22:36 2012 (r243630) >> +++ head/sys/kern/kern_mbuf.c Tue Nov 27 21:19:58 2012 (r243631) >> @@ -147,9 +148,11 @@ sysctl_nmbclusters(SYSCTL_HANDLER_ARGS) >> newnmbclusters = nmbclusters; >> error = sysctl_handle_int(oidp, &newnmbclusters, 0, req); >> if (error == 0 && req->newptr) { >> - if (newnmbclusters > nmbclusters) { >> + if (newnmbclusters > nmbclusters && >> + nmbufs >= nmbclusters + nmbjumbop + nmbjumbo9 + nmbjumbo16) { Style bug: long line. This bogotifies early splitting in the previous line. >> nmbclusters = newnmbclusters; >> uma_zone_set_max(zone_clust, nmbclusters); >> + nmbclusters = uma_zone_get_max(zone_clust); >> EVENTHANDLER_INVOKE(nmbclusters_change); >> } else >> error = EINVAL; >> @@ -168,9 +171,11 @@ sysctl_nmbjumbop(SYSCTL_HANDLER_ARGS) >> newnmbjumbop = nmbjumbop; >> error = sysctl_handle_int(oidp, &newnmbjumbop, 0, req); >> if (error == 0 && req->newptr) { >> - if (newnmbjumbop> nmbjumbop) { >> + if (newnmbjumbop > nmbjumbop && >> + nmbufs >= nmbclusters + nmbjumbop + nmbjumbo9 + nmbjumbo16) { Style bug, as above. >> nmbjumbop = newnmbjumbop; >> uma_zone_set_max(zone_jumbop, nmbjumbop); >> + nmbjumbop = uma_zone_get_max(zone_jumbop); >> } else >> error = EINVAL; >> } >> @@ -189,9 +194,11 @@ sysctl_nmbjumbo9(SYSCTL_HANDLER_ARGS) >> newnmbjumbo9 = nmbjumbo9; >> error = sysctl_handle_int(oidp, &newnmbjumbo9, 0, req); >> if (error == 0 && req->newptr) { >> - if (newnmbjumbo9> nmbjumbo9) { >> + if (newnmbjumbo9 > nmbjumbo9&& Style bug, not as above (no space before '&&'). >> + nmbufs >= nmbclusters + nmbjumbop + nmbjumbo9 + nmbjumbo16) { Style bug, as above. >> nmbjumbo9 = newnmbjumbo9; >> uma_zone_set_max(zone_jumbo9, nmbjumbo9); >> + nmbjumbo9 = uma_zone_get_max(zone_jumbo9); >> } else >> error = EINVAL; >> } >> @@ -209,9 +216,11 @@ sysctl_nmbjumbo16(SYSCTL_HANDLER_ARGS) >> newnmbjumbo16 = nmbjumbo16; >> error = sysctl_handle_int(oidp, &newnmbjumbo16, 0, req); >> if (error == 0 && req->newptr) { >> - if (newnmbjumbo16> nmbjumbo16) { >> + if (newnmbjumbo16 > nmbjumbo16 && >> + nmbufs >= nmbclusters + nmbjumbop + nmbjumbo9 + nmbjumbo16) { Style bug, as above. >> nmbjumbo16 = newnmbjumbo16; >> uma_zone_set_max(zone_jumbo16, nmbjumbo16); >> + nmbjumbo16 = uma_zone_get_max(zone_jumbo16); >> } else >> error = EINVAL; >> } The code surrounding these style bugs uses the technique of nested elses to make the indentation march to the right, when it could simply return after an error and after the usual case where the sysctl only does a read. For one of the sysctls, the old code is: @ static int @ sysctl_nmbclusters(SYSCTL_HANDLER_ARGS) @ { @ int error, newnmbclusters; @ @ newnmbclusters = nmbclusters; @ error = sysctl_handle_int(oidp, &newnmbclusters, 0, req); Style bug: hard-coding of the variable type (int) hard-coding sysctl_handle_int() here. @ if (error == 0 && req->newptr) { Style bug: boolean check for null pointer. Technique for marching to the right starts here. @ if (newnmbclusters > nmbclusters) { Use another nested else to get 2 indentation levels deeper than necessary altogether. @ nmbclusters = newnmbclusters; @ uma_zone_set_max(zone_clust, nmbclusters); @ EVENTHANDLER_INVOKE(nmbclusters_change); @ } else @ error = EINVAL; Only increasing nmbclusters is supported. Even null changes of it give this error. @ } @ return (error); @ } After fixing some style bugs and reducing the indentation. @@ if (error != 0 || req->newptr == NULL) @@ return (error); @@ if (newnmbclusters <= nmbclusters) @@ return (EINVAL); /* XXX preserve bogusness when equal */ @@ nmbclusters = newnmbclusters; @@ uma_zone_set_max(zone_clust, nmbclusters); @@ EVENTHANDLER_INVOKE(nmbclusters_change); @@ return (0); @@ } @ SYSCTL_PROC(_kern_ipc, OID_AUTO, nmbclusters, CTLTYPE_INT|CTLFLAG_RW, Missing spaces around binary operation '|'. @ &nmbclusters, 0, sysctl_nmbclusters, "IU", Missing indentation. @ "Maximum number of mbuf clusters allowed"); @ Back to the commit. >> @@ -221,6 +230,27 @@ SYSCTL_PROC(_kern_ipc, OID_AUTO, nmbjumb >> &nmbjumbo16, 0, sysctl_nmbjumbo16, "IU", >> "Maximum number of mbuf 16k jumbo clusters allowed"); >> >> +static int >> +sysctl_nmbufs(SYSCTL_HANDLER_ARGS) >> +{ >> + int error, newnmbufs; >> + >> + newnmbufs = nmbufs; >> + error = sysctl_handle_int(oidp, &newnmbufs, 0, req); >> + if (error == 0 && req->newptr) { >> + if (newnmbufs > nmbufs) { >> + nmbufs = newnmbufs; >> + uma_zone_set_max(zone_mbuf, nmbufs); >> + nmbclusters = uma_zone_get_max(zone_mbuf); >> + EVENTHANDLER_INVOKE(nmbufs_change); >> + } else >> + error = EINVAL; >> + } >> + return (error); >> +} >> +SYSCTL_PROC(_kern_ipc, OID_AUTO, nmbuf, CTLTYPE_INT|CTLFLAG_RW, >> +&nmbufs, 0, sysctl_nmbufs, "IU", >> + "Maximum number of mbufs allowed"); All the bugs in the old sysctl are cloned here. >> >> This is not bug for bug compatible. There used to be 3 blank lines here (2 extra). Now there are only 2 (1 extra). >> Modified: head/sys/kern/subr_param.c >> ============================================================================== >> --- head/sys/kern/subr_param.c Tue Nov 27 20:22:36 2012 (r243630) >> +++ head/sys/kern/subr_param.c Tue Nov 27 21:19:58 2012 (r243631) >> @@ -93,6 +93,7 @@ int ncallout; /* maximum # of timer ev >> int nbuf; >> int ngroups_max; /* max # groups per process */ >> int nswbuf; >> +long maxmbufmem; /* max mbuf memory */ This can overflow too, since once the other overflows are fixed, it is only limited to 3/4 of physical memory so the limit on i386 is 3G, which overflows to -1G when stored here. >> pid_t pid_max = PID_MAX; >> long maxswzone; /* max swmeta KVA storage */ >> long maxbcache; /* max buffer cache KVA storage */ >> @@ -270,6 +271,7 @@ init_param1(void) >> void >> init_param2(long physpages) vm mostly uses u_int for page counts (see . v_page_count is only u_int). This avoids most overflow problems. I think alc wants to change to longs. But long is just bogus here. It can't even handle the range of u_int on 32-bit systems. But since 32-bit systems can't have >= 2**31 pages, this type error is only cosmetic. >> { >> + long realmem; u_long would work up to 4GB. >> >> /* Base parameters */ >> maxusers = MAXUSERS; >> @@ -293,19 +295,25 @@ init_param2(long physpages) >> /* >> * The following can be overridden after boot via sysctl. Note: >> * unless overriden, these macros are ultimately based on maxusers. >> - */ >> - maxproc = NPROC; >> - TUNABLE_INT_FETCH("kern.maxproc", &maxproc); >> - /* >> * Limit maxproc so that kmap entries cannot be exhausted by >> * processes. >> */ >> + maxproc = NPROC; >> + TUNABLE_INT_FETCH("kern.maxproc", &maxproc); The comments are misplaced in different ways than before. >> if (maxproc > (physpages / 12)) Old style bug: excessive parentheses. >> maxproc = physpages / 12; I dodn't like limiting the values that can be set by tunables. This prevents the administrator overring bad limits, including ones caused by overflows. >> - maxfiles = MAXFILES; >> - TUNABLE_INT_FETCH("kern.maxfiles", &maxfiles); >> maxprocperuid = (maxproc * 9) / 10; >> - maxfilesperproc = (maxfiles * 9) / 10; >> + >> + /* >> + * The default limit for maxfiles is 1/12 of the number of >> + * physical page but not less than 16 times maxusers. >> + * At most it can be 1/6 the number of physical pages. >> + */ Very wrong comment: - 1/12 is unrelated to maxfiles. It is for maxproc, and is used for that without comment above. The code uses 1/8. - 16 is quite wrong too, but the actual limit is hard to describe. The code uses a limit of MAXFILES, where MAXFILES defaults to (maxproc * 2) but can be #defined; maxproc defaults to NPROC but can be clamped. NPROC is (20 + 16 * maxusers), and cannot be #defined. The 16 in it seems to be the relevant magic numbner. But since there is a factor of 2 in (maxproc * 2), 16 is just wrong and 32 is closer to being correct. - 1/6 seems to be wrong too. The code uses 1/4. I don't like limiting tunable at all. This code has too many comments, and most of them are not useful. Most of them just echo the magic numbers in the code (here the code's value of 1/8 is not even echoed). Few or none describe when these values are what the are. Of course, this is hard to describe since the values are either heuristics or the result of very complicated calculations to ensure that everything can be packed into the available physical/virtual memory, or more likely some fuzzy combination of these. >> + maxfiles = imax(MAXFILES, physpages / 8); >> + TUNABLE_INT_FETCH("kern.maxfiles", &maxfiles); >> + if (maxfiles > (physpages / 4)) Style bug (excessive parentheses). >> + maxfiles = physpages / 4; >> + maxfilesperproc = (maxfiles / 10) * 9; Style bug (excessive parentheses). >> >> /* >> * Cannot be changed after boot. >> @@ -313,20 +321,35 @@ init_param2(long physpages) >> nbuf = NBUF; >> TUNABLE_INT_FETCH("kern.nbuf", &nbuf); >> >> + /* >> + * XXX: Does the callout wheel have to be so big? >> + */ >> ncallout = 16 + maxproc + maxfiles; >> TUNABLE_INT_FETCH("kern.ncallout", &ncallout); >> >> /* >> + * The default limit for all mbuf related memory is 1/2 of all >> + * available kernel memory (physical or kmem). >> + * At most it can be 3/4 of available kernel memory. >> + */ >> + realmem = lmin(physpages * PAGE_SIZE, >> + VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS); As well as overflowing, this has a weird continuation indentation of 2 tabs to not line up with anything, instead of the KNF 4 spaces to line up with that level of indentation. >> + maxmbufmem = realmem / 2; >> + TUNABLE_LONG_FETCH("kern.maxmbufmem", &maxmbufmem); >> + if (maxmbufmem > (realmem / 4) * 3) >> + maxmbufmem = (realmem / 4) * 3; Style bugs (excessive indentation; 2 instances). You added a lot of upper clamps on tunables. I don't like them, as above. There are no lower clamps, so foot shooting is still easy by setting tunables to preposterous values like 0 or negative, or to large values that overflow to negative. And if some other subsystem uses a tunable or option like VM_BCACHE_SIZE_MAX to reserve lots of kva for itself, none of the upper clams here is right. >> + >> + /* >> * The default for maxpipekva is min(1/64 of the kernel address space, >> * max(1/64 of main memory, 512KB)). See sys_pipe.c for more details. >> */ >> maxpipekva = (physpages / 64) * PAGE_SIZE; Old style bug (excessive parentheses). This divides by 64 first, so its multiplication doesn't overflow so easily. It works up to 64 * 2GB. I don't know the limit for PAE, but this much physical memory is easy to reach now. >> + TUNABLE_LONG_FETCH("kern.ipc.maxpipekva", &maxpipekva); >> if (maxpipekva < 512 * 1024) >> maxpipekva = 512 * 1024; >> if (maxpipekva > (VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS) / 64) >> maxpipekva = (VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS) / >> 64; >> - TUNABLE_LONG_FETCH("kern.ipc.maxpipekva", &maxpipekva); This used to be correctly placed, so that the administrator can set any value. >> } >> >> /* >> >> Modified: head/sys/kern/uipc_socket.c >> ============================================================================== >> --- head/sys/kern/uipc_socket.c Tue Nov 27 20:22:36 2012 (r243630) >> +++ head/sys/kern/uipc_socket.c Tue Nov 27 21:19:58 2012 (r243631) >> @@ -306,12 +306,9 @@ sysctl_maxsockets(SYSCTL_HANDLER_ARGS) >> newmaxsockets = maxsockets; >> error = sysctl_handle_int(oidp, &newmaxsockets, 0, req); >> if (error == 0 && req->newptr) { >> - if (newmaxsockets > maxsockets) { >> + if (newmaxsockets > maxsockets && >> + newmaxsockets <= maxfiles) { No need to split this line. >> maxsockets = newmaxsockets; >> - if (maxsockets > ((maxfiles / 4) * 3)) { >> - maxfiles = (maxsockets * 5) / 4; >> - maxfilesperproc = (maxfiles * 9) / 10; >> - } Bad examples with exessive parnetheses in old code. >> EVENTHANDLER_INVOKE(maxsockets_change); >> } else >> error = EINVAL; >> Bruce