From: Terry Lambert <tlambert2@mindspring.com>
Date: Fri, 13 Jul 2001 12:29:46 -0700
To: Leo Bicknell
Cc: Matt Dillon, hackers@FreeBSD.ORG
Subject: Re: Network performance tuning.

Leo Bicknell wrote:
> > The problem is that the tcpcb's, inpcb's, etc., are all
> > pre-reserved out of the KVA space map, so that they can
> > be allocated safely at interrupt, or because "that's how
> > the zone allocator works".
>
> I think the only critical resource here is MBUF's, which today are
> preallocated at boot time.  There are memory fragmentation concerns
> with allocating/deallocating them on the fly.

The tcpcb's, inpcb's, etc. are in a similar boat; see "zalloci" and
"ziniti".

> I am not going to even attempt to get into the world of kernel
> memory allocators, that's way out of my league.  That said, the
> interesting cases (in increasing order of difficulty):

I have an allocator that addresses the fragmentation issues; it could
also be jammed into a Dynix-style allocator (a la Bosko/Alfred) pretty
easily.  I haven't done that because of the need for a three-tier
scheme (Dynix uses two tiers) to allow recovery of the resource blocks
over time, making them non-type-stable and therefore capable of being
repurposed (Dynix does this).  The third tier grabs a contiguous chunk
of KVA to back the second tier, so that allocations can occur at
interrupt time (as in the current zone allocator, which prereserves
the page table mappings).

The zone allocator also aligns to 32-byte boundaries, when it really
only needs to align to sizeof(long) boundaries (my allocator does this
for internal object boundaries, and does not have wasted "partial
pages").

The main problem is that, in order to do interrupt-level allocations,
ziniti() expects to preallocate the page table mappings (just as the
mbuf allocation does), so that the zone can be filled from free RAM.
This is also the reason that running out of free RAM causes mbuf
allocations "to do bad things": you can't overcommit pages that are
going to be assigned at fault-in-interrupt time.
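To make that concrete, here is a minimal userland sketch of the
property being described.  The names are made up (this is not the
actual zalloci()/ziniti() interface), malloc() stands in for the
prereserved KVA chunk, and real interrupt-time code would also need
spl protection:

#include <stddef.h>
#include <stdlib.h>

struct zitem {
	struct zitem *znext;	/* free items link through their own storage */
};

struct zone {
	struct zitem *zfree;	/* freelist head */
	char *zbase;		/* contiguous backing store, reserved at init */
	size_t zsize;		/* rounded object size */
};

/*
 * Reserve everything up front; in the kernel, this is where the
 * contiguous KVA chunk and its page table mappings would be
 * prereserved.  May block; never callable at interrupt time.
 */
static int
zone_init(struct zone *z, size_t size, int nitems)
{
	int i;

	if (size < sizeof(struct zitem))
		size = sizeof(struct zitem);
	/* align to sizeof(long), not 32 bytes, as argued above */
	size = (size + sizeof(long) - 1) & ~(sizeof(long) - 1);
	z->zsize = size;
	z->zbase = malloc(size * nitems);
	if (z->zbase == NULL)
		return (-1);
	z->zfree = NULL;
	for (i = 0; i < nitems; i++) {
		struct zitem *it = (struct zitem *)(z->zbase + i * size);

		it->znext = z->zfree;
		z->zfree = it;
	}
	return (0);
}

/*
 * Interrupt-safe by construction: nothing here can touch the VM
 * system or fault; exhaustion just returns NULL.
 */
static void *
zone_alloc(struct zone *z)
{
	struct zitem *it = z->zfree;

	if (it != NULL)
		z->zfree = it->znext;
	return (it);
}

static void
zone_free(struct zone *z, void *p)
{
	struct zitem *it = p;

	it->znext = z->zfree;
	z->zfree = it;
}

int
main(void)
{
	struct zone z;
	void *a;

	if (zone_init(&z, 192, 21) != 0)	/* one page of 192 byte objects */
		return (1);
	a = zone_alloc(&z);
	zone_free(&z, a);
	return (0);
}

The point is that zone_alloc() touches nothing but memory reserved in
zone_init(); exhaustion returns NULL rather than faulting, which is
exactly why the zone has to be sized, and its KVA reserved, up front.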
> 1) Allowing an admin to change the number of MBUF's on the fly
>    (with sysctl).  Presumably these would be infrequent events.

This is pretty much "not a chance in hell"; even though mbufs are
sized such that the page size is an even multiple of the mbuf size,
the allocator can't really handle the idea of a zone not being
contiguous, since there are other objects that are not sized such
that page size modulo object size leaves no remainder (e.g. a 192
byte tcpcb: 4096 mod 192 = 64, so each page holds 21 objects and
wastes 64 bytes).

Thus, you cannot get away from the KVA contiguity requirement without
separating memory into interrupt and non-interrupt zones on one axis,
high, medium, and low persistence objects on another axis, and object
cluster size on a third axis.  This gets even more complex when you
factor in per-CPU memory pools for SMP.

> 2) Allowing MBUF's to be allocated/deallocated in fixed size
>    blocks easy for the allocator to deal with.  (Eg, you always
>    have 128k to 4 M of MBUF's allocated in 128k chunks.)

The problem with this is still that the page mappings must exist,
since mbufs are allocated by drivers at interrupt time out of
preassigned KVA space.  In a livelock situation, you will find that
you cannot drop into non-interrupt context to grab your next 4M chunk
of KVA space.

Setting arbitrary power-of-two size limits is also bad, unless your
allocator is very, very clever.  It's impossible to be that clever
with a fixed-size "superallocation" target: you have to think in
terms of page units.

> 3) Allowing MBUF's to be fully dynamically allocated.
>
> I'm not sure I see any value to #3.  I see huge value to #1
> (when you run low, you can say double the number on an active
> server).  If we get the warning I want (from another message)
> #1 becomes even more useful.

Can't happen without a complete rework, so that allocations at
interrupt time are permissible.  The major problem here is that you
have a finite KVA space, and you can't reuse it without swapping, and
you can't swap to disk in the middle of a network interrupt.  It's a
chicken-and-egg problem.  I'm not aware of an OS that has solved it
(not to mention that your swap may be NFS mounted).

> #2 would take some study.  The root question is does allocating
> them in blocks eliminate the memory fragmentation concern for
> the kernel allocator?  If the answer is yes, it's probably
> something to look into, if the answer is no, probably not.

Not as the allocator presently exists.  The fragmentation concern is
over the contiguity of the region, not over having fragments lying
around.

Realize that, in the limit, it's possible to defrag the KVA space:
as long as the data is not in the defrag code path, we're just
talking about objects allocated in KVA space, which isn't the
physical space, and we only rarely care about physical contiguity.
Doing this causes some problems, but they are problems we currently
have (e.g. drivers that _do_ care about physical contiguity, and are
unable to allocate physically contiguous space, still could not have
physical memory defragged for them to make a large enough contiguous
region available -- we don't defrag at all, today).  The cost is that
you would be carrying around physical instead of virtual addresses
for your allocations, and ptov'ing them for kernel use, instead of
vtop'ing them for driver use.

It wouldn't take as much study as it would take a hell of a lot of
work.

-- 
Terry
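PS: for concreteness, a minimal userland sketch of the "defraggable
KVA" idea above.  Again, the names are made up; m_deref() plays the
role of ptov()'ing a stored address before each use, and malloc()
stands in for the backing store.  Consumers hold a stable handle
instead of a raw pointer, so the compactor is free to move objects,
provided nothing caches an address across a pass (the "data not in
the defrag code path" constraint):

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NSLOTS	128

struct handle {
	void *addr;	/* current location; the compactor may rewrite this */
	size_t size;
};

static struct handle htab[NSLOTS];

/* allocate a movable object; returns a slot index, or -1 */
static int
m_alloc(size_t size)
{
	int i;

	for (i = 0; i < NSLOTS; i++) {
		if (htab[i].addr == NULL) {
			if ((htab[i].addr = malloc(size)) == NULL)
				return (-1);
			htab[i].size = size;
			return (i);
		}
	}
	return (-1);
}

/*
 * Dereference a handle.  The returned pointer is only good until
 * the next compaction pass: the "no cached addresses" rule.
 */
static void *
m_deref(int h)
{
	return (htab[h].addr);
}

/* compaction pass: relocate every live object and update its slot */
static void
m_compact(void)
{
	int i;

	for (i = 0; i < NSLOTS; i++) {
		void *np;

		if (htab[i].addr == NULL)
			continue;
		if ((np = malloc(htab[i].size)) == NULL)
			continue;	/* can't move it; leave it */
		memcpy(np, htab[i].addr, htab[i].size);
		free(htab[i].addr);
		htab[i].addr = np;
	}
}

int
main(void)
{
	int h = m_alloc(64);

	if (h == -1)
		return (1);
	memset(m_deref(h), 0, 64);
	m_compact();			/* the object may move... */
	memset(m_deref(h), 1, 64);	/* ...but the handle still resolves */
	return (0);
}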