Date:      Fri, 13 Jul 2001 12:29:46 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Leo Bicknell <bicknell@ufp.org>
Cc:        Matt Dillon <dillon@earth.backplane.com>, hackers@FreeBSD.ORG
Subject:   Re: Network performance tuning.
Message-ID:  <3B4F4C2A.BF64E68D@mindspring.com>
References:  <15.16ffaf54.287f3d4d@aol.com> <20010712135629.A49042@ussenterprise.ufp.org> <200107130128.f6D1SFE59148@earth.backplane.com> <3B4F36AE.857511FF@mindspring.com> <20010713140326.A23982@ussenterprise.ufp.org>

Leo Bicknell wrote:
> > The problem is that the tcpcb's, inpcb's, etc., are all
> > pre-reserved out of the KVA space map, so that they can
> > be allocated safely at interrupt, or because "that's how
> > the zone allocator works".
> 
> I think the only critical resource here is MBUF's, which today are
> preallocated at boot time.  There are memory fragmentation concerns
> with allocating/deallocating them on the fly.

The tcpcb's, inpcb's, etc. are in a similar boat; see "zalloci"
and "ziniti".


> I am not going to even attempt to get into the world of kernel
> memory allocators, that's way out of my league.  That said, the
> interesting cases (in increasing order of difficulty):

I have an allocator that addresses the fragmentation issues;
it can be jammed into a Dynix allocator (Bosko/Alfred-style),
as well, pretty easily.  I haven't done that because of the
need to have a three tier scheme (Dynix uses a two tier) to
allow recovery of the resource blocks over time to make them
non-type-stable, and therefore capable of being repurposed
(Dynix does this).  The third tier is to grab a contiguous
chunk of KVA to back the second tier, so that allocations can
occur at interrupt time (as in the current zone allocator,
which prereserves the page table mappings).

The zone allocator also aligns to 32 byte boundaries, when
really it should only be aligning to sizeof(long) boundaries
(my allocator does this for internal object boundaries, and
does not have wasted "partial pages").
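
As a rough illustration of the alignment cost (a sketch only, not
code from the zone allocator or from my allocator; the helper names
round_up and pad_bytes and the 100 byte object are made up for the
example), compare rounding an object up to a 32 byte boundary with
rounding it up to an 8 byte boundary, which is sizeof(long) on an
LP64 machine:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round size up to the next multiple of align.
 * align must be a nonzero power of two.
 */
static size_t
round_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Padding bytes added to each object by rounding to align. */
static size_t
pad_bytes(size_t size, size_t align)
{
    return round_up(size, align) - size;
}
```

For the hypothetical 100 byte object, pad_bytes(100, 32) is 28 bytes
lost per object, while pad_bytes(100, 8) is 4.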

The main problem is that, in order to do interrupt level
allocations, ziniti() expects to preallocate the page
table mappings (just as the mbuf allocation does), so that
the mappings can be filled from free RAM.  This is also the
reason that running out of free RAM causes mbuf allocations
"to do bad things": you can't overcommit pages that are going
to be assigned at fault-in-interrupt time.
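
The shape of the constraint can be sketched in user space (this is
an analogy under stated assumptions, not zalloci()/ziniti(); the
names pool_init, pool_alloc and pool_free and the sizes are
hypothetical).  All backing store is reserved up front, and an
allocation either takes an object off the free list or fails; it
never sleeps and never tries to grow the pool:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_OBJS 4     /* hypothetical pool capacity */
#define OBJ_BYTES 64    /* hypothetical object size */

/* A slot is either a live object or a link in the free list. */
union slot {
    union slot *next;
    unsigned char bytes[OBJ_BYTES];
};

struct pool {
    union slot slots[POOL_OBJS];    /* reserved once, never grown */
    union slot *free;
};

/* Thread every slot onto the free list. */
static void
pool_init(struct pool *p)
{
    assert(p != NULL);
    p->free = NULL;
    for (size_t i = 0; i < POOL_OBJS; i++) {
        p->slots[i].next = p->free;
        p->free = &p->slots[i];
    }
}

/* Non-blocking allocation: returns NULL once the reservation is used up. */
static void *
pool_alloc(struct pool *p)
{
    union slot *s = p->free;

    if (s == NULL)
        return NULL;
    p->free = s->next;
    return s;
}

/* Return an object to the head of the free list. */
static void
pool_free(struct pool *p, void *obj)
{
    union slot *s = obj;

    s->next = p->free;
    p->free = s;
}
```

Once the fourth object is handed out, pool_alloc() returns NULL
until something is freed; there is no path that goes off and finds
more backing store.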


> 1) Allowing an admin to change the number of MBUF's on the fly
>    (with sysctl).  Presumably these would be infrequent events.

This is pretty much "not a chance in hell"; even though
they are sized such that page size is an even multiple of
mbuf size, the allocator can't really handle the idea of
the zone not being contiguous, since there are other things
that end up not being sized such that page size modulo object
size does not have a remainder (e.g. 192 bytes for a tcpcb).
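
The arithmetic, as a small sketch (assuming a 4096 byte page and a
256 byte mbuf, neither of which is stated above; PAGE_BYTES,
objs_per_page and page_slack are names invented for the example):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u    /* assumed page size */

/* Whole objects of the given size that fit in one page. */
static size_t
objs_per_page(size_t objsize)
{
    assert(objsize > 0 && objsize <= PAGE_BYTES);
    return PAGE_BYTES / objsize;
}

/* Bytes left over at the end of a page: page size modulo object size. */
static size_t
page_slack(size_t objsize)
{
    assert(objsize > 0 && objsize <= PAGE_BYTES);
    return PAGE_BYTES % objsize;
}
```

With these assumptions a 256 byte object packs 16 to a page with no
remainder, while a 192 byte object packs 21 to a page and leaves 64
bytes over, so the next object either wastes those bytes or straddles
the page boundary.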

Thus, you can not get away from the KVA contiguity requirement,
without separating memory into interrupt and non-interrupt
zones on one axis, and high, medium, and low persistence
objects on another axis, and size of object cluster objects
on a third axis.

This gets even more complex when you factor in per-CPU memory
pools for SMP.


> 2) Allowing MBUF's to be allocated/deallocated in fixed size
>    blocks easy for the allocator to deal with.  (Eg, you always
>    have 128k to 4 M of MBUF's allocated in 128k chunks.)

The problem with this is still that the page mappings must
exist, since mbufs are allocated by drivers at interrupt
out of preassigned KVA space.  In a livelock situation, you
will find that you will not be able to go into non-interrupt
space to grab your next 4M KVA space chunk.  Setting arbitrary
power of two size limits is also bad, unless your allocator
is very, very clever.  It's impossible to be that clever with
a fixed size "superallocation" target: you have to think in
terms of page units.

> 3) Allowing MBUF's to be fully dynamically allocated.
> 
> I'm not sure I see any value to #3.  I see huge value to #1
> (when you run low, you can say double the number on an active
> server).  If we get the warning I want (from another message)
> #1 becomes even more useful.

Can't happen, without a complete rework, so that allocations
at interrupt are permissible.  The major problem here is that
you have a finite KVA space, and you can't reuse it without
swapping, and you can't swap to disk in the middle of a network
interrupt.  It's a chicken-and-egg problem.  I'm not aware of
an OS that has solved it (not to mention that your swap may be
NFS mounted).


> #2 would take some study.  The root question is does allocating
> them in blocks eliminate the memory fragmentation concern for
> the kernel allocator?  If the answer is yes, it's probably something
> to look into, if the answer is no, probably not.

Not as it presently exists.  The fragmentation concern is over
the contiguity of the region, not over having fragments lying
around.  Realize that, in the limit, it's possible to defrag
the KVA space, since as long as the data is not in the defrag
code path, we're just talking about objects that are allocated
in the KVA space, which isn't the physical space, and we only
rarely care about physical contiguity.  Doing this causes some
problems, since you will be carrying around physical instead of
virtual addresses for your allocations, and ptov'ing them for
kernel use, instead of vtop'ing them for driver use.  But they
are problems we currently have: e.g. drivers that _do_ care
about physical contiguity, and are unable to allocate physically
contiguous space, could no longer have physical memory defragged
for them to make a large enough contiguous region available --
and we don't defrag at all, today.

It wouldn't take as much study as it would take a hell of a lot
of work.

-- Terry
