Date:      Thu, 21 Jun 2001 01:48:57 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Bosko Milekic <bmilekic@technokratis.com>
Cc:        Terry Lambert <tlambert@primenet.com>, freebsd-alpha@FreeBSD.ORG
Subject:   Re: vx, lge, nge, dc, rl, sf, sis, sk, vr, wb users please TEST
Message-ID:  <3B31B4F9.FA2D43F1@mindspring.com>
References:  <20010619191602.A28591@technokratis.com> <200106200224.TAA24251@usr05.primenet.com> <20010619232624.A29829@technokratis.com> <3B304ADF.C5131399@mindspring.com> <20010620123029.A34452@technokratis.com> <3B30D941.6AE93443@mindspring.com> <20010620135939.A34888@technokratis.com>

Bosko Milekic wrote:
> 
>   Before I go into this [the topic seems to have diverged a
> little]: has anybody gotten around to testing (or does anybody
> have the hardware to do the required testing)?  If some of you
> are stuck because you have the hardware but don't run -CURRENT,
> please let me know -- I could generate an equivalent patch for
> -STABLE with a little work.

It hasn't wandered too far: your motivation was to
make life easier for your allocator.

I patched 4.3-RELEASE and tried it with a borrowed Intel
Gigabit card in an Intel-based box, but then let the
owner of the box reboot to our standard kernel after one
FTP download.  It seemed to work.

Our standard kernel statically allocates mbufs out of
physical RAM using VALLOC() at boot time, so we pay zero
cost for the pool allocation, and O(1) cost for all
allocations and deallocations, with zero wasted bytes,
and are in no danger from the 2-byte underflow on copy.
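
For the curious, the shape of it is roughly this (a from-memory
sketch with made-up names; the real thing carves the pool out of
physical RAM with VALLOC() at boot):

    #include <stddef.h>

    #define NMBUFS  4096        /* sized at boot; fixed thereafter */
    #define MSIZE   256         /* mbuf size on i386 */

    struct mpool {
        char *base;             /* NMBUFS * MSIZE bytes, set up once */
        void *freelist;         /* head of a singly linked free list */
    };

    /*
     * Thread the free list through the buffers themselves, so
     * the list costs no memory beyond the pool itself.
     */
    static void
    mpool_init(struct mpool *mp, char *mem)
    {
        int i;

        mp->base = mem;
        mp->freelist = NULL;
        for (i = 0; i < NMBUFS; i++) {
            *(void **)(mem + i * MSIZE) = mp->freelist;
            mp->freelist = mem + i * MSIZE;
        }
    }

    /* O(1): pop the head; no page allocation can ever happen. */
    static void *
    mpool_alloc(struct mpool *mp)
    {
        void *m = mp->freelist;

        if (m != NULL)
            mp->freelist = *(void **)m;
        return (m);
    }

    /* O(1): push back onto the head; pages are never returned. */
    static void
    mpool_free(struct mpool *mp, void *m)
    {
        *(void **)m = mp->freelist;
        mp->freelist = m;
    }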

I think you now see why I'm a bit more interested in a
generalized version of the allocator, and I think this
change is benign enough that, if Bill Paul approves, it
is OK with me.

--
> > May I offer a suggestion?  The purpose of having the
> > clusters be a fixed size, given a much larger number
> > of mbufs, has never really been clear to me, given that
> > mbufs are allocated to act as cluster headers to make
> > things a tiny bit (but not substantially) easier when
> > it comes to freeing chains, etc.
> 
>         The reason is that clusters are not used everywhere,
> but mbufs are (even when the storage type is not a cluster).

Yes, I know: tcptmpl, ring buffers for ethernet cards,
cluster headers, and mbufs.


> > It seems to me that what you really want to do is
> > allocate _different sizes_ of mbufs, and have the
> > deallocator sort them out on free.
> 
>         Believe me, I've thought about this. Alfred Perlstein
> has been pushing me to do something like this for a while now.
> There are several problems with the suggestion, but it's
> something to consider in the future. I'd rather allow mb_alloc
> to stabilize a little more after committing it, work at
> lockifying net*/, unwind Giant, and then continue to deal with
> issues such as this.

I think that you should change the code to explicitly
relinquish Giant on entry, and reacquire it on exit;
that would be a better test of whether or not the data
structures were sufficiently protected in the absence
of Giant, without needing to debug everything on the
unroll.

You could put this in INVARIANTS or DEBUG_MB_ALLOC, or whatever.
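
Something like the following, say (untested; assumes the -CURRENT
mutex API, and mb_alloc_internal is a stand-in name for whatever
your real entry point is):

    #ifdef DEBUG_MB_ALLOC
    /*
     * Run the allocator outside of Giant on purpose, so any
     * insufficiently protected data structure blows up now,
     * rather than later during the unwind.  The caller is
     * assumed to hold Giant; DROP_GIANT()/PICKUP_GIANT() could
     * be used instead if recursion is a worry.
     */
    struct mbuf *
    mb_alloc_debug(int how)
    {
        struct mbuf *m;

        mtx_unlock(&Giant);          /* relinquish on entry */
        m = mb_alloc_internal(how);
        mtx_lock(&Giant);            /* reacquire on exit */
        return (m);
    }
    #endif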


> > Do you have any interest in generalizing your allocator?
> 
>         Well, not really. See, the allocator is a specialization
> of a general allocator. One of the main design goals of the
> allocator is performance. I tried very hard to give the new
> allocator the advantages required for scalability and the
> "infrastructure" required to reclaim memory while keeping
> approximately the same allocation/deallocation standards of
> the present allocator. One important performance advantage
> of the present allocator relative to say the NetBSD or OpenBSD
> mbuf allocations/deallocations is that
> we *never* have to free pages back to the map from m_free()
> (an expensive procedure). This is precisely why I'd like to
> have freeing eventually implemented from a kproc, when it can
> be handled only when really needed without affecting network
> performance. General purpose allocations should probably be
> handled differently.

Making the mbufs type-stable didn't seem to affect the
performance positively to any significant extent; this
might have just been the lock overhead balancing out, but
I have a hard time believing that it could be that evenly
balanced; the spiking seen in the without/with comparison
occurred on both, so that probably wasn't it.

I also have a minor concern with the deallocation back
to the bucket locking things up; maybe a kproc per CPU
now would let you queue deallocations that were non-local,
and get rid of many of those locks (substituting locks
that only IPI two CPUs instead of all of them to do a queue
content exchange).
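
Sketchily, what I mean is something like this (invented names;
the kproc and IPI plumbing are elided):

    /*
     * One incoming queue per CPU.  A remote free just pushes
     * onto the owner's queue, so only the two CPUs involved
     * ever contend on the lock.
     */
    struct mb_freeq {
        struct mtx   fq_lock;
        struct mbuf *fq_head;
    };

    static struct mb_freeq mb_freeq[MAXCPU];

    static void
    mb_free_remote(struct mbuf *m, int owner)
    {
        struct mb_freeq *fq = &mb_freeq[owner];

        mtx_lock(&fq->fq_lock);
        m->m_next = fq->fq_head;
        fq->fq_head = m;
        mtx_unlock(&fq->fq_lock);
    }

    /* The owning CPU's kproc drains its queue in one exchange. */
    static struct mbuf *
    mb_freeq_drain(int cpu)
    {
        struct mb_freeq *fq = &mb_freeq[cpu];
        struct mbuf *list;

        mtx_lock(&fq->fq_lock);
        list = fq->fq_head;
        fq->fq_head = NULL;
        mtx_unlock(&fq->fq_lock);
        return (list);
    }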

If you have a general version of the allocator, it would be
better to show it now, rather than a cut-down version.  My
guess is that it would not have an effect on John Lemon's
or your benchmarks, one way or the other.

Obviously, I'm a bit biased, given that I don't really
allocate mbufs any more; I only pull them off of, or put
them onto, a freelist.


> > Eventually, you will probably want to do allocations of
> > things other than mbufs.
> 
>         There *is* a general version of an allocator such
> as mb_alloc. In fact, as is mentioned in the introductory
> comments in subr_mbuf.c, mb_alloc is in part based on Alfred's
> "memcache" allocator. Although a little outdated, the source is:
> 
>         http://people.freebsd.org/~alfred/memcache/
> 
>         All it needs is a little bit of cleaning up/fixing up
> and it's ready to fly for general purpose allocations. Keep
> in mind, though, that these types of allocators have a wastage
> factor that becomes significant as the size of the objects
> being allocated approaches (falls to) the size of a pointer.
> The reason is that the free list is implemented with an
> allocated pointer array [it's done this way for very specific
> reasons which i'll keep out of this Email] and if you're
> allocating a page worth of ptr-size objects, you're spending
> a whole other page on the pointers for a freelist covering
> one page's worth of these objects.

I'm aware of Alfred's allocator; the overhead could be
significantly reduced, but, as you say, not a topic for
this thread.
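
To put numbers on it anyway: with 4K pages and 4-byte
(pointer-sized) objects, one page holds 1024 objects, and the
pointer array for the free list then needs 1024 * 4 bytes --
exactly one more page, i.e. 100% overhead.  At mbuf sizes (256
bytes, 16 to a page), the same array costs 64 bytes per page,
well under 2%.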


>         The whole reason I wrote mb_alloc separately from
> memcache altogether is to allow for the different type of
> freeing to occur,

I think it should be a flag in the object description; it
would add a single compare, but the code would be much more
general.
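
I.e., something like this (invented names, just to show the
shape of it):

    struct objdesc {
        size_t  od_size;            /* object size for this type */
        int     od_flags;
    #define OD_KEEP_PAGES   0x01    /* mbuf-style: never free pages */
    };

    static void
    obj_free(struct objdesc *od, void *obj)
    {
        bucket_free(od, obj);       /* hypothetical common path */
        /* The single extra compare: */
        if ((od->od_flags & OD_KEEP_PAGES) == 0)
            maybe_release_pages(od); /* general-purpose types only */
    }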

> and to allow for future very mbuf-specific alloc.
> optimizations to occur,

Since mbufs are 1/16th-page-sized objects (1/32nd on
Alpha), there's not really a lot you have to do about
that to make it work.  There's actually a general trick
you could do using the paging subsystem.
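
(With MSIZE at 256 bytes, that works out to 4096/256 = 16 mbufs
per page on i386, and 8192/256 = 32 per 8K page on the Alpha.)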


> and to allow us to inline the code in the mbuf allocation
> functions,

This is not really a very good reason; a tiny compiler
change would end up doing a much better job for this.

> and ... (some other less worthy reasons) [glancing over
> subr_mbuf.c a few times should make these things obvious].

Yes.


> > Also, the buckets should probably be permitted to be
> > some multiple of the page size, in order to permit the
> > allocation of odd-sized structures, and allow them to
> > span page boundaries, if you went ahead with a more
> > general approach.
> 
>         This is done in memcache, actually.

Actually, that's broken, or what I would consider to be
broken, the same way the zone allocator is...


> > I guess the next thing to think about after that would
> > be allocation at interrupt time.  I think this can be
> > done using the ziniti() approach; but you would then
> > reserve the KVA space for use by the allocator at page
> > fault time, instead.
> 
>         This is for the general allocator, right? Yeah,
> memcache can be made to optionally reserve KVA space for
> interrupt-time allocations; mind you, it would only serve
> as an optimization. As I previously mentioned for mb_alloc,
> the KVA space needed is already reserved.

This is really a pessimization, when it isn't needed.  KVA
space is at an ungodly premium.  That's true of the current
stuff without your code, too... 8-(.


> > I have some other methods of getting around faulting,
> > when you have sufficient backing store.  Now that there
> > are systems that have the capability of having as much
> > RAM as the KVA space (e.g. the KVA space is really no
> > longer sparse), there are a number of optimizations
> > that become pretty obvious.
> 
>         Cool. Do share. Had my trip to Usenix this year not
> been cancelled due to the fact that I'm going to see family
> in Yugoslavia this Saturday, I would have gathered you and
> Alfred together, and listened over a few beers.

Boston and SF are not that close together.  8-).  Basically,
you do evil things with the paging code, which end up being
a bit complicated, but really, really fast.

-- Terry




