Date:      Wed, 19 Jul 2000 16:51:50 -0700
From:      Alfred Perlstein <bright@wintelcom.net>
To:        Bosko Milekic <bmilekic@dsuper.net>
Cc:        net@FreeBSD.ORG
Subject:   Re: kern/19866: The mbuf subsystem refcount stuff.
Message-ID:  <20000719165150.A13979@fw.wintelcom.net>
In-Reply-To: <Pine.BSF.4.21.0007191829400.41534-100000@jehovah.technokratis.com>; from bmilekic@dsuper.net on Wed, Jul 19, 2000 at 07:02:16PM -0400
References:  <20000719092114.S13979@fw.wintelcom.net> <Pine.BSF.4.21.0007191829400.41534-100000@jehovah.technokratis.com>

* Bosko Milekic <bmilekic@dsuper.net> [000719 16:03] wrote:
> 
> On Wed, 19 Jul 2000, Alfred Perlstein wrote:
> 
> > > > when you want to make a copy your code looks something like this:
> > > > 
> > > > /* copying m into n */
> > > > 
> > > > if (m->rp == NULL && n->rp == NULL) {
> > > > 	m->rp = n->rp = mclrefcnt_alloc();
> >         \--------------------------------/ 
> > This is a race condition. --/
> > 
> > There's a chance that two CPU threads may issue an m_copym on an
> > mbuf chain on separate CPUs; if the mbuf chain's refcounts aren't
> > yet allocated and are NULL, there's a race where _both_ CPUs will
> > call this code.
> > 
> > The effect will be an incorrect refcount and the leaking of a
> > refcount structure.  Pretty fatal.
> > 
> > the code will have to change to something like this:
> > 
> >   mclrefcnt_free(n->rp);
> >   n->rp = m->rp;
> >   atomic_inc(m->rp->mextr_refcnt);
> > 
> > which is actually simpler. :)
> 
> 	This seems unnecessary, as far as I can see. If you are "copying m
>   into n" the idea is that "m" is already allocated and, if it is an M_EXT
>   mbuf, it already has a mextrefcnt structure allocated for it, and is
>   pointing to it. How this works is that when you allocate an mbuf and
>   assign it external storage, the first mbuf to be assigned this ext_buf
>   will be the one that will have to perform the locking and get the
>   mextrefcnt allocated and detached from the list of mextrefcnts. Let's
>   say this "first" mbuf is called `m,' then later on you can copy `m' into
>   `n' by giving `n' the same m_ext struct as `m,' and then having
>   n->m_ext->mextr_refcnt be incremented (++), which doesn't need any
>   locking. You should clearly be able to have mbufs that are not at all
>   associated with mextrefcnts, for the simple reason that it would be
>   useless if they don't have external storage associated with them.
> 	If, when you are copying m into n, n already has a reference pointer,
>   then you're in big trouble and what you're doing is incorrect.

Yes, you can look at it like that; I agree that the refcnt structs should
be allocated at MEXTADD time, as you state below.

> 
> > The solution I see is to never have it NULL.. See the comments on
> > the free function.
> 
> 	It's safe to have it NULL when you're not using m_ext (when the mbuf
>   doesn't refer to an ext_buf). I don't see why you would check BOTH m and
>   n's refcnt pointers when you should take for granted that if you are
>   copying something into n, then n surely isn't referencing anything else;
>   if it is, you shouldn't be copying anything into it unless you free what
>   was previously there first.

perhaps a good place to KASSERT? :)

> 
> > 
> > > > } else if (m->rp == NULL) {
> > > > 	m->rp = n->rp;
> > > > } else if (n->rp == NULL) {
> > > > 	n->rp = m->rp;
> > > > } else {
> > > > 	mclrefcnt_free(n->rp);
> > > > 	n->rp = m->rp;
> > > > }
> > > > atomic_inc(m->rp->mextr_refcnt);
> > > > 
> > > > 
> > > > /* freeing m */
> > > > 
> > > >      /* x must get the value I reduced it to */
> > > > x = atomic_dec_and_fetch(m->rp->mextr_refcnt); 
> > > > if (x == 0) {
> > > > 	/* do extfree callback */
> > > > } else {
> > > > 	m->rp = NULL;
> > 
> > This is incorrect, the code should be:
> >     m->rp = mclrefcnt_alloc();
> > 
> > that will ensure that we don't have mbufs where the m->rp isn't
> > initialized to a structure, it also allows us to simplify the copy
> > code above.
> > 
> > the mbufs startup code will have to pre-assign them all refcount
> > structures.
> 
> 	Not all mbufs should have an assigned mextrefcnt, because not all
>   mbufs refer to external storage, so it would be a waste of memory.
> 	When you free an mbuf, you'll also have to be under splimp() most
>   likely (unless we find those atomic queue manip. constructs). The reason
>   is that you'll be placing the mbuf back on the list, which requires
>   blocking against interrupts, and also against other CPUs in the SMP case
>   (unless we decide to split the list in the future). So what you do in
>   this case should be more like this:
>   
>   struct mextrefcnt *our_ref = mbuf_to_be_freed->m_ext->mext_refcnt;
> 
>   int s = splimp();
>   our_ref->counter--;
>   if (our_ref->counter == 0) {
>   	/* call (mbuf_to_be_freed->m_ext->ext_free)(blah, blah) */
>   	/* free mextrefcnt structure (our_ref) to the mextrefcnt freelist */
>   }
>   mbuf_to_be_freed->m_ext->mext_refcnt = NULL;
>   splx(s);
>   /* actually, all of the mbuf's m_ext should be zero'd, but the next
>   allocation should take care of that anyway, so we can skip it here and
>   do it during the allocation of the mbuf. */
>   free_the_mbuf(mbuf_to_be_freed);
> 
>   Clearly, when you are freeing the mbuf, you have to splimp(), but this is
>   normal for now. The idea is though that you _don't_ have to splimp() when
>   you just increment the ref count, unless it's the first incrementation.
>   This is the reason, as I understand it, that your idea makes sense (as
>   opposed to my linked list idea which always needed to be splimp()ed).
> 
>   Naturally, MEXTADD() will also have to be modified to not only setup the
>   mbuf for the ext_buf type (e.g. setup ext_free, and other m_ext fields)
>   but to also splimp() and allocate the mextrefcnt structure and increment
>   it by one. This is expected.

Yes, the problem with the first proposal was that the refcnt allocation
could be delayed until m_copym time; that's no good.

As long as you make it so that when an m_ext mbuf is created it always
has a refcount struct associated with it, you don't need a refcnt struct
with every mbuf: the creator has exclusive access, so you're always OK.
In fact it ought to speed up the code a bit not to have to worry about
freeing/allocating whenever doing shallow copies.
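Sketched the same way (stand-in structures again, and the ext_free hook
here is hypothetical), the free path then becomes an atomic
decrement-and-test, with only the last reference doing the expensive work:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mextrefcnt {
	atomic_uint counter;
};

struct mbuf {
	struct mextrefcnt *rp;
	void (*ext_free)(struct mbuf *);	/* ext_free callback */
};

/* illustration only: count how many times the callback ran */
static int freed_count;
static void
count_free(struct mbuf *m)
{
	(void)m;
	freed_count++;
}

/*
 * Drop one reference to m's external buffer.  Only the thread that
 * takes the count to zero runs the ext_free callback and would return
 * the refcount struct to its freelist; everyone else just decrements.
 */
static void
mext_free_ref(struct mbuf *m)
{
	/* fetch_sub returns the value *before* the decrement */
	if (atomic_fetch_sub(&m->rp->counter, 1) == 1) {
		if (m->ext_free != NULL)
			m->ext_free(m);
		/* here: return m->rp to the mextrefcnt freelist */
	}
	m->rp = NULL;
}
```

The splimp()/splx() section is only needed around the freelist
manipulation itself, not around the decrement.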

> > The race condition is outlined above and I think the solution ought
> > to work.
> > 
> > The refcount list certainly does not need to remain global, there's
> > nothing wrong with per-cpu pools of them and when your own pool is
> > exhausted you can try to lock against another CPU's pool and steal
> > structures.
> 
> 	Well, now that I've thought about it more extensively, this does make
>   sense, but only if you consider that just because an mbuf was originally
>   allocated from CPU#1's freelist, doesn't mean that it necessarily has to
>   be freed to CPU#1's freelist, such that it may end up on any of the CPU's
>   freelists eventually; same goes for the mextrefcnt list elements.

Actually, remember, we are locking against ourselves for interrupts;
there's no reason why a CPU couldn't free an mbuf back into another
CPU's pool.  If we implement decent processor binding (affinity) of
processes then you're most likely to allocate from your own pool and
free to it, keeping your locking to yourself, but you can still lock
against other CPUs to free into their pools. :)
The per-CPU pool is a Dynix thing; freeing back into another CPU's
pool is something pioneered by the people doing Horde.

We'd possibly help our cache by doing this, but it would have to
be profiled heavily to figure it all out.
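Roughly, the pool logic would look like this (single-threaded user-space
sketch with made-up names; the comments mark where the per-CPU locks go
in the kernel):

```c
#include <stddef.h>

#define NCPU	4

struct mextrefcnt {
	unsigned counter;
	struct mextrefcnt *next;	/* freelist linkage */
};

/* One freelist head per CPU. */
static struct mextrefcnt *pool[NCPU];

/*
 * Allocate a refcount struct on CPU `cpu': try the local pool first
 * (protected only against local interrupts in the real kernel), and on
 * exhaustion steal from the other CPUs' pools, which would require
 * taking their locks -- the expensive, rare path.
 */
static struct mextrefcnt *
mclrefcnt_alloc(int cpu)
{
	struct mextrefcnt *r;
	int i, victim;

	/* fast path: local pool, lock against local interrupts only */
	if ((r = pool[cpu]) != NULL) {
		pool[cpu] = r->next;
		return (r);
	}
	/* slow path: steal from another CPU's pool */
	for (i = 1; i < NCPU; i++) {
		victim = (cpu + i) % NCPU;
		/* in the kernel: lock victim's pool here */
		if ((r = pool[victim]) != NULL) {
			pool[victim] = r->next;
			return (r);
		}
	}
	return (NULL);	/* shortage: caller must wait or drop */
}

/* Freeing may target any CPU's pool, not just the allocator's. */
static void
mclrefcnt_free(struct mextrefcnt *r, int cpu)
{
	r->next = pool[cpu];
	pool[cpu] = r;
}
```

Since mclrefcnt_free() takes an arbitrary CPU argument, a structure
allocated on one CPU can end up on any CPU's freelist, which is exactly
the situation described below.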

> > The CPUs will have to lock their own queues against themselves for
> > protection against interrupts, doing so is cheap, when we have a
> > shortage we can take the performance hit to lock against another
> > CPU in order to steal refcount structures.
> 
> 	Agreed. But if we do go with this, we'll REALLY be in trouble when it
>   comes to freeing physical memory, especially if you consider that each
>   CPU's freelists can now contain mbufs from pretty much anywhere. In other
>   words, I can now have 16 mbufs making up a page spread out across 4
>   different CPU's free lists, and I would have to giant lock in order to
>   free them, or even track all of them at that (say I want to move them to
>   a less used list, etc.). Poul-Henning, do you see what I mean, given
>   discussions we had regarding this?

If you want to do this you can free back into other CPUs' pools; the
locking is already established to protect against interrupts, and it
can also serve to lock against other CPUs.

> > And if that code to do queue/dequeue wasn't just a figment of my
> > imagination we may not need to lock at all.
> 
> 	I think Mike blurted out that there are some atomic queue
>   manipulation constructs, although I am personally not familiar with them
>   (yet). If we do find them and they do apply, we will see a dramatic
>   advantage not only in the SMP area, but overall. It almost seems too
>   good to be applicable. :-)

I swear I've seen it on paper somewhere; I'll probably be in my
lab tonight tearing things up looking for the stuff.
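For reference, the kind of construct being described is a lock-free LIFO
freelist built on compare-and-swap.  A user-space sketch with C11 atomics
(deliberately ignoring the ABA problem, which any real kernel version
must handle):

```c
#include <stdatomic.h>
#include <stddef.h>

struct mextrefcnt {
	unsigned counter;
	struct mextrefcnt *next;
};

/* lock-free LIFO freelist head */
static _Atomic(struct mextrefcnt *) freelist;

/* Push with compare-and-swap: no splimp(), no per-CPU lock. */
static void
lf_free(struct mextrefcnt *r)
{
	struct mextrefcnt *head = atomic_load(&freelist);

	do {
		r->next = head;
		/* on failure, head is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(&freelist, &head, r));
}

/* Pop with compare-and-swap (note: real code must avoid ABA). */
static struct mextrefcnt *
lf_alloc(void)
{
	struct mextrefcnt *head = atomic_load(&freelist);

	while (head != NULL &&
	    !atomic_compare_exchange_weak(&freelist, &head, head->next))
		;	/* retry with the reloaded head */
	return (head);
}
```

If constructs like this apply, both the per-CPU pool locks and the
splimp() around the freelist manipulation could go away entirely.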

-Alfred

