Date:      Wed, 19 Jul 2000 09:21:15 -0700
From:      Alfred Perlstein <bright@wintelcom.net>
To:        Bosko Milekic <bmilekic@dsuper.net>
Cc:        net@FreeBSD.ORG
Subject:   Re: kern/19866: The mbuf subsystem refcount stuff.
Message-ID:  <20000719092114.S13979@fw.wintelcom.net>
In-Reply-To: <Pine.BSF.4.21.0007190941260.40059-100000@jehovah.technokratis.com>; from bmilekic@dsuper.net on Wed, Jul 19, 2000 at 09:59:22AM -0400
References:  <20000718184511.O13979@fw.wintelcom.net> <Pine.BSF.4.21.0007190941260.40059-100000@jehovah.technokratis.com>

* Bosko Milekic <bmilekic@dsuper.net> [000719 07:00] wrote:
> 
> On Tue, 18 Jul 2000, Alfred Perlstein wrote:
> 
> > moved over to -net.
> 
> [...]
> 
> > Ok here's an idea that I was tossing around:
> > 
> > you have a freelist of these guys, one for each mbuf header.
> > you can make fewer and deal with failure by sleeping or failing
> > as you want, but for the sake of example assume one is available
> > for each mbuf at any given time.
> > 
> > struct mclrefcnt {
> > 	struct mclrefcnt *mextr_next;  /* XXX: use list macros  :) */
> > 	atomic_t	mextr_refcnt;
> > };
> 
> 	Okay, following our discussion yesterday, I can understand why your
>   suggestion is valid, especially given the brief SMP outline and the
>   future changes that are to come.
>   	There is something I recently thought of, though, after making a
>   brief start at modifying and implementing this: the
>   mclrefcnt/mextrefcnt/whatever we want to call it is on a singly linked
>   list to which additions and removals are made via the head. Therefore,
>   one will have to splimp(), or lock multiple CPUs in the SMP case, when
>   tampering with this list, and this would be during certain reference
>   increments. It's not as bad as the linked list case, which has to do
>   this on every single additional reference, because here it is only
>   needed on the first reference call, which will probably come from the
>   allocation macro and thus already be under splimp() (or the SMP
>   equivalent multiple-CPU lock) anyway. Correct?

Yes, the freelist of refcount structures must be protected.  At a
later point we may be able to move them to per-processor buckets
and, for the most part, only lock against ourselves (interrupts).
Although I could have sworn I'd seen a way to manage a singly
linked queue without locks in one of the texts I own.
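
For what it's worth, here is a rough sketch of the kind of lock-free
LIFO I have in mind, assuming a compare-and-swap primitive along the
lines of atomic_cmpset_ptr(); the helper names here are illustrative,
not committed code:

/*
 * Sketch: a lock-free LIFO free list built on compare-and-swap.
 * atomic_cmpset_ptr(dst, old, new) is assumed to atomically do
 * "if (*dst == old) { *dst = new; return 1; } return 0;".
 * Beware the classic ABA hazard on the pop side: if the head node
 * is popped and pushed back between our load and our CAS, the CAS
 * can succeed with a stale next pointer; a generation count kept
 * next to the head pointer is the usual cure.
 */
static struct mclrefcnt *mextr_freelist;

static void
mclrefcnt_push(struct mclrefcnt *r)
{
	struct mclrefcnt *head;

	do {
		head = mextr_freelist;
		r->mextr_next = head;
	} while (!atomic_cmpset_ptr(&mextr_freelist, head, r));
}

static struct mclrefcnt *
mclrefcnt_pop(void)
{
	struct mclrefcnt *head;

	do {
		head = mextr_freelist;
		if (head == NULL)
			return (NULL);		/* caller sleeps or fails */
	} while (!atomic_cmpset_ptr(&mextr_freelist, head,
	    head->mextr_next));
	return (head);
}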

> 	Also, I'm curious at this point as to how some of this stuff is
>   done in BSD/OS, but I don't have access/can't see the source. Any
>   pointers/explanations/brief outlines? I'm mainly curious about the SMP
>   stuff vs. mbuf pools, whether they are per CPU, etc. Alfred, now that I
>   think about it, I would eventually like to try the per-CPU list idea, to
>   see how well it scales on an SMP machine under relatively heavy network
>   load, and I'll probably want to do this before I finalize the outline
>   for freeing the resources (pages) used by mbufs.

From what I can see they do it our way for the most part.  The only
difference is that they have a single function pointer for both
ref and deref: when you're increasing the refcount you get two
pointers to mbufs, when decreasing you get one and the other is
NULL.  It's called on _every_ ref, just like what we have now.
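
I can't quote their code, so this is just my guess at the shape of
the interface, with every name made up:

struct mbuf;

/*
 * One callback serves both directions and runs on every reference
 * change: two mbuf pointers on ref, one plus NULL on deref.
 */
typedef void (*mext_ref_fn)(struct mbuf *to, struct mbuf *from);

static void
mext_refdrop(struct mbuf *to, struct mbuf *from)
{
	if (from != NULL) {
		/* ref: "to" is gaining a reference to "from"'s data */
	} else {
		/* deref: "to" is giving up its reference */
	}
}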

> > and a pointer to one in each mbuf.
> > 
> > when you want to make a copy your code looks something like this:
> > 
> > /* copying m into n */
> > 
> > if (m->rp == NULL && n->rp == NULL) {
> > 	m->rp = n->rp = mclrefcnt_alloc();
        \--------------------------------/ 
This is a race condition. --/

There's a chance that two threads may issue an m_copym on the same
mbuf chain on separate CPUs.  If the chain's refcounts aren't yet
allocated and are NULL, there's a race where _both_ CPUs will run
this code.

The effect will be an incorrect refcount and the leaking of a
refcount structure.  Pretty fatal.
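
Concretely, with CPU0 copying m into n0 and CPU1 copying m into n1,
the bad interleaving looks like:

	CPU0				CPU1
	----				----
	sees m->rp == NULL
					sees m->rp == NULL
	m->rp = n0->rp = mclrefcnt_alloc();
					m->rp = n1->rp = mclrefcnt_alloc();

CPU1's assignment clobbers CPU0's, so n0 is left holding a structure
that the later atomic increments never see: one structure leaks and
the counts for the cluster end up split and wrong.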

The code will have to change to something like this:

  mclrefcnt_free(n->rp);
  n->rp = m->rp;
  atomic_inc(&m->rp->mextr_refcnt);

which is actually simpler. :)

The solution I see is to never have it NULL.  See the comments on
the free code below.

> > } else if (m->rp == NULL) {
> > 	m->rp = n->rp;
> > } else if (n->rp == NULL) {
> > 	n->rp = m->rp;
> > } else {
> > 	mclrefcnt_free(n->rp);
> > 	n->rp = m->rp;
> > }
> > atomic_inc(&m->rp->mextr_refcnt);
> > 
> > 
> > /* freeing m */
> > 
> >      /* x must get the value I reduced it to */
> > x = atomic_dec_and_fetch(&m->rp->mextr_refcnt);
> > if (x == 0) {
> > 	/* do extfree callback */
> > } else {
> > 	m->rp = NULL;

This is incorrect; the code should be:
    m->rp = mclrefcnt_alloc();

That will ensure that we never have mbufs where m->rp isn't
initialized to a structure, and it also lets us simplify the copy
code above.

The mbuf startup code will have to pre-assign them all refcount
structures.
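
Putting the pieces together, the free path would look roughly like
this; atomic_dec_and_fetch() stays a placeholder, and I'm assuming
mclrefcnt_alloc() hands back structures with the count preset to 1:

/* at mbuf startup, every mbuf gets its own refcount structure */
for (i = 0; i < nmbufs; i++)
	mbufs[i].rp = mclrefcnt_alloc();

/* freeing m */
if (atomic_dec_and_fetch(&m->rp->mextr_refcnt) == 0) {
	/* last reference: run the extfree callback on the cluster */
	/* ... ext_free(buf, size) ... */
	m->rp->mextr_refcnt = 1;	/* recycle the structure in place */
} else {
	/* others still hold m->rp; take a fresh one so it's never NULL */
	m->rp = mclrefcnt_alloc();
}
/* free mbuf header */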

> > }
> > /* free mbuf header */
> > 
> > Are there problems you see with that?
> 
> 	Not by just glancing at it, although the mbuf-copying code above
>   can be more concise: you know you're copying, say, m->m_ext into
>   n->m_ext and incrementing the ref count, so it's not really that tough,
>   as far as I can see right now. To "copy m into n" you just do something
>   like:
>   	n->m_ext = m->m_ext;
> 	n->m_ext->mextrefcnt++;
> 
>   ...and you're on your way. As we discussed, you don't have to worry
>   about something else freeing a reference and consequently freeing the
>   mextrefcnt node and ext_buf, because you know that here you have control
>   of both m and n, both of which refer to the same ext_buf, so the refcnt
>   cannot drop below 1 while you atomically increment.
> 
> 	You mentioned a race condition in the free code above, yesterday,
>   but I don't exactly recall what it was. If the refcnt reaches zero
>   above, though, you'll have to free the mextrefcnt/mclrefcnt/whatever
>   structure too, at least back to its respective free list. Note that if
>   in the SMP case we decide to split up the free lists, this is one list
>   that will have to remain global, as mbufs allocated from different
>   CPUs' free lists may refer to the same ext_buf's.

The race condition is outlined above, and I think the solution ought
to work.

The refcount list certainly does not need to remain global.  There's
nothing wrong with per-CPU pools of them: when your own pool is
exhausted you can try to lock against another CPU's pool and steal
structures.

The CPUs will have to lock their own queues against themselves for
protection against interrupts, which is cheap; when we have a
shortage we can take the performance hit of locking against another
CPU in order to steal refcount structures.
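
A rough sketch of the allocation side, where simple_lock(), cpuid(),
and NCPU are stand-ins for whatever primitives we end up with:

struct mclrefcnt_pool {
	struct simplelock	pool_lock;
	struct mclrefcnt	*pool_head;
};

static struct mclrefcnt_pool mclrefcnt_pools[NCPU];

static struct mclrefcnt *
mclrefcnt_alloc(void)
{
	struct mclrefcnt_pool *pp;
	struct mclrefcnt *r;
	int i, s;

	s = splimp();	/* don't fight our own interrupt handlers */

	/* fast path: our own pool, cheap lock */
	pp = &mclrefcnt_pools[cpuid()];
	simple_lock(&pp->pool_lock);
	if ((r = pp->pool_head) != NULL)
		pp->pool_head = r->mextr_next;
	simple_unlock(&pp->pool_lock);

	/* slow path: steal from the other CPUs' pools */
	for (i = 0; r == NULL && i < NCPU; i++) {
		pp = &mclrefcnt_pools[i];
		simple_lock(&pp->pool_lock);
		if ((r = pp->pool_head) != NULL)
			pp->pool_head = r->mextr_next;
		simple_unlock(&pp->pool_lock);
	}

	splx(s);
	return (r);	/* NULL here: the caller sleeps or fails */
}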

And if that lock-free queue/dequeue code wasn't just a figment of my
imagination, we may not need to lock at all.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."





