Date:      Tue, 2 Jul 2002 08:56:40 -0700
From:      Luigi Rizzo <rizzo@icir.org>
To:        Bosko Milekic <bmilekic@unixdaemons.com>
Cc:        net@FreeBSD.ORG
Subject:   Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)
Message-ID:  <20020702085640.B5854@iguana.icir.org>
In-Reply-To: <20020702090354.A78632@unixdaemons.com>; from bmilekic@unixdaemons.com on Tue, Jul 02, 2002 at 09:03:54AM -0400
References:  <20020702001050.B2250@iguana.icir.org> <20020702080423.A69693@unixdaemons.com> <20020702090354.A78632@unixdaemons.com>


Hi,
you can read the numbers I posted as you like, but in the non-SMP
case performance is not "seemingly" better, it IS better.

Before people think differently:

    the code I posted is _perfectly safe_ to use under the assumption
    that I stated -- i.e. that you hold Giant when manipulating the
    free list -- on both non-SMP systems (where I tested it, and
    where it does improve performance) and SMP systems (where I have
    no idea about performance, but I might have some numbers by
    tomorrow).

The above assumption holds in the device drivers I patched, and
also (at the time of this writing, and I suspect for quite a while
yet, although I and hopefully others are working on this) throughout
the network stack.
And of course you can #ifdef the code out for the SMP case if you
don't trust it there, so it won't pessimize SMP at all.
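
To make the idea concrete, here is a minimal sketch of what I have
in mind (not the exact patch I posted; mcl_pool, mcl_pool_count and
max_pool_size are placeholder names), assuming the caller of both
functions holds Giant:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

static struct mbuf *mcl_pool;           /* spare mbuf+cluster pairs */
static int mcl_pool_count;              /* how many we hold apart */
static int max_pool_size = 50;          /* ~50 blocks, see point #1 below */

/* Get an mbuf with a cluster attached; caller holds Giant. */
struct mbuf *
m_getcl_pool(void)
{
        struct mbuf *m;

        if ((m = mcl_pool) != NULL) {           /* fast path: reuse a spare */
                mcl_pool = m->m_nextpkt;
                m->m_nextpkt = NULL;
                mcl_pool_count--;
                return (m);
        }
        MGETHDR(m, M_DONTWAIT, MT_DATA);        /* slow path: normal allocation */
        if (m == NULL)
                return (NULL);
        MCLGET(m, M_DONTWAIT);
        if ((m->m_flags & M_EXT) == 0) {
                m_free(m);
                return (NULL);
        }
        return (m);
}

/* Free an mbuf+cluster pair; caller holds Giant. */
void
m_freecl_pool(struct mbuf *m)
{
        if (mcl_pool_count < max_pool_size) {   /* keep it for reuse */
                m->m_nextpkt = mcl_pool;
                mcl_pool = m;
                mcl_pool_count++;
        } else
                m_freem(m);                     /* pool is full, really free */
}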

Also:

    I am not violating _any_ interface.
    I am just holding apart a small set of buffers for future use,
    which is exactly what each and every network device driver in
    FreeBSD does when it links buffers into its receive ring -- no
    guarantee when/if they will be freed, and certainly none of them
    is freed even if the system runs out of mbufs (unless the network
    interface is brought down, which I doubt ever happens) -- see the
    sketch below.
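
For comparison, this is roughly what the receive-ring refill of any
FreeBSD NIC driver already looks like (struct xx_softc, rx_ring and
XX_NDESC are made-up names, the pattern is not):

#include <sys/param.h>
#include <sys/mbuf.h>

#define XX_NDESC        256             /* ring size, made up */

struct xx_rxdesc {
        struct mbuf *mbuf;              /* buffer parked in this slot */
        /* ... bus addresses etc. ... */
};

struct xx_softc {
        struct xx_rxdesc rx_ring[XX_NDESC];
        /* ... */
};

static void
xx_rxrefill(struct xx_softc *sc)
{
        struct mbuf *m;
        int i;

        for (i = 0; i < XX_NDESC; i++) {
                if (sc->rx_ring[i].mbuf != NULL)        /* slot already filled */
                        continue;
                MGETHDR(m, M_DONTWAIT, MT_DATA);
                if (m == NULL)
                        break;
                MCLGET(m, M_DONTWAIT);
                if ((m->m_flags & M_EXT) == 0) {
                        m_freem(m);
                        break;
                }
                /* held apart indefinitely, until a packet lands in it */
                sc->rx_ring[i].mbuf = m;
                /* ... program the descriptor with the cluster's address ... */
        }
}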

This said, a detailed answer to your points:

#1  I think your comment on the sleepers and mbuf exhaustion is
    not correct.
    I am only talking about holding apart a small set of buffers+clusters,
    on the order of ~50 blocks, and you can restrict the consumers
    of the free pool to those who do not want to sleep, so there
    is no need to special-case the code for the sleepers.
    Or, if you really want to special-case it, the first sleeper
    which finds the free pool empty and needs to go to sleep can
    reset max_pool_size to 0, forcing the same behaviour we have now,
    with correct handling of wakeups and such (see the sketch below).
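
    Continuing the first sketch above, that special case could look
    more or less like this (M_TRYWAIT is the sleeping flag on
    -CURRENT, M_WAIT on -STABLE):

struct mbuf *
m_getcl_pool_wait(void)
{
        struct mbuf *m;

        if ((m = m_getcl_pool()) != NULL)       /* try the pool first */
                return (m);
        max_pool_size = 0;      /* pool empty: stop caching, so frees
                                 * drain to the normal allocator and
                                 * the usual wakeups keep working */
        MGETHDR(m, M_TRYWAIT, MT_DATA);
        if (m == NULL)
                return (NULL);
        MCLGET(m, M_TRYWAIT);
        if ((m->m_flags & M_EXT) == 0) {
                m_free(m);
                return (NULL);
        }
        return (m);
}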

#2  given that all I am asking for is defining an interface for this
    type of optimization, when the time comes that Giant goes (and
    assuming I am not retired yet), it will still change nothing for
    the non-SMP case, and for the SMP case we can run our performance
    tests and, if it proves to be useful, put in whatever is
    necessary.

#3  easy to fix (assuming this is really a problem) -- just have the
    free pool organized as a list of pointers to the mbufs, rather than
    using the m_nextpkt field as the link field. Because the free
    list is short and fixed-size you can even use an array to
    hold it (see the sketch below).
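
    For instance (same caveat about placeholder names), the pool from
    the first sketch could just as well be a small array:

#define MCL_POOL_SIZE   50

static struct mbuf *mcl_pool_arr[MCL_POOL_SIZE];
static int mcl_pool_top;                /* entries currently in the array */

/* The freeing side never writes into the mbuf itself. */
static __inline int
mcl_pool_put(struct mbuf *m)
{
        if (mcl_pool_top < MCL_POOL_SIZE) {
                mcl_pool_arr[mcl_pool_top++] = m;
                return (1);             /* kept for reuse */
        }
        return (0);                     /* pool full, caller does m_freem(m) */
}

static __inline struct mbuf *
mcl_pool_get(void)
{
        return (mcl_pool_top > 0 ? mcl_pool_arr[--mcl_pool_top] : NULL);
}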

#4  same answer as for #2: if/when... we can easily fix things in just
    one place. Until then, we are just pessimizing things for no
    good reason.

In terms of API, since we both agree that we need one (and we do
need one, otherwise driver writers will just go ahead and do their
own, replicating work in each driver, as I am sure I^Hthey will
start doing now), may I suggest that we export two -- one where the
caller is assumed to already hold a lock on the CPU it is running
on, and one which also does the necessary locking upfront and then
calls the former ? This way the former will be usable right now,
and efficiently, from all contexts which run under Giant, and will
be easy to upgrade when/if we need it (a sketch follows).
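
Roughly (placeholder names again; the allocator from the first sketch
would become the _locked flavour, and today the lock is simply Giant):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>

/* Flavour that assumes the caller already holds the right lock. */
struct mbuf *m_getcl_pool_locked(void);

/* Flavour that takes the lock itself and then calls the former. */
struct mbuf *
m_getcl_pool(void)
{
        struct mbuf *m;

        mtx_lock(&Giant);       /* later: a per-CPU or per-pool mutex */
        m = m_getcl_pool_locked();
        mtx_unlock(&Giant);
        return (m);
}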

And, I am sorry to say... you raise some valid points, but without
performance numbers to justify them, it is hard to tell how relevant
they are in practice.

I have made too many mistakes of this kind myself to trust my
own intuition.

	cheers
	luigi

On Tue, Jul 02, 2002 at 09:03:54AM -0400, Bosko Milekic wrote:

> >  There are several problems with your "simple" code that make it simpler
> >  than what's in -CURRENT and in -STABLE and therefore yield [obviously]
> >  seemingly better times.
> > 
> >  1) sleeping in the case of a system exhausted of mbufs and clusters is
> >  now broken because code that frees to this "combined" list will not
> >  generate wake ups;  even if you build this functionality in, you would
> >  have to code the wakeup for a "special" case where one obtains both a
> >  cluster and an mbuf; it'll get complicated.
> > 
> >  2) no per-CPU caches, you bring contention (in -CURRENT) from per-CPU
> >  back to global, which you probably don't care about now, but which will
> >  matter when Giant goes completely.
> > 
> >  3) you violate write/read producer/consumer model.  In -CURRENT, we try
> >  to not have the code doing the freeing write to any part of the mbuf,
> >  thus hopefully avoiding cache invalidation from occurring during a free
> >  that happens from a thread that doesn't usually write to the mbuf at
> >  all.
> 
>    Forgot, also, in -CURRENT:
> 
>    4) breaks framework in mb_alloc that groups pages of mbufs and
>    clusters together allowing them to be freed in the future, if/when we
>    decide to implement lazy freeing;  it would do the same thing if we
>    moved mbuf allocations to UMA, because grouping together mbufs and
>    clusters on a separate list will violate that scheme.
> 
> >  I think that because of (1) and (2), especially, and because the kernel
> >  in -CURRENT is the way it is right now, you're getting better
> >  performance numbers.  Try exhausting the system and seeing how far you
> >  get.  I suspect you'll eventually just hang (because you'll allocate a
> >  bunch of stuff violating the real interface and then won't be able to
> >  wakeup and Do The Right Thing).
> > 
> >  While, as I said before, I agree that we should have an interface that
> >  allocates both an mbuf and a cluster, I don't think that this is the
> >  right approach.  I believe that the correct approach would be to couple
> >  the mbuf and cluster allocation under a single per-CPU lock/unlock
> >  operation, which isn't too difficult to do.
