From owner-freebsd-current Mon Nov 25 13: 2:44 2002 Delivered-To: freebsd-current@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9354237B401; Mon, 25 Nov 2002 13:02:41 -0800 (PST) Received: from tesla.distributel.net (nat.MTL.distributel.NET [66.38.181.24]) by mx1.FreeBSD.org (Postfix) with ESMTP id AE45543EA9; Mon, 25 Nov 2002 13:02:40 -0800 (PST) (envelope-from bmilekic@unixdaemons.com) Received: (from bmilekic@localhost) by tesla.distributel.net (8.11.6/8.11.6) id gAPL1Ms75728; Mon, 25 Nov 2002 16:01:22 -0500 (EST) (envelope-from bmilekic@unixdaemons.com) Date: Mon, 25 Nov 2002 16:01:22 -0500 From: Bosko Milekic To: Andrew Gallatin Cc: Julian Elischer , Robert Watson , Luigi Rizzo , current@freebsd.org Subject: Re: mbuf header bloat ? Message-ID: <20021125160122.A75673@unixdaemons.com> References: <15840.8629.324788.887872@grasshopper.cs.duke.edu> <15841.17237.826666.653505@grasshopper.cs.duke.edu> <20021125130005.A75177@unixdaemons.com> <15842.27547.385354.151541@grasshopper.cs.duke.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <15842.27547.385354.151541@grasshopper.cs.duke.edu>; from gallatin@cs.duke.edu on Mon, Nov 25, 2002 at 01:27:39PM -0500 Sender: owner-freebsd-current@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, Nov 25, 2002 at 01:27:39PM -0500, Andrew Gallatin wrote: > Bosko Milekic writes: > > On Sun, Nov 24, 2002 at 04:23:33PM -0500, Andrew Gallatin wrote: > > > If we're going to nitpick the mbuf system, a much, much worse problem > > > is that you cannot allocate an mbuf chain w/o holding Giant, which > > > stems from the mbuf system eventually calling kmem_malloc(). This > > > effectively prevents any network driver from being giant-free. When > > > mbufs are low, mb_alloc() calls mb_pop_cont(). This, in turn, calls > > > kmem_malloc() which requires Giant... > > > > This is not entirely true. You can allocate an mbuf chain without > > holding Giant if the caches are well populated - and they should be > > in the common/general case. You can in fact modify the allocator to > > just not do a kmem_malloc() if called with M_DONTWAIT, but I don't > > think you'd want to do this at this point. > > Unfortunately, the common case is not good enough when figuring out > locking for network drivers and the uncommon case forces all network > drivers under Giant. I was thinking about doing what you describe, > but my misconception about ref counters being malloced was holding me Firstly, it should be noted that the behavior of calling kmem_malloc() when its caches are empty is an old property that has been carried over from the original allocator - in other words, it is not something that I arbitrarily introduced. With that said, we may want to think about changing it or at least introducing a third case that would modify the semantics so that Giant would never be called. For instance, in addition to M_TRYWAIT and M_DONTWAIT, you could invent M_CACHEONLY that would only attempt a quick cache allocation. However, I'm unsure as to the advantages this would bring - if any. Why is it so much of a problem if your driver wants to grab Giant? Are you afraid of lock order reversals somewhere down the line? Perhaps there is a misconception that we can clear up here before any work is done. > back. We'll also need a kproc that can wake up every now and then to > expand the pool if allocations at interrupt time failed. Or do you > already have a mechanism for that? The intended mechanism is the kproc and when the allocator was first designed and written, I had taken that into account although I have not implemented the kproc yet. The idea was to have a kproc maintain a balance of the caches without interfering with the general allocation cases. The way this would be accomplished is actually fairly elegant, although we would have to be careful when doing the implementation: (i) Have the kproc make sure that if we're under a low watermark, do a block allocation from the mbuf and/or cluster maps and populate the general global cache (thereby not interfering with simultaneous frees to the per-CPU caches). (ii) Have the kproc check if we're above a high watermark on the general global cache and if so free some pages back to VM without interfering with simultaneous allocations on the per-CPU caches. (iii) Have the kproc do some bookkeeping and auto-tune its high and low watermarks. The key words in all three goals above is "not interfering with simultaneous allocations/frees." This is a big advantage and is one of the reasons that the mbuf allocator exists. What it means is that you can muck with the VM and indirectly populate the mbuf and cluster caches without interfering with the common {mbuf,cluster} free and allocation case. Graphically: [ VM ] <--> [ global cache ] <--> [ per-CPU caches ] ^ ^ | | [ only ] [ common case allocations and ] [ done by ] [ frees ] [ kproc (to ] [ be written) ] [ and when ] [ certain types of ] [ allocations fail ] (I hope that came out OK). As you can see, the concept is very simple and with the current infrastructure should be fairly easy to implement. If you're wondering, I was going to do is as a 5.1 feature at some point because I've been so swamped with other things right now and so I do not foresee being able to do this in time for 5.0. > > > The mbuf system calls malloc in other ways too. The first time you > > > use a cluster, m_ext.ext_ref_cnt is malloc()'ed, and malloc is called > > > when the mbuf map is expanded... I assume malloc will eventually > > > call kmem_malloc(), leading to the same locking problems. > > > > Actually, that has been changed a while ago. The ref_cnt for clusters > > is no longer malloc()'d. A contiguous space is used from which > > cluster ref counters are "allocated" (just like in the old/original > > design). This modification was made a while ago as an optimisation. > > Cool. The following code confused me into making the above statement. Is it > out of date? > > #define _mext_init_ref(m, ref) do { \ > (m)->m_ext.ref_cnt = ((ref) == NULL) ? \ > malloc(sizeof(u_int), M_MBUF, M_NOWAIT) : (u_int *)(ref); \ > if ((m)->m_ext.ref_cnt != NULL) { \ > *((m)->m_ext.ref_cnt) = 0; \ > MEXT_ADD_REF((m)); \ > } \ > } while (0) It is not out of date. The code means: "If you've given me a counter then I'll use it otherwise I'll try to allocate one with malloc()." And the cluster allocation code will provide a counter by indexing into a linear contiguous page array (note that this is also why clusters are required to come from a seperate contiguous address map - otherwise this wouldn't work). > Thanks, > > Drew -- Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-current" in the body of the message