From owner-freebsd-net  Fri Jul 12 11: 3:54 2002
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 6ABF237B400
	for <net@freebsd.org>; Fri, 12 Jul 2002 11:03:51 -0700 (PDT)
Received: from wall.polstra.com (wall-gw.polstra.com [206.213.73.130])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1477D43E4A
	for <net@freebsd.org>; Fri, 12 Jul 2002 11:03:50 -0700 (PDT)
	(envelope-from jdp@polstra.com)
Received: from vashon.polstra.com (vashon.polstra.com [206.213.73.13])
	by wall.polstra.com (8.11.3/8.11.3) with ESMTP id g6CI3kT24919;
	Fri, 12 Jul 2002 11:03:46 -0700 (PDT)
	(envelope-from jdp@vashon.polstra.com)
Received: (from jdp@localhost)
	by vashon.polstra.com (8.12.4/8.12.4/Submit) id g6CI3je9008944;
	Fri, 12 Jul 2002 11:03:45 -0700 (PDT)
	(envelope-from jdp)
Date: Fri, 12 Jul 2002 11:03:45 -0700 (PDT)
Message-Id: <200207121803.g6CI3je9008944@vashon.polstra.com>
To: net@freebsd.org
From: John Polstra <jdp@polstra.com>
Cc: bmilekic@unixdaemons.com
Subject: Re: mbuf external buffer reference counters
In-Reply-To: <20020711162026.A18717@unixdaemons.com>
References: <20020711162026.A18717@unixdaemons.com>
Organization: Polstra & Co., Seattle, WA
Sender: owner-freebsd-net@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-net.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-net>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-net>
X-Loop: FreeBSD.org

In article <20020711162026.A18717@unixdaemons.com>,
Bosko Milekic  <bmilekic@unixdaemons.com> wrote:
> 
>   Right now, in -CURRENT, there is this hack that I introduced that
>   basically just allocates a ref. counter for external buffers attached
>   to mbufs with malloc(9).  What this means is that if you do something
>   like allocate an mbuf and then a cluster, there's a malloc() call that
>   is made to allocate a small (usually 4-byte) reference counter for it.
> 
>   That sucks,

Eeek, it sure does!

>   and even -STABLE doesn't do this. I changed it this way
>   a long time ago for simplicity's sake and since then I've been meaning
>   to do something better here.  The idea was, for mbuf CLUSTERS, to
>   stash the counter at the end of the 2K buffer area, and to make
>   MCLBYTES = 2048 - sizeof(refcount), which should be more than enough,
>   theoretically, for all cluster users.  This is by far the easiest
>   solution (I had it implemented about 10 months ago) and it worked
>   great.
> 
>   The purpose of this Email is to find out if anyone has concrete
>   information on why this wouldn't work (if they think it wouldn't).

I've been out of town and I realize I'm coming into this thread late
and that it has evolved a bit.  But I still think it's worthwhile to
point out a very big problem with the idea of putting the reference
count at the end of each mbuf cluster.  It would have disastrous
consequences for performance because of cache effects.  Bear with me
through a little bit of arithmetic.

Consider a typical PIII CPU that has a 256 kbyte 4-way set-associative
L2 cache with 32-byte cache lines.  4-way means that any given address
can be cached in one of 4 different cache lines.  Each group of 4
lines is called a set, and each set covers 32 bytes of the address
space (the cache line size).

The total number of sets is:

    256 kbytes / 32 bytes per line / 4 lines per set = 2048 sets

and as mentioned above, each set covers 32 bytes.

The cache wraps around every 256 kbytes / 4-way = 64 kbytes of address
space.  In other words, if address N maps onto a given set, then
addresses N + 64k, N + 128k, etc. all map onto the same set.
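
For anyone who wants to check the arithmetic, here is a minimal
sketch.  It assumes the simplest possible mapping (set index = line
number modulo the number of sets); the constant and function names are
mine, purely for illustration, and real hardware may differ in detail.

```python
# Cache geometry taken from the example above (PIII-style L2 cache).
CACHE_BYTES = 256 * 1024   # total cache size
LINE_BYTES = 32            # bytes per cache line
WAYS = 4                   # lines per set (4-way set-associative)

NUM_SETS = CACHE_BYTES // LINE_BYTES // WAYS   # number of sets
WRAP_BYTES = CACHE_BYTES // WAYS               # address span before the sets repeat


def cache_set(addr):
    """Return the set index for an address, assuming a plain modulo mapping."""
    return (addr // LINE_BYTES) % NUM_SETS


if __name__ == "__main__":
    n = 0x12345  # arbitrary example address
    print(NUM_SETS, WRAP_BYTES)
    # Addresses one and two wrap distances away land in the same set as n.
    print(cache_set(n), cache_set(n + WRAP_BYTES), cache_set(n + 2 * WRAP_BYTES))
```

With these numbers NUM_SETS comes out to 2048 and WRAP_BYTES to 65536,
matching the figures above.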

An mbuf cluster is 2 kbytes and all mbuf clusters are aligned on
2-kbyte boundaries.  So the wrap around of the cache occurs every
64 kbytes / 2 kbytes per cluster = 32 clusters.  Because each
reference count would sit at the same offset within its cluster, the
counts for cluster k and cluster k + 32 would land in the same set.
To put it another way, all of the reference counts would be sharing
(i.e., competing for) the same 32 cache sets and they would never
utilize the remaining 2016 sets at all.  Only 1.56% of the cache
(32 sets / 2048 sets) would be usable for the reference counts.  This
means there would be a lot of cache misses as reference count updates
caused other reference counts to be flushed from the cache.
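
The effect can be sketched with a few lines of code.  This is only an
illustration under stated assumptions: a 4-byte counter stored in the
last bytes of each cluster, a hypothetical pool of 1024 clusters
starting at address 0, and a plain modulo set mapping.  None of the
names below come from the kernel sources.

```python
from collections import Counter

LINE_BYTES = 32        # bytes per cache line
NUM_SETS = 2048        # 256 kbytes / 32 bytes per line / 4 lines per set
CLUSTER_BYTES = 2048   # one mbuf cluster, assumed 2-kbyte aligned
REFCNT_BYTES = 4       # assumed size of one reference counter
NUM_CLUSTERS = 1024    # hypothetical pool size (2 Mbytes of clusters)


def cache_set(addr):
    """Set index for an address, assuming a plain modulo mapping."""
    return (addr // LINE_BYTES) % NUM_SETS


def refcount_addr(k):
    """Address of the counter kept at the end of cluster k (pool starts at 0)."""
    return k * CLUSTER_BYTES + CLUSTER_BYTES - REFCNT_BYTES


# Count how many counters fall into each cache set.
per_set = Counter(cache_set(refcount_addr(k)) for k in range(NUM_CLUSTERS))
sets_used = len(per_set)

if __name__ == "__main__":
    print("sets used:", sets_used)
    print("sets never used:", NUM_SETS - sets_used)
    print("percent of cache usable:", 100.0 * sets_used / NUM_SETS)
    print("counters competing per set:", max(per_set.values()))
```

Under these assumptions 32 sets are used and 2016 are never touched,
and each used set has 32 counters fighting over its 4 lines.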

These cache effects are huge, and they are growing all the time as CPU
speeds increase while RAM speeds remain relatively constant.

It is much better to have the reference counts laid out as they are
in -stable, i.e., one big contiguous block of counts.  That way, the
counts are spread out through the entire cache and they don't compete
with each other nearly so much.  That is the underlying principle of
slab allocators, by the way.
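
For comparison, here is the same kind of sketch for a contiguous block
of counts.  Again, this is only an illustration: it assumes 4-byte
counters packed back to back starting at a line-aligned base address
and a plain modulo set mapping, and the helper name is hypothetical.

```python
LINE_BYTES = 32     # bytes per cache line
NUM_SETS = 2048     # 256 kbytes / 32 bytes per line / 4 lines per set
REFCNT_BYTES = 4    # assumed size of one reference counter


def cache_set(addr):
    """Set index for an address, assuming a plain modulo mapping."""
    return (addr // LINE_BYTES) % NUM_SETS


def sets_touched_contiguous(num_counts, base=0):
    """Number of distinct sets covered by num_counts back-to-back counters."""
    return len({cache_set(base + i * REFCNT_BYTES) for i in range(num_counts)})


if __name__ == "__main__":
    # 1024 counters occupy 4 kbytes, i.e. 128 lines in 128 different sets.
    print(sets_touched_contiguous(1024))
    # 16384 counters occupy 64 kbytes and reach every set in the cache.
    print(sets_touched_contiguous(16384))
```

In this layout 8 counters share each cache line, and consecutive lines
fall into consecutive sets rather than piling onto the same 32.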

John
-- 
  John Polstra
  John D. Polstra & Co., Inc.                        Seattle, Washington USA
  "Disappointment is a good sign of basic intelligence."  -- Chögyam Trungpa

