Date: Tue, 03 Jul 2001 04:00:29 -0700
From: Terry Lambert
Reply-To: tlambert2@mindspring.com
To: "E.B. Dreger"
Cc: "Michael C . Wu", Matthew Rogers, freebsd-smp@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: CPU affinity hinting

"E.B. Dreger" wrote:
>
> > Date: Fri, 29 Jun 2001 21:44:43 -0500
> > From: Michael C . Wu
> >
> > The issue is a lot more complicated than what you think.
>
> How so?  I know that idleproc and the new ipending / threaded INTs
> enter the picture... and, after seeing the "HLT benchmark" page, it
> would appear that simply doing nothing is sometimes better than
> doing something, although I'm still scratching my head over that...

HLT'ing reduces the overall temperature and power consumption.

The current SMP-aware scheduler can't really HLT, because the
processors have to spin on the acquisition of the lock.

> > This actually is a big issue in our future SMP implementation.
>
> I presumed as much; the examples I gave were trivial.
>
> I also assume that memory allocation is a major issue... to
> not waste time with inter-CPU locking, I'd assume that memory
> would be split into pools, a la Hoard.  Maybe start with
> approx. NPROC count equally-sized pools, which are roughly
> earmarked per hypothetical process.

Yes, though my personal view of the Hoard allocator is that it's
not nice, and I don't want to see "garbage collection" in the
kernel.

The mbuf allocator that has been bandied about is a specialization
of the allocator that Alfred has been playing with, which is
intended to address this issue.

The problem with the implementations as they currently exist is
that they end up locking a lot, in what I consider to be unnecessary
overhead, to permit one CPU to free back to another CPU's pool
("buckets").  This is actually much better handled by having a
"dead pool" on a per-CPU basis, which only gets linked onto when
the free crosses a domain boundary.

The actual idea for per-CPU resource pools comes from Dynix; it's
described in their Usenix paper (1991), and in Vahalia's book, in
chapter 12.  (I actually disagreed with his preference for the SLAB
allocator, because of this issue, when I was doing the technical
review on the book for Prentice-Hall, prior to its publication; on
most of the rest of the book we agreed, and the rest was just minor
nits about language, additional references, etc.)

So there's a lot of prior art, by a lot of smart people, that
FreeBSD can draw upon, and has drawn upon.
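To make the "dead pool" idea concrete, here is a rough, untested
sketch of one way it might look.  All of the names are invented,
mutex(9) calls are standing in for whatever locking would really be
used, and mutex initialization and pool filling are omitted; the
point is only that allocation and same-CPU frees never take a lock,
and the dead pool lock shows up only when a free crosses a domain
boundary:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    #define POOL_MAXCPU     32      /* at most 32 CPUs (APIC ID limit) */

    struct pobj {
            struct pobj     *po_next;
            int              po_cpu;   /* CPU whose pool owns this object */
    };

    struct percpu_pool {
            struct pobj     *pp_free;  /* touched only by the owning CPU */
            struct pobj     *pp_dead;  /* cross-CPU frees get linked here */
            struct mtx       pp_dead_lock; /* taken only for cross-domain work */
    };

    static struct percpu_pool pools[POOL_MAXCPU];

    void *
    pool_alloc(int cpu)
    {
            struct percpu_pool *pp = &pools[cpu];
            struct pobj *po;

            if (pp->pp_free == NULL && pp->pp_dead != NULL) {
                    /*
                     * Local list is dry: reclaim what other CPUs freed
                     * back.  The unlocked peek at pp_dead is racy, but a
                     * miss just means we catch the reclaim next time.
                     */
                    mtx_lock(&pp->pp_dead_lock);
                    pp->pp_free = pp->pp_dead;
                    pp->pp_dead = NULL;
                    mtx_unlock(&pp->pp_dead_lock);
            }
            if ((po = pp->pp_free) != NULL)
                    pp->pp_free = po->po_next;  /* no lock: we own this list */
            return (po);
    }

    void
    pool_free(void *v, int cpu)
    {
            struct pobj *po = v;
            struct percpu_pool *pp = &pools[po->po_cpu];

            if (po->po_cpu == cpu) {
                    /* Common case: freed on the owning CPU, zero locks. */
                    po->po_next = pp->pp_free;
                    pp->pp_free = po;
            } else {
                    /* The free crossed a domain boundary: use the dead pool. */
                    mtx_lock(&pp->pp_dead_lock);
                    po->po_next = pp->pp_dead;
                    pp->pp_dead = po;
                    mtx_unlock(&pp->pp_dead_lock);
            }
    }

The owning CPU reclaims its dead pool only when its own free list
runs dry, so the cross-CPU locking is amortized instead of being
paid on every free.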
> I'm assuming that memory allocations are 1:1 mappable wrt
> processes.  Yes, I know that's faulty and oversimplified,
> particularly for things like buffers and filesystem cache.

FreeBSD has a unified VM and buffer cache.  VM _is_ FS cache _is_
buffers.

But actually, your assumption is really wrong.  If you have a single
process with multiple threads, then the threads want negaffinity --
they want to try to ensure that they are _not_ running on the same
CPU, so that they can maximize the amount of compute resources they
use simultaneously.

> > There are two types of processor affinity: user-configurable
> > and system automated.  We have no implementation of the former,
>
> Again, why not "hash(sys_auto, user_config) % NCPU"?  Identical
> processes would be on the same CPU unless perturbed by user_config.
> Collisions from identical user_config values in unrelated
> processes would be less likely because of the sys_auto perturbation.
>
> Granted: It Is Always More Complicated. (TM)  But for a first pass...

The correct way to handle this is to have per-CPU run queues, and to
migrate processes between the queues only under extraordinary
circumstances (e.g. intentionally, for load balancing).  Thus KSEs
tend to stay put on the CPU they are run on.

You also want negaffinity, as noted above.  In the simple case, this
can be achieved by having a 32-bit value in the proc struct (since
you can have at most 32 processors, because of the APIC ID
limitation); you start new KSEs on the processors whose bits are
still set in the value.  When a process is started initially, a
bitmap of the existing CPUs is copied in as part of the startup, and
bits are cleared as the process gets KSEs on each separate CPU.
Migration tries to keep KSEs on different CPUs.

Each CPU also has an input queue, which lets another CPU "hand off"
processes to it, based on load.  The input queue is locked for a
handoff, and for a read (if the queue head is non-NULL) on entry to
the per-CPU copy of the scheduler.  Thus, under normal
circumstances, when there is nothing in the queue, there are zero
locks to deal with.

Doing it this way also lets us put the HLT back into the scheduler
idle loop without losing on interrupts; the HLT was only taken out
in order to make the CPU that didn't currently have access to the
scheduler spin on the lock until the other CPU went to user space to
do work.

A final piece of the puzzle is a figure of merit for gauging the CPU
load on a given processor, to decide when to migrate.  This can be
an unlocked, read-only value that other processors use to decide
whether or not to shed load to your processor, based on their load
being much higher than yours.  To avoid barrier instructions, it's
probably worth putting this information in a per-CPU data page that
can be seen by other CPUs, and which also contains the queue head
for the handoff queue (the input queue, above); barriers are avoided
by marking these pages as non-cacheable.
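To illustrate the negaffinity mask and the handoff queue, here is
another rough, untested sketch (this is not Alfred's code; the names
are invented, mutex(9) calls stand in for the real locking, and
details such as what to do when the mask empties out are my own
guesses):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    #define SCHED_MAXCPU    32      /* APIC ID limit again */

    struct kse;                     /* whatever a KSE ends up being */

    struct percpu_sched {
            struct kse      *ps_runq;    /* private run queue, no lock */
            struct kse      *ps_handoff; /* other CPUs push here, under lock */
            struct mtx       ps_handoff_lock;
            u_int            ps_load;    /* unlocked figure of merit */
    };

    static struct percpu_sched sched_cpu[SCHED_MAXCPU];

    /*
     * Pick a CPU for a new KSE: take the lowest-numbered CPU whose bit
     * is still set in the process's mask, then clear the bit so the
     * next KSE lands on a different CPU.  Refilling the mask from the
     * set of existing CPUs when it empties is a guess on my part.
     */
    int
    kse_pick_cpu(u_int *cpumask, u_int allcpus)
    {
            int cpu;

            if (*cpumask == 0)
                    *cpumask = allcpus;
            for (cpu = 0; (*cpumask & (1U << cpu)) == 0; cpu++)
                    continue;
            *cpumask &= ~(1U << cpu);
            return (cpu);
    }

    /*
     * Per-CPU scheduler entry: the handoff lock is taken only when an
     * unlocked peek says the queue head is non-NULL, so the common
     * case involves zero locks; if the run queue is still empty
     * afterward, it is safe to HLT until the next interrupt.
     */
    struct kse *
    sched_pick_next(int cpu)
    {
            struct percpu_sched *ps = &sched_cpu[cpu];

            if (ps->ps_handoff != NULL) {   /* racy peek is fine */
                    mtx_lock(&ps->ps_handoff_lock);
                    /* ... splice ps_handoff onto ps_runq, clear ps_handoff ... */
                    mtx_unlock(&ps->ps_handoff_lock);
            }
            /* ... pop the head of ps_runq with no lock, or HLT if empty ... */
            return (ps->ps_runq);
    }

The ps_load field and the handoff queue head are the pieces that
would live in the per-CPU data page described above, readable by
other CPUs when they decide whether to shed load here.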
> > and alfred-vm has a semblance of the latter.  Please wait
> > patiently.....
>
> Or, if impatient, would one continue to brainstorm, not expect a
> response (i.e., not get disappointed when something basic is posted),
> and track -current after the destabilization?  :-)

I've had a number of conversations with Alfred on the ideas outlined
briefly above, and on his thoughts on the subject (he and I work at
the same place).

Alfred has experimental code which does per-CPU run queues, as
described above, and he has some other code which lets him "lock" a
process onto a particular CPU.  (I personally don't think that's
terrifically useful, in the grand scheme of things, but you can get
the same effect by having a "don't migrate this process" bit, and
simply not shedding the process to another CPU, regardless of load.)

-- Terry