Date: Tue, 03 Jul 2001 04:00:29 -0700
From: Terry Lambert
Reply-To: tlambert2@mindspring.com
To: "E.B. Dreger"
Cc: "Michael C . Wu", Matthew Rogers, freebsd-smp@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: CPU affinity hinting

"E.B. Dreger" wrote:
>
> > Date: Fri, 29 Jun 2001 21:44:43 -0500
> > From: Michael C . Wu
> >
> > The issue is a lot more complicated than what you think.
>
> How so?  I know that idleproc and the new ipending / threaded INTs
> enter the picture... and, after seeing the "HLT benchmark" page, it
> would appear that simply doing nothing is sometimes better than
> doing something, although I'm still scratching my head over that...

HLT'ing reduces the overall temperature and power consumption.

The current SMP-aware scheduler can't really HLT, because the
processors have to spin on the acquisition of the lock.

> > This actually is a big issue in our future SMP implementation.
>
> I presumed as much; the examples I gave were trivial.
>
> I also assume that memory allocation is a major issue... to
> not waste time with inter-CPU locking, I'd assume that memory
> would be split into pools, a la Hoard.  Maybe start with
> approx. NPROC count equally-sized pools, which are roughly
> earmarked per hypothetical process.

Yes, though my personal view of the Hoard allocator is that it's
not nice, and I don't want to see "garbage collection" in the
kernel.

The mbuf allocator that has been bandied about is a specialization
of the allocator that Alfred has been playing with, which is
intended to address this issue.

The problem with the implementations as they currently exist is
that they end up locking a lot, in what I consider to be unnecessary
overhead, to permit one CPU to free back to another CPU's pool
("buckets").  This is actually much better handled by having a
"dead pool" on a per-CPU basis, which only gets linked onto when
the free crosses a domain boundary.

The actual idea for per-CPU resource pools comes from Dynix; it's
described in their Usenix paper (1991), and in Vahalia's book, in
chapter 12.  (I actually disagreed with his preference for the SLAB
allocator, because of this issue, when I was doing the technical
review on the book for Prentice-Hall, prior to its publication; on
most of the rest of the book we agreed, and the rest was just minor
nits about language, additional references, etc.)

So there's a lot of prior art, by a lot of smart people, that
FreeBSD can draw upon, and has drawn upon.
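To make the "dead pool" idea concrete, here is a rough, untested
sketch of one way it might look.  All of the names are invented,
mutex(9) calls are standing in for whatever locking would really be
used, and mutex initialization and pool filling are omitted; the
point is only that allocation and same-CPU frees never take a lock,
and the dead pool lock shows up only when a free crosses a domain
boundary:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    #define POOL_MAXCPU     32      /* at most 32 CPUs (APIC ID limit) */

    struct pobj {
            struct pobj     *po_next;
            int              po_cpu;   /* CPU whose pool owns this object */
    };

    struct percpu_pool {
            struct pobj     *pp_free;  /* touched only by the owning CPU */
            struct pobj     *pp_dead;  /* cross-CPU frees get linked here */
            struct mtx       pp_dead_lock; /* taken only for cross-domain work */
    };

    static struct percpu_pool pools[POOL_MAXCPU];

    void *
    pool_alloc(int cpu)
    {
            struct percpu_pool *pp = &pools[cpu];
            struct pobj *po;

            if (pp->pp_free == NULL && pp->pp_dead != NULL) {
                    /*
                     * Local list is dry: reclaim what other CPUs freed
                     * back.  The unlocked peek at pp_dead is racy, but a
                     * miss just means we catch the reclaim next time.
                     */
                    mtx_lock(&pp->pp_dead_lock);
                    pp->pp_free = pp->pp_dead;
                    pp->pp_dead = NULL;
                    mtx_unlock(&pp->pp_dead_lock);
            }
            if ((po = pp->pp_free) != NULL)
                    pp->pp_free = po->po_next;  /* no lock: we own this list */
            return (po);
    }

    void
    pool_free(void *v, int cpu)
    {
            struct pobj *po = v;
            struct percpu_pool *pp = &pools[po->po_cpu];

            if (po->po_cpu == cpu) {
                    /* Common case: freed on the owning CPU, zero locks. */
                    po->po_next = pp->pp_free;
                    pp->pp_free = po;
            } else {
                    /* The free crossed a domain boundary: use the dead pool. */
                    mtx_lock(&pp->pp_dead_lock);
                    po->po_next = pp->pp_dead;
                    pp->pp_dead = po;
                    mtx_unlock(&pp->pp_dead_lock);
            }
    }

The owning CPU reclaims its dead pool only when its own free list
runs dry, so the cross-CPU locking is amortized instead of being
paid on every free.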
> I'm assuming that memory allocations are 1:1 mappable wrt
> processes.  Yes, I know that's faulty and oversimplified,
> particularly for things like buffers and filesystem cache.

FreeBSD has a unified VM and buffer cache.  VM _is_ FS cache _is_
buffers.

But actually, your assumption is really wrong.  If you have a single
process with multiple threads, then the threads want negaffinity --
they want to try to ensure that they are _not_ running on the same
CPU, so that they can maximize the amount of compute resources they
use simultaneously.

> > There are two types of processor affinity: user-configurable
> > and system automated.  We have no implementation of the former,
>
> Again, why not "hash(sys_auto, user_config) % NCPU"?  Identical
> processes would be on the same CPU unless perturbed by user_config.
> Collisions from identical user_config values in unrelated
> processes would be less likely because of the sys_auto perturbation.
>
> Granted: It Is Always More Complicated. (TM)  But for a first pass...

The correct way to handle this is to have per-CPU run queues, and to
migrate processes between the queues only under extraordinary
circumstances (e.g. intentionally, for load balancing).  Thus KSEs
tend to stay put on the CPU they are run on.

You also want negaffinity, as noted above.  In the simple case, this
can be achieved by having a 32-bit value in the proc struct (since
you can have at most 32 processors, because of the APIC ID
limitation); you start new KSEs on the processors whose bits are
still set in the value.  When a process is started initially, a
bitmap of the existing CPUs is copied in as part of the startup, and
bits are cleared as the process gets KSEs on each separate CPU.
Migration tries to keep KSEs on different CPUs.

Each CPU also has an input queue, which lets another CPU "hand off"
processes to it, based on load.  The input queue is locked for a
handoff, and for a read (if the queue head is non-NULL) on entry to
the per-CPU copy of the scheduler.  Thus, under normal
circumstances, when there is nothing in the queue, there are zero
locks to deal with.

Doing it this way also lets us put the HLT back into the scheduler
idle loop without losing on interrupts; the HLT was only taken out
in order to make the CPU that didn't currently have access to the
scheduler spin on the lock until the other CPU went to user space to
do work.

A final piece of the puzzle is a figure of merit for gauging the CPU
load on a given processor, to decide when to migrate.  This can be
an unlocked, read-only value that other processors use to decide
whether or not to shed load to your processor, based on their load
being much higher than yours.  To avoid barrier instructions, it's
probably worth putting this information in a per-CPU data page that
can be seen by other CPUs, and which also contains the queue head
for the handoff queue (the input queue, above); barriers are avoided
by marking these pages as non-cacheable.
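To illustrate the negaffinity mask and the handoff queue, here is
another rough, untested sketch (this is not Alfred's code; the names
are invented, mutex(9) calls stand in for the real locking, and
details such as what to do when the mask empties out are my own
guesses):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    #define SCHED_MAXCPU    32      /* APIC ID limit again */

    struct kse;                     /* whatever a KSE ends up being */

    struct percpu_sched {
            struct kse      *ps_runq;    /* private run queue, no lock */
            struct kse      *ps_handoff; /* other CPUs push here, under lock */
            struct mtx       ps_handoff_lock;
            u_int            ps_load;    /* unlocked figure of merit */
    };

    static struct percpu_sched sched_cpu[SCHED_MAXCPU];

    /*
     * Pick a CPU for a new KSE: take the lowest-numbered CPU whose bit
     * is still set in the process's mask, then clear the bit so the
     * next KSE lands on a different CPU.  Refilling the mask from the
     * set of existing CPUs when it empties is a guess on my part.
     */
    int
    kse_pick_cpu(u_int *cpumask, u_int allcpus)
    {
            int cpu;

            if (*cpumask == 0)
                    *cpumask = allcpus;
            for (cpu = 0; (*cpumask & (1U << cpu)) == 0; cpu++)
                    continue;
            *cpumask &= ~(1U << cpu);
            return (cpu);
    }

    /*
     * Per-CPU scheduler entry: the handoff lock is taken only when an
     * unlocked peek says the queue head is non-NULL, so the common
     * case involves zero locks; if the run queue is still empty
     * afterward, it is safe to HLT until the next interrupt.
     */
    struct kse *
    sched_pick_next(int cpu)
    {
            struct percpu_sched *ps = &sched_cpu[cpu];

            if (ps->ps_handoff != NULL) {   /* racy peek is fine */
                    mtx_lock(&ps->ps_handoff_lock);
                    /* ... splice ps_handoff onto ps_runq, clear ps_handoff ... */
                    mtx_unlock(&ps->ps_handoff_lock);
            }
            /* ... pop the head of ps_runq with no lock, or HLT if empty ... */
            return (ps->ps_runq);
    }

The ps_load field and the handoff queue head are the pieces that
would live in the per-CPU data page described above, readable by
other CPUs when they decide whether to shed load here.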
> > and alfred-vm has a semblance of the latter.  Please wait
> > patiently.....
>
> Or, if impatient, would one continue to brainstorm, not expect a
> response (i.e., not get disappointed when something basic is posted),
> and track -current after the destabilization?  :-)

I've had a number of conversations with Alfred on the ideas outlined
briefly above, and on his thoughts on the subject (he and I work at
the same place).

Alfred has experimental code which does per-CPU run queues, as
described above, and he has some other code which lets him "lock" a
process onto a particular CPU.  (I personally don't think that's
terrifically useful, in the grand scheme of things, but you can get
the same effect by having a "don't migrate this process" bit, and
simply not shedding the process to another CPU, regardless of load.)

-- Terry