From: Terry Lambert
Message-Id: <199912140237.TAA01908@usr08.primenet.com>
Subject: Re: Recent face to face threads talks.
To: eischen@vigrid.com (Daniel M. Eischen)
Date: Tue, 14 Dec 1999 02:37:47 +0000 (GMT)
Cc: adsharma@sharmas.dhs.org, arch@freebsd.org
In-Reply-To: <3853F3C0.EE96FB16@vigrid.com> from "Daniel M. Eischen" at Dec 12, 99 02:13:04 pm
Sender: owner-freebsd-arch@FreeBSD.ORG

> > > Wouldn't this be akin to what other systems call a kernel thread or LWP?
> >
> > A kernel thread or a LWP has a 1-to-1 relationship with a kernel stack.
> > In our discussion, a KSE has a 1-to-1 relationship with a kernel stack.  The
> > purpose of the Q structure was to enforce "scheduling classes" - to
> > represent a set of threads having the same priority and to ensure
> > fairness in the scheduler.
> >
> > Apart from this distinction, KSE is the closest to a kernel thread/LWP
> > and could possibly be used to implement blocking interrupt threads.
>
> A KSE can't be placed in the run queue (or isn't in our implementation).
> A LWP or kernel thread can be.  A LWP can have it's own priority.
> I'd argue that our Q much closer resembles a LWP than does a KSE.

This is probably more detail than is appropriate, since we agreed to
defer the separation of the "Q" until later, but...

"Q" is a structure that acts as a scheduler reservation.  Its purpose
is to create an association between KSEs, a process, and a quantum at
a certain priority.  It also provides the basis for getting multiple
processors into user space at the same time.

In a multithreaded process on an SMP machine, where all threads are at
the same priority, you will normally have one "Q" per CPU, unless you
have more CPUs than threads, in which case you'll have a number of
"Q"s equal to the number of threads.

A traditional UNIX process is the same as one "Q" and one KSE.

The "Q" is the structure against which quantum is accounted, and is
the container for the priority of the process.
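As a concrete illustration of the above -- a sketch only, with every
structure and field name invented here rather than taken from any
actual source -- a "Q" might look something like this:

#include <sys/queue.h>

struct proc;				/* forward declarations only;  */
struct kse;				/* a sketch, not kernel source */

/*
 * Hypothetical scheduler reservation ("Q"): ties a set of KSEs to a
 * process and to a quantum at a given priority, and is the object
 * that actually sits on a per-CPU run queue.
 */
struct sched_q {
	TAILQ_ENTRY(sched_q)	 q_runq;	/* linkage on a per-CPU run queue */
	struct proc		*q_proc;	/* process owning the reservation */
	TAILQ_HEAD(, kse)	 q_kseq;	/* KSEs issued against it */
	int			 q_priority;	/* priority it competes at */
	int			 q_quantum;	/* remaining quantum, accounted here */
	int			 q_cpu;		/* CPU the reservation currently lives on */
};

The traditional UNIX process described above would then correspond to
exactly one such reservation with exactly one KSE on its list.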
A "Q" is a close approximation of the Solaris "LWP", but it is not the
same thing, and using the same terminology should probably be avoided,
since it will be confusing.

The difference is that an LWP is a kernel thread in the strictest
formal sense: it is an explicit backing object, and one or more user
space threads are mapped onto it.  Solaris as of 2.5 made this mapping
less explicit, by allocating another LWP to run the UTS (user space
threads scheduler) when the kernel ran out of LWPs but there were
still runnable user space threads.  This extra LWP signalled the UTS,
which would then call down to create more backing LWPs on demand for
the runnable user space threads.

As an unfortunate historical note, SunOS 4.1.x supported something
called "liblwp", which was effectively a wrapper for turning a
blocking I/O call into a non-blocking I/O call plus a threads context
switch.  Sun added aio_read/aio_write/aio_wait/aio_cancel system
calls, as well as a helper system call for dealing with the user
stack, which knew about SPARC register windows; the University of
Washington technical report "User space threads and SPARC Register
Windows" (1992) describes the implementation.

In terms of the scheduler, "Q"s go onto run queues, while KSEs go onto
sleep queues.

KSEs have become what I would call a blocking context for async
system calls -- a "blocking call context", or BCC for short.  They
are intended to be allocated out of a pool, or preferably out of a
per-CPU resource pool, so as to avoid requiring The Big Giant
Lock(tm), except when draining or refilling a pool at its high water
mark or low water mark, respectively (this is precisely what Dynix
did to achieve much higher granularity, at the cost of maintaining a
two-layer malloc() system, with the possibility of some memory being
"lost" to the non-empty pools; it is a brilliant design, IMO).

KSEs are container objects for sleeping on their addresses and for
the kernel stack associated with each outstanding blocked system
call.  The stack _may_ be given back prior to the KSE itself becoming
unreferenced.  A KSE must also have queue pointers for putting it on
the sleep queues, and for maintaining a linked list in the pools.

Editorial note: A "Q" is what I would have called the KSE.  A kernel
schedulable entity implies an interaction with the scheduler, but we
are probably going to get stuck with the current definitions.  If so,
I suggest the term SR, for "Scheduler Reservation", as a better fit.

In terms of the scheduler (also deferred work), the main idea is to
have a per-CPU scheduler that grabs a per-CPU scheduler lock each time
it wants to manipulate its data.  When migrating "Q"s between CPUs,
the migrating CPU will have to grab both its own scheduler lock
("migrate from") and the target CPU's scheduler lock ("migrate to").
In the common case, a CPU would only grab its own lock, which is not
in the same contention domain as the other CPUs' locks, so there would
be no stalling, as there is with The Big Giant Lock(tm).  Like the
per-processor KSE pools, this enables much greater concurrency: a
migration of a "Q" from one CPU to another on an 8-CPU system will
occupy one CPU and potentially stall a second, leaving the other six
CPUs unaffected.
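A minimal sketch of the locking scheme just described, continuing the
hypothetical struct sched_q above.  The lock primitives, the structure
and function names, and the fixed lock-ordering rule are all my own
illustration, not anything from the FreeBSD source:

#include <sys/queue.h>

/* Placeholder spin lock, only so the sketch is self-contained. */
typedef volatile int sched_lock_t;
static void sched_lock(sched_lock_t *l)   { while (__sync_lock_test_and_set(l, 1)) ; }
static void sched_unlock(sched_lock_t *l) { __sync_lock_release(l); }

struct cpu_sched {
	sched_lock_t		cs_lock;	/* this CPU's scheduler lock */
	TAILQ_HEAD(, sched_q)	cs_runq;	/* "Q"s runnable on this CPU */
};

extern struct cpu_sched	cpu_sched[];		/* one instance per CPU */

/*
 * Move a reservation from one CPU's run queue to another's.  Only the
 * two per-CPU locks involved are taken; every other CPU keeps working
 * against its own lock.  Taking the locks in a fixed order (lower CPU
 * index first) keeps two simultaneous migrations in opposite
 * directions from deadlocking against each other.
 */
void
q_migrate(struct sched_q *q, int from, int to)
{
	int lo = from < to ? from : to;
	int hi = from < to ? to : from;

	if (from == to)
		return;

	sched_lock(&cpu_sched[lo].cs_lock);
	sched_lock(&cpu_sched[hi].cs_lock);

	TAILQ_REMOVE(&cpu_sched[from].cs_runq, q, q_runq);
	q->q_cpu = to;
	TAILQ_INSERT_TAIL(&cpu_sched[to].cs_runq, q, q_runq);

	sched_unlock(&cpu_sched[hi].cs_lock);
	sched_unlock(&cpu_sched[lo].cs_lock);
}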
There are only three cases where a "Q" is created:

o	On fork(2) (obviously).

o	When asking to be scheduled on additional CPUs in an SMP
	system.  It is likely that this particular case will not
	require privileges, or that soft limits requiring privileges
	to violate will be implemented (e.g. an administrative limit
	of reservations on 4 CPUs in a 16-CPU system, controlled by
	the user's login class).

o	When asking for an additional "Q" in order to effect
	scheduling priorities for PTHREAD_SCOPE_SYSTEM threads.

It is likely that dropping priority will be unprivileged, and will
merely result in a UTS designation change, since creating an
additional "Q" would permit the process to compete as multiple
processes against other processes for quantum (in other words, it
would pretend to run the affected threads at a lower system priority,
but in reality would not).  Raising priority is already a privileged
operation; it is assumed that raising priority, by virtue of the fact
that it requires privilege, will in fact create another "Q".

To be able to effectively manage scheduler reservations at multiple
priorities, except in the face of priority lending to deal with
inversion, there is a new requirement for an explicit yield(2) call to
give unused higher-priority quantum back to the system.

Because CPU migration is an exceptional event for a process, and is a
matter of kernel scheduler policy, there is natural CPU affinity for
each "Q" in this approach.  This is many times better than the current
situation in both FreeBSD and Linux, as I think some experiments by
Mike Smith would support.

As a simple algorithm for achieving initial balancing, adding:

	u_int32_t	p_cpumask;	/* CPUs with scheduler reservations */
	u_int32_t	p_Q_count;	/* number of "Q"s in the process */

to the proc structure, and then setting a bit for each CPU that
currently has a "Q" outstanding on it (and moving the bit if the "Q"
gets migrated), will ensure that each new "Q" can be created on a
different CPU.  Counting the bits and keeping a running "Q" count is a
cheap way of knowing whether multiple "Q"s have ended up on the same
CPU as a result of migration; you would then need to traverse the list
to identify these and move them between CPUs, assuming a load
balancing vs. concurrency algorithm in the kernel scheduler.
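A rough sketch of that balancing step, using only the two fields
proposed above.  The stub structure and the helper name are stand-ins
for illustration; a real version would run under the scheduler's locks
and would pick the least loaded CPU in the fallback case rather than
CPU 0:

#include <sys/types.h>

struct proc {				/* stub: only the proposed fields */
	u_int32_t	p_cpumask;	/* CPUs with scheduler reservations */
	u_int32_t	p_Q_count;	/* number of "Q"s in the process */
};

/*
 * Choose a CPU for a new "Q" belonging to p.  Prefer a CPU that does
 * not already host one of this process's reservations, so each new
 * "Q" lands on a different CPU; if all ncpus CPUs are already taken,
 * the process simply doubles up.
 */
static int
q_pick_cpu(struct proc *p, int ncpus)
{
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if ((p->p_cpumask & (1U << cpu)) == 0) {
			p->p_cpumask |= 1U << cpu;
			p->p_Q_count++;
			return (cpu);
		}
	}
	p->p_Q_count++;			/* more "Q"s than CPUs */
	return (0);
}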
> > It was also agreed that it doesn't make sense to do everything at once,
> > but do it in some sensible order, without precluding elements of the
> > design. Some of the tasks - break the proc structure into a proc and KSE,
> > per CPU scheduling queues, implementing SA.

My own take on this was that implementing the scheduler changes would
be a heck of a lot easier than the other work.  8-).  On the other
hand, the general consensus was that it would be easier to do the
other work than the scheduler changes, so the scheduler changes are
going to need to wait.

> > My personal opinion is that it might be easier to go in steps -
> > rfork/clone based implementation (aka linuxthreads), then a
> > Solaris like m x n implementation and then scheduler activations.
> > Implementation wise, they're in the increasing order of difficulty.
> > From what I heard from Terry, this is also the path that Solaris
> > took. This also lets application writers pick the right model for
> > their work.
>
> I dunno.  It seems easier to me to implement SA from the start,
> concentrating on eliminating much of the extraneous code in libc_r.
> I'd start with making it only work for one process or Q.  The
> userland part of this seems really easy to me.  Once that works,
> I'd add the ability to have multiple Q's.

This is what looked right to me, and I think that Jason Evans agreed.
The intent was to not break out "Q" until later, as part of the
scheduler work.  For one thing, we already have Linux threads
implemented (they can be loaded as a kernel module).  For another,
kernel threads backing user space threads is a horrible
implementation, and has all sorts of problems with thread group
(process) affinity when deciding who to run next on a partially
consumed quantum.  Adding thread group affinity to a kernel scheduler
is a sure way to shoot your leg off.

I think that this is at least natural, if not the order I would
choose.  Much of the hardest work is allowing the process to be on
multiple run queues (or on the same run queue multiple times).  This
work could even be debugged on a UP machine, where you stuff the same
process into a single scheduler several times (indeed, this is
necessary for PTHREAD_SCOPE_SYSTEM thread priorities on a uniprocessor
machine).

This seems to me to most resemble the shared library issues, where it
is better to do it right the first time than to take a bogus
implementation (e.g. the System V-style shared library implementation,
which ate a chunk of address space for each version of each library,
since the libraries had to be non-PIC code mapped to the same location
in all processes).  Waiting on Jeffrey Hsu's patches for PIC in GCC
(actually released in support of LKMs, which needed PIC) was The Right
Thing To Do(tm).

> > If we stick to the POSIX threads interface strictly, we shouldn't be
> > afraid of getting stuck with one of the implementations.
> >
> > Other concerns that were expressed - crossing protection boundaries too
> > often to pass messages about blocking and rescheduling. Matt Dillon
> > responded that these messages could be batched and made more efficient.
>
> I agree.  It seemed like the SA paper went to extremes in notifying
> the UTS of unblocked threads.  I think we can do this at more opportune
> times, like when a Q is resumed or on a user->kernel crossing.  I don't
> know that we want to be interrupting a Q while it is running a thread
> just to notify the UTS that another thread has unblocked in the kernel.

Julian had a nice discussion with Jason (and later with Matt and me)
on this topic.  The basic idea was to run the threads up to the
user/kernel boundary, but no farther.  At that point, the call has
effectively returned everything but the CPU on which it is running to
user space, and a nice optimization is that the kernel stack
associated with the KSE can be released.  This would allow hundreds of
thousands of outstanding KSEs, a real possibility on a heavily loaded
system.

Matt suggested a shared memory segment, rather than a message, in
which the status could be reflected to user space.  This strikes me as
perhaps a premature optimization, but it would give us a place to put
information about the "Q" of a process returning to user space, which
is something the UTS needs in order to separate quanta of different
priority for assignment to a new user space thread (or to make an
explicit yield(2) call, if the high priority threads are all blocked
-- though a priority lend to the lower priority thread holding the
resource would probably be a better idea, it will require additional
work in user space to implement).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.