From: Terry Lambert
To: bright@hotjobs.com (Alfred Perlstein)
Cc: hackers@FreeBSD.ORG
Date: Tue, 5 Jan 1999 03:15:38 +0000 (GMT)
Subject: Re: question about re-entrancy.
Message-Id: <199901050315.UAA02641@usr05.primenet.com>
In-Reply-To: from "Alfred Perlstein" at Dec 20, 98 04:36:51 pm

> I'm a bit confused.  I've been watching the discussions going on about
> SMP.  The issue of kernel re-entrancy has been brought up many times.
>
> So here's the question:
>
> I thought the kernel was re-entrant: when a process makes a syscall, its
> thread of execution enters the kernel, where its arguments are validated
> and careful checks are done to ensure security, and then the syscall is
> done.
>
> If during the system call the process is put to sleep or rescheduled,
> another process may enter the kernel.  As long as the splxxx() calls do
> not interfere with each other, you have 2 processes executing in the
> kernel at that point.  (More can enter as well.)
>
> Or... what I'm now thinking is that once a process enters the kernel it
> must run until it exits the system call.  I have a problem with this,
> though: processes such as nfsiod/nfsd enter the kernel through a syscall
> and never exit (according to Design and Implementation), so how then are
> other processes allowed into the kernel?
>
> Can anyone explain this a bit better?

The kernel is process reentrant, but not processor reentrant.  It helps
if you think of CPU cycles as a consumable resource, instead of thinking
of processes as work to do.

There are three ways to enter the kernel:

1)	System call
2)	Interrupt
3)	Fault

Only one CPU can be in the kernel at a time.  If a second CPU attempts
to enter the kernel while another CPU is already in the kernel, it spins
on The Big Giant Lock (tm) ("We didn't make The Big Giant Lock, we only
made it Big and Giant").

In general, there are two cases to consider:

1)	A CPU is already in the kernel
2)	No CPU is in the kernel (all CPUs are in user space)

In the first case, a fault or interrupt is given to the CPU that is
already in the kernel (interrupts run in virtual wire mode on the APICs,
making interrupt processing truly symmetric, unlike other OSes...); in
the second case, it is given to whichever CPU happens to take it.
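To make the "spins on The Big Giant Lock" part concrete, here is a
minimal sketch of that kind of entry gate.  This is not the actual
FreeBSD mplock code (which lives in assembly and also handles recursive
entry); the giant_lock variable and function names are invented for the
example.

/*
 * Illustrative only: a single lock word guards the whole kernel.
 * A CPU that finds it held simply burns cycles until the holder leaves.
 */
#include <stdatomic.h>

static atomic_int giant_lock;	/* 0 = free, 1 = some CPU is in the kernel */

void
giant_enter(void)
{
	/* Spin until we atomically change the lock from free to held. */
	while (atomic_exchange_explicit(&giant_lock, 1, memory_order_acquire))
		;		/* another CPU is in the kernel; wait */
}

void
giant_exit(void)
{
	atomic_store_explicit(&giant_lock, 0, memory_order_release);
}

Every syscall, trap, and interrupt entry would call giant_enter() before
touching kernel data structures, and giant_exit() on the way back out;
that is what makes the kernel "processor non-reentrant".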
In general, it's possible to modify the interrupt processing to allow
both processors in at the same time.  To do this, you have to make sure
that the subset of the kernel runnable at interrupt time is reentrant.
While not entirely trivial, this is a pretty easy task, except for
entries via the system call interrupt (from an SMP perspective, a call
gate is much better than an interrupt for system calls in this case,
since otherwise there's still the need to grab The Big Giant Lock until
you discriminate the interrupt source).  This hasn't been done.

The tangent into spl is a red herring.  The spl mechanism is basically a
method of blocking operations that would "take too long" from
interrupting the current operation: in other words, interrupts and
faults, since those are the only things that can take the processor away
from its assigned task once it is in the kernel.

The reason that multiple processors can't be in the kernel at the same
time is a bit more complicated.  Consider a hard RT (RealTime) system.
In order to meet hard RT requirements, you have to be able to do kernel
preemption.  Kernel preemption is basically the same problem that you
face trying to run two processors in the kernel at the same time, or
trying to run kernel threads, especially when you are trying to
implement solutions for RT priority inversion or trying to IPC between
kernel threads.  The issue raised is how you protect kernel structures
that may be accessed by two processors at the same time... you have to
lock against the other processor.  Basically, the only preemption
mechanisms are faults (which can only occur from user space, except for
the Pentium lockup bug workaround) and interrupts (which occur as a
result of external events, like another machine sending a packet to your
Ethernet interface).

There are a number of serious architectural issues you have to address
to be able to get multiple processes in the kernel at the same time.
Fortunately, there are ways to cheat:

1)	Address interrupt handler reentrancy by doing the minimal amount
	of work at interrupt time, and queueing the interrupt and the
	context for the remaining work for later processing (by a
	highest-priority kernel thread).  Don't ACK (reenable) the
	interrupt to the controller (or APIC) until you have done the
	minimal processing and requeued.  For shared
	(PCI/EISA/MCA/SBUS/other) interrupts, run the minimal code for
	each device that's asserting the interrupt before you ACK it.
	This is the best granularity, since most of this is device bus
	work, and you aren't going to be able to access the device bus
	from more than one CPU at a time anyway (you will need a Device
	Bus Lock for INB/OUTB/INW/OUTW).

2)	Divorce the kernel stack from the proc struct.  This allows you
	to reenter using an "entrancy instance" -- allocating a kernel
	stack from a pool of kernel stacks not explicitly tied to a
	process.  It also allows you to preempt a processor running in
	the kernel and have it go do something else (you will need an
	Inactive Stack List Lock and an Active Stack List Lock so that
	you can grab an existing stack or, if you are all out, allocate
	a new stack and add it to the stack list).  Combine this with an
	async call gate (basically a VMS-style "event flag" or "AST"
	mechanism), and you don't need kernel threads to spread process
	load over multiple processors.

3)	Create per-CPU resource pools.  This is described in
	excruciating detail in the Dynix VM allocator paper (Dynix, if
	you will remember, scaled to 32 CPUs without serious
	degradation).  You allocate (and free) resources from the
	per-CPU pool based on low (and high) watermarks, instead of
	contending on a global pool (you will need a Global Resource
	Lock for the watermark transfers to and from the global pool,
	and you will need an IPI [Inter Processor Interrupt] mechanism
	so that low resource conditions can override the watermarks;
	typically, there will be little or no inter-CPU contention
	during normal operation).  A sketch of this allocation path
	follows below.
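As a rough illustration of the per-CPU pool idea: each CPU allocates
from a private free list and only touches the global pool (under the
Global Resource Lock) when it crosses its low or high watermark.  This
is a sketch of the shape of the idea, not Dynix or FreeBSD code; the
names (cpu_pool, global_pool, the watermark values) are made up, it
assumes the caller can't migrate CPUs mid-call, and it omits the IPI
path for genuinely low resource conditions.

#include <stddef.h>

#define NCPU		4
#define LOW_WATER	8	/* refill from the global pool below this */
#define HIGH_WATER	64	/* push surplus to the global pool above this */
#define BATCH		16	/* objects moved per global-pool visit */

struct object { struct object *next; };

struct pool {
	struct object	*freelist;
	int		 count;
};

static struct pool	cpu_pool[NCPU];	/* per-CPU; no lock needed */
static struct pool	global_pool;	/* shared; guarded by the global lock */

extern int	curcpu(void);		/* index of the executing CPU */
extern void	global_lock(void);	/* the "Global Resource Lock" */
extern void	global_unlock(void);

static struct object *
pool_get(struct pool *p)
{
	struct object *o = p->freelist;

	if (o != NULL) {
		p->freelist = o->next;
		p->count--;
	}
	return (o);
}

static void
pool_put(struct pool *p, struct object *o)
{
	o->next = p->freelist;
	p->freelist = o;
	p->count++;
}

struct object *
obj_alloc(void)
{
	struct pool *p = &cpu_pool[curcpu()];

	if (p->count < LOW_WATER) {
		/* Below the low watermark: pull a batch from the global pool. */
		global_lock();
		for (int i = 0; i < BATCH; i++) {
			struct object *o = pool_get(&global_pool);

			if (o == NULL)
				break;	/* real code would IPI other CPUs or sleep */
			pool_put(p, o);
		}
		global_unlock();
	}
	return (pool_get(p));		/* common case: CPU-private, no contention */
}

void
obj_free(struct object *o)
{
	struct pool *p = &cpu_pool[curcpu()];

	pool_put(p, o);
	if (p->count > HIGH_WATER) {
		/* Above the high watermark: push a batch back to the global pool. */
		global_lock();
		for (int i = 0; i < BATCH; i++)
			pool_put(&global_pool, pool_get(p));
		global_unlock();
	}
}

In the steady state neither obj_alloc() nor obj_free() ever takes the
global lock, which is the whole point: the common path touches only
CPU-private data, so the CPUs don't contend (or bounce cache lines)
against each other.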
This still leaves the need to have Big Giant Subsystem Locks (tm): you
push locks down through the system call (trap) mechanism to each
subsystem, until you eliminate the contention using object locks (I
don't like object locks) or, preferably, critical section locks.

Object locks are a necessary evil on lists of objects that must be
maintained, but lists of objects are themselves an evil, and contention
on these lists can be drastically reduced by implementing a hierarchical
intention-mode lock manager, breaking the lists into sections, and
pushing the actual objects to be locked down the hierarchy, so that
contention between execution contexts becomes unlikely (statistically,
the more subnodes, the less likely you are to run into contention
between any two objects taken at random).

Mostly, object lists should be done away with.  Cases where this isn't
possible (like the system open file table, the process list, the
directory name lookup cache, etc.) gain an obvious concurrency benefit
from intention-mode ("SIX") style locks.  For the objects themselves,
it's easy to make them an extension of one more level in the hierarchy
(for example, directory vnodes intended to be written).  You maintain an
active transitive closure calculation on the lock vectors from the root
as locks are added; this allows you to do deadlock detection by
traversing from the two nodes in the lock relationship back to the root
of the lock graph.  Otherwise, deadlock detection would be tantamount to
solving the travelling salesman problem.

In general, write locks are necessary, but read locks only need to be
held in cases where there are coherency issues: for example, a read lock
across a directory entry block while it's being copied to a user buffer
(for a "snapshot" of directory block contents, as is provided by the
getdents(2) system call).

The final issue is avoidance of cache-busting behaviour.  This pretty
much boils down to marking lock pages as non-cacheable, and then
assuring CPU affinity for processes (or threads -- in effect, whatever
you're calling your kernel execution contexts) to make sure that the L1
cache has value and that MESI (all Intel SMP is based on the MESI
architecture) stays away from the "I" (Invalidate) portion of the
acronym.

It's a lot of work, but it's not rocket science.  The most complicated
piece is the lock manager; to do it right, you need a five dimensional
vector relationship (an arc describing the intersection of three two
dimensional vectors: the lock list held by a locking entity, the
hierarchical list of lockable objects, and the relationship between
multiple locking entities within a single process -- threads within a
process).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.