From: Terry Lambert
Message-Id: <199903061913.MAA08788@usr06.primenet.com>
Subject: Re: lockf and kernel threads
To: jplevyak@inktomi.com (John Plevyak)
Date: Sat, 6 Mar 1999 19:13:21 +0000 (GMT)
Cc: tlambert@primenet.com, jplevyak@inktomi.com, hackers@FreeBSD.ORG
In-Reply-To: <19990305080618.B22589@tsdev.inktomi.com> from "John Plevyak" at Mar 5, 99 08:06:18 am

> > > Why is it too late after that?  In the patch I did the wait in exit1()
> > > right after the 'kill' of the peers.
> >
> > It's too late if you've gone to user space with the signal, because
> > it's an untrappable signal.  That's the trampoline I was referring
> > to.
>
> I understand about the trampoline, but I don't see why you can't send
> the signal out of user space, mostly because that is what 3.0+
> currently does!  (near the beginning of exit1() to all peers of a p_leader.)

Assume a kill -9 of a PID that is a "thread" in a "thread group"
(process).  The exit call is not explicit.  The SIGKILL is
mandatorily not caught.  The kill comes from the kernel, with no
user space notification.

> > Right.
> > Such a structure is essentially identical in content to
> > an async call context for an async call gate, BTW.  8-).
>
> I am missing context on this, however if it was a mechanism for
> generalized async calls w/o signals/polling but 'wait for one
> of N events' I am interested.

Yes, that's exactly it.

Basically, you use an alternate entry point to the system call
trap code, and immediately return after queuing the trap to the
kernel.  The context you need is a kernel stack for the call, a
kernel program counter, etc.

Basically the call runs as far as it can (potentially to completion),
and instead of putting the process to sleep, puts the call context
to sleep, returning the remainder of the quantum to user space.

This provides SMP scalability (call contexts can be serviced by
multiple CPUs) without requiring strict kernel threads to
implement.

You can program with the calls directly (like the current AIO code)
to get operation interleave, or you can implement a user space
call conversion scheduler, which converts standard calls into
async calls plus a thread context switch.

The advantage of implementing call conversion this way is that
the call wrapping code is then generic, and no additional work
needs to be done for threads, other than wrapping the conversion.
In this way, the implementation is drastically simpler, and has
drastically lower overhead, than the current call conversion
scheduler, which must do wrapping and locking in user space.

In addition, this implementation has significantly smaller context
switch overhead, compared to traditional kernel threading models,
since making a blocking system call is not the same thing as an
explicit scheduler yield, as it is with kernel threads.  When the
scheduler assigns a quantum, it's your quantum.
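
To make the call-context idea above a little more concrete, here is a
toy user-space model in C.  It is only a sketch under stated
assumptions, not kernel code: every name in it (call_ctx, async_call,
service_contexts, wakeup_context, reap_any, fake_read) is made up for
illustration, a function pointer stands in for the saved kernel
program counter, and an opaque argument stands in for the kernel
stack.  The "wait for one of N" step is modelled as a non-blocking
poll.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model; every name here is hypothetical, not a kernel API. */

enum ctx_state { CTX_FREE, CTX_RUNNABLE, CTX_SLEEPING, CTX_DONE };

/* A call body: returns 1 when complete, 0 when it would have to sleep. */
typedef int (*call_body_t)(void *arg, long *result);

struct call_ctx {
    enum ctx_state state;
    call_body_t    body;    /* stands in for the saved kernel PC    */
    void          *arg;     /* stands in for the saved kernel stack */
    long           result;
};

#define NCTX 8
static struct call_ctx ctx_table[NCTX];

/* Alternate entry point: queue the call and return to the caller at once. */
static int async_call(call_body_t body, void *arg)
{
    for (int i = 0; i < NCTX; i++) {
        if (ctx_table[i].state == CTX_FREE) {
            ctx_table[i] = (struct call_ctx){ CTX_RUNNABLE, body, arg, 0 };
            return i;               /* handle the caller can wait on */
        }
    }
    return -1;                      /* no free context */
}

/* Run each runnable context as far as it can go.  A call that would
 * block puts its context to sleep; the caller itself never sleeps. */
static void service_contexts(void)
{
    for (int i = 0; i < NCTX; i++) {
        struct call_ctx *c = &ctx_table[i];
        if (c->state == CTX_RUNNABLE)
            c->state = c->body(c->arg, &c->result) ? CTX_DONE : CTX_SLEEPING;
    }
}

/* The awaited event happened: make the sleeping context runnable again. */
static void wakeup_context(int id)
{
    if (id >= 0 && id < NCTX && ctx_table[id].state == CTX_SLEEPING)
        ctx_table[id].state = CTX_RUNNABLE;
}

/* "Wait for one of N", as a poll: reap any finished call, else return -1. */
static int reap_any(const int *ids, size_t n, long *result)
{
    for (size_t i = 0; i < n; i++) {
        if (ids[i] < 0 || ids[i] >= NCTX)
            continue;
        struct call_ctx *c = &ctx_table[ids[i]];
        if (c->state == CTX_DONE) {
            *result = c->result;
            c->state = CTX_FREE;
            return ids[i];
        }
    }
    return -1;
}

/* Demo call body: a read that cannot finish until data is ready. */
struct fake_io { int ready; long value; };

static int fake_read(void *arg, long *result)
{
    struct fake_io *io = arg;
    if (!io->ready)
        return 0;                   /* would block: context goes to sleep */
    *result = io->value;
    return 1;
}
```

A caller would queue fake_read with async_call(), keep doing other
work, and later pass the returned handle to reap_any().  A user space
call conversion scheduler would do the same thing, then switch to
another thread instead of polling.
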
You can deal with multiple CPUs being in user space simultaneously
in the same process trivially: you just allow more than one CPU to
return up with the code from the converted call, and enter your
call conversion scheduler -- the real reason for threads is to
provide procedural work-to-do for a quantum as a scheduled resource.

There are some obvious optimizations, as well.  You could easily
classify system calls with another entry in the struct sysent[]
structure element for a given call, and act accordingly, by type:

o       call may block

        Treat as the unoptimized case, and return when you hit
        a tsleep that blocks the call.

o       call never blocks

        Do not allocate a call context, and trivially return
        a static context for completion.

o       call always blocks

        Obtain parameters and immediately return so that the
        long latency event (e.g., tty I/O) does not detract
        from overall throughput.

Another optimization, which is obvious after consideration, is to
modify trap entry for SMP architectures, and immediately return
all but the calls which never block.  The calls which never block
are treated as if they are static data in user space, and merely
referenced via accessor functions for convenience and abstraction.

Once the context is in the kernel, it is given to the scheduler,
where it competes for a quantum on whatever CPU becomes next
available.

This is a bit counterintuitive, in that we are attempting to reduce
scheduler context switch overhead.  However, it doesn't matter
that the quantum is given back to the system (or, alternately,
given to the threads scheduler in the process that made the request,
based on some administrative fiat) because these events are
intrinsically short duration.  In other words, they are expected
to run to sleep, then run to completion following a wakeup, both
in less than a quantum.
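
The three-way classification of system calls described above could be
sketched roughly as follows.  This is a user-space illustration only,
and everything in it is an assumption: struct sysent_sketch is not
the real struct sysent, the blocktype field is the proposed extra
entry, and the handlers (getpid_like, read_like, ttyread_like) are
stand-ins that report whether they would block rather than actually
sleeping.

```c
#include <assert.h>

/* Hypothetical classification; not a field of the real struct sysent. */
enum sy_block { SY_MAYBLOCK, SY_NEVERBLOCKS, SY_ALWAYSBLOCKS };

/* What the async trap entry did with the call. */
enum dispatch {
    D_COMPLETED_INLINE,     /* never blocks: no context allocated     */
    D_RAN_TO_COMPLETION,    /* may block, but did not this time       */
    D_RAN_THEN_SLEPT,       /* may block: ran until it hit a sleep    */
    D_QUEUED_IMMEDIATELY    /* always blocks: parameters taken, queued */
};

/* Handler returns 1 when complete, 0 when it would have to sleep. */
typedef int (*sy_call_sketch_t)(long arg, long *retval);

struct sysent_sketch {
    int               narg;
    sy_call_sketch_t  call;
    enum sy_block     blocktype;   /* the proposed extra entry */
};

static int getpid_like(long arg, long *retval)
{
    (void)arg;
    *retval = 1234;                /* static data, always available */
    return 1;
}

static int read_like(long arg, long *retval)
{
    if (arg == 0)                  /* nothing buffered: would block */
        return 0;
    *retval = arg;                 /* "arg" bytes were already buffered */
    return 1;
}

static int ttyread_like(long arg, long *retval)
{
    (void)arg;
    (void)retval;
    return 0;                      /* long latency: always has to wait */
}

static const struct sysent_sketch sketch_sysent[] = {
    { 0, getpid_like,  SY_NEVERBLOCKS  },
    { 1, read_like,    SY_MAYBLOCK     },
    { 1, ttyread_like, SY_ALWAYSBLOCKS },
};

/* Act on the call according to its type. */
static enum dispatch
async_dispatch(const struct sysent_sketch *se, long arg, long *retval)
{
    switch (se->blocktype) {
    case SY_NEVERBLOCKS:
        (void)se->call(arg, retval);
        return D_COMPLETED_INLINE;
    case SY_ALWAYSBLOCKS:
        /* Only the parameters are taken here; running the handler
         * from a queued context is not modelled in this sketch. */
        return D_QUEUED_IMMEDIATELY;
    case SY_MAYBLOCK:
    default:
        return se->call(arg, retval) ? D_RAN_TO_COMPLETION
                                     : D_RAN_THEN_SLEPT;
    }
}
```

The point of the extra field is that the trap entry can decide up
front whether a call context is needed at all, instead of finding
out only when the call reaches a sleep.
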
Since the sleep discontinuity is impossible to overcome, and the
completion discontinuity is also impossible to overcome, this is
the minimal implementation of "non-interrupt scheduled kernel
work-to-do".

The possibilities are really pretty great.  The architecture
bypasses most of the issues for SMP scalability, without taking a
hit for having ignored rather than dealt with the issues.

Additionally, the architecture in the second optimization can, in
fact, be used to proxy the request in a distributed computing
environment.

Obviously, looking at this architecture, one of my hot buttons is
massively parallel distributed systems with completion agents that
operate with incomplete knowledge.  I think these types of systems
are going to be THE Most Important(tm) in the long to medium term
future -- e.g., they are exactly what is needed for operating a
fleet of hundreds of thousands of nanometer scale semiautonomous
machines doing organ repair.

> > You might be able to get someone to commit this, at least as a short
> > term solution to the problem.  It would get you over the "blessed"
> > hump so you can concentrate on more pressing issues.
>
> Hummm... any guesses as to whom might be most sympathetic?

Matt Dillon, David Greenman, Julian Elischer, Peter Wemm all spring
immediately to mind.

                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.