From: Terry Lambert
To: davids@webmaster.com (David Schwartz)
Cc: tlambert@primenet.com, unknown@riverstyx.net, chat@FreeBSD.ORG
Subject: Re: Known MMAP() race conditions ... ?
Date: Sat, 17 Jul 1999 16:21:26 +0000 (GMT)
In-Reply-To: <000101becfec$605cd280$021d85d1@youwant.to> from "David Schwartz" at Jul 16, 99 05:36:14 pm

> I was under the impression that disk I/O was still blocking in
> FreeBSD's libc_r.  I was also under the impression that the resolver
> was blocking.
>
> If disk I/O really is non-blocking, I would expect the performance to
> suffer, because thread context switches in a user-space threads
> implementation are generally more expensive than a kernel thread
> blocking on I/O.

Why is this?

In a kernel threads implementation, each thread competes as a separate
process with all other kernel threads.  A "process" is a group of one
or more kernel threads.  For all legacy code ("processes"), one kernel
thread competes with all other kernel threads, including those in
threaded programs ("multithreaded processes"), based on the number of
processes out there.

For the sake of avoiding confusion, we should talk about the
competition for quantum in terms of "Kernel Schedulable Entities", or
KSE's.

For a given program which is backed by multiple KSE's, there is no
guarantee of affinity or adjacency.  Any attempt to ensure affinity in
blocking operations could result in starvation of other KSE's -- that
is, KSE's are difficult to group.  When the kernel scheduler goes from
one KSE to another as a result of sleeping, rather than an involuntary
context switch, the chance that you are going to have to do a full
address map and register reload ("context switch between
''processes''") is equal to:

	(total KSE's - process KSE's) / total KSE's

In other words, context switch overhead is generally the same as if
you were running multiple processes instead of multiple KSE's per
process.

Furthermore, even if you were to do preferential scheduling of KSE's
from a single group (equivalent to a thread group in a process, or a
"multithreaded process"), and you used a "quantum counting" technique
to guard against starvation of other KSE's, then, when the scheduler
activation is the result of a sleep rather than a quantum clock tick
("LBOLT"), you will, statistically, achieve only a best case average
utilization of quantum/2 before a context switch that requires a full
task switch.
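To put rough numbers on the two claims above (the KSE counts and the
quantum length here are made-up example values, not measurements of
any real system):

/*
 * Back-of-the-envelope sketch of the argument above; the KSE counts
 * and quantum length are hypothetical example values.
 */
#include <stdio.h>

int
main(void)
{
	double total_kses   = 100.0;	/* all KSE's competing for quantum */
	double process_kses = 10.0;	/* KSE's backing our one process   */
	double quantum_ms   = 10.0;	/* example scheduler quantum        */

	/*
	 * Chance that the KSE scheduled after a sleep belongs to some
	 * other process, forcing a full address map and register
	 * reload rather than just a register reload.
	 */
	double p_full_switch = (total_kses - process_kses) / total_kses;

	/*
	 * With sleeps landing at random points inside the quantum, the
	 * average useful fraction of the quantum is about half of it.
	 */
	double avg_useful_ms = quantum_ms / 2.0;

	printf("P(full context switch) = %.2f\n", p_full_switch);
	printf("average quantum used before switching = %.1f ms\n",
	    avg_useful_ms);
	return (0);
}

With those example numbers, 90% of post-sleep reschedules are
cross-process switches, and on average only half a quantum is used
before paying for one.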
Comparatively, using a user space scheduler, the quantum is fully
utilized (a factor of two improvement in reduction of context switch
overhead), and the context switch between threads is the same as the
best case for kernel threads, which is a register reload.

One need only apply this same metric to cache busting, TLB shootdown,
processor migration, and other dangers, and a strong picture favoring
user space threading appears.

Your claim that user space threads are definitionally more expensive
than kernel space threads is verifiably false.  Further, in the very
best possible case, we see that overhead (not including protection
domain crossing) for kernel threads only begins to approach user space
threads as:

	(total KSE's - process KSE's) / total KSE's

approaches 1.  The approach is asymptotic at best, since we have the
minimum set of system support daemons in the process queue competing
for quantum.

> User-space threads are not inherently bad, they just have different
> tradeoffs than kernel threads.

Yes.  Better tradeoffs.

> > Yes.  The NFS code can return "EWOULDBLOCK", if the operation would
> > block.
>
> But does libc_r do this?  As I see it, there are two answers, and
> both are at least somewhat bad:
>
> 1)	Yes.  Which means that a significant fraction of disk I/O
>	will require extra user-space thread context switches.
>
> 2)	No.  Which means that slow I/O will stall all the threads.

Conversion does not have to be to blocking calls on non-blocking
descriptors.  It can be to non-blocking calls on descriptors.

No, this is not currently done, but the effect of #1 is not as bad as
you would think; it is an additional overhead of 6 us if the data is
not in cache, and an additional overhead of 0 us if it is.  Smart
programmers will organize their code to trigger predictive read-ahead
so that the data will be in cache.  As the "NULL" system call latency
decreases, so does the overhead.

Compare this to a:

	((total KSE's - process KSE's) / total KSE's) * 100

percent risk of taking a full context switch overhead in the kernel
case, plus the management overhead of ensuring that there are no user
space threads in ready-to-run state that are stalled for lack of
kernel space threads to back them, plus the additional scheduler
overhead of the more complex scheme, plus the asymmetric CPU
availability associated with differential CPU load when you attempt
to implement CPU affinity.

> > No.  Both cases should result in an EWOULDBLOCK and a threads
> > context switch, pending the data being present to be read, since
> > non-blocking I/O is being substituted.
>
> Which means unnecessary context switches, when simply waiting would
> be better.

Threads context switches, not process context switches.

> The problem is, if you want to avoid the occasional long delay, you
> have to accept extra context switches all the time.  Not necessarily
> the worst thing in the world, but it's a tradeoff.

You are confusing "pool retention time" (latency) with stalling.

The problem you are not addressing is that latency merely implies that
the I/O requests are interleaved and satisfied after some delay, while
allowing multiple outstanding requests.  Stalling, on the other hand,
means that no scheduled work is occurring.  You aren't stalled if you
are waiting for a scheduled DMA to complete, only if you are waiting
to schedule a DMA.
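For illustration, here is roughly what "non-blocking calls on
descriptors" plus an EWOULDBLOCK-driven threads context switch looks
like.  This is a sketch, not the actual libc_r wrapper: threaded_read()
is a made-up name, and sched_yield() stands in for the user space
scheduler's own thread switch.

#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

ssize_t
threaded_read(int fd, void *buf, size_t nbytes)
{
	ssize_t n;

	/* Ensure the descriptor is non-blocking. */
	(void)fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

	for (;;) {
		n = read(fd, buf, nbytes);
		if (n >= 0)
			return (n);	/* data was already available */
		if (errno != EWOULDBLOCK && errno != EAGAIN)
			return (-1);	/* real error */

		/*
		 * The call would have blocked: do a threads context
		 * switch instead of sleeping in the kernel.  A real
		 * threads library would park this thread on a wait
		 * queue keyed by the descriptor and mark it runnable
		 * when select()/poll() reports the descriptor ready;
		 * sched_yield() merely stands in for that here.
		 */
		(void)sched_yield();
	}
}

When the data is already in the buffer cache, the read() succeeds
immediately and the retry path never runs (the 0 us case above); the
EWOULDBLOCK path is where the extra threads context switch comes from.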
> > > I see them all the time.  'gethostbyname' is a good example.
> >
> > Are you forcing the use of TCP for this?  This results in a spin
> > loop.
> >
> > Please obtain and compile the libresolver from bind 8.x, which is
> > reentrant, and link it before you link libc_r.
>
> I am calling 'gethostbyname'.  Is that wrong?

Yes.

> Bind's license, unfortunately, prohibits me from linking to it.  Once
> I write my own resolver library, this problem goes away.  But not
> everyone can spend the time to do that to optimize for a platform.

It's the same license on the 4.x resolver in libc.  I don't see how
you are prevented from linking with one, but not the other.  The ISC
wrote (and licensed) both.

> That's not what I'm saying.  I'm saying it's a painful tradeoff.
> What you want is a thread to block if the I/O takes too long.  You
> don't have that choice.

If a kernel thread sleeps, for any reason, you have, with a high
statistical probability, lost your quantum and taken a full context
switch overhead.

Disk wait queue completions do not run in the context of the kernel
threads making the call; they run at interrupt level.

> > If it's "not that bad", then it won't take 10 years to fix.
>
> Yes, I've been waiting for fixes in FreeBSD's threads implementation
> for more than a year now.  The vast majority of them have taken
> place, and I'm fairly happy with the current state of FreeBSD's
> threads support.

I can't speak for the FreeBSD development process.

> However, it is really not as good as the threads support on many
> other operating systems, including NT.  If you need stellar threads
> support, FreeBSD is not the operating system you probably want to
> use.

I still think this is based on a false premise.

> At the current rate of progress though, this could change in a few
> months.

Well, that's something, I suppose.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.