From owner-freebsd-hackers  Wed Aug  8  0:26:57 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from robin.mail.pas.earthlink.net (robin.mail.pas.earthlink.net [207.217.120.65])
	by hub.freebsd.org (Postfix) with ESMTP id CD93437B401
	for <freebsd-hackers@freebsd.org>; Wed,  8 Aug 2001 00:26:52 -0700 (PDT)
	(envelope-from tlambert2@mindspring.com)
Received: from mindspring.com (dialup-209.245.139.128.Dial1.SanJose1.Level3.net [209.245.139.128])
	by robin.mail.pas.earthlink.net (EL-8_9_3_3/8.9.3) with ESMTP id AAA14369;
	Wed, 8 Aug 2001 00:26:42 -0700 (PDT)
Message-ID: <3B70E9DB.B16F409C@mindspring.com>
Date: Wed, 08 Aug 2001 00:27:23 -0700
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: void <float@firedrake.org>
Cc: freebsd-hackers@freebsd.org
Subject: Re: Allocate a page at interrupt time
References: <200108070739.f777dmi08218@mass.dis.org> <3B6FB0AE.8D40EF5D@mindspring.com> <20010807221509.A24999@firedrake.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-hackers>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-hackers>
X-Loop: FreeBSD.ORG

void wrote:
> > Can you name one SMP OS implementation that uses an
> > "interrupt threads" approach that doesn't hit a scaling
> > wall at 4 (or fewer) CPUs, due to heavier weight thread
> > context switch overhead?
> 
> Solaris, if I remember my Vahalia book correctly (isn't that a favorite
> of yours?).

As usual, IMO...

Yes, I like the Vahalia book; I did technical review of
it for Prentice Hall before its publication.

Solaris hits the wall a little later, but it still hits the
wall.  On Intel hardware, it has historically hit it at the
same 4 CPUs where everyone else tends to hit it, for the same
reasons; as of Solaris 2.6, they have adopted the hybrid per
CPU pool model recommended in Vahalia (Chapter 12).

While I'm at it, I suppose I should recommend reading the
definitive Solaris internals book, to date:

	Solaris Internals, Core Kernel Architecture
	Jim Mauro, Richard McDougall
	Prentice Hall
	ISBN: 0-13-022496-0

Solaris does use interrupt threads for some interrupts; I
don't like the idea, for the reasons stated previously.

Solaris claims to scale to 64 processors while maintaining
SMP, rather than real or virtual NUMA.  It's been my own
experience that this scaling claim is not entirely accurate,
if what you are doing is a lot of kernel processing.  On the
other hand, if you are running a lot of non-intersecting
user space code (e.g. JVMs or CGIs), it's not as bad (and
realize that FreeBSD is not that bad in the same situation,
either; that workload is just not as common in practice as it
is in theory).

It should be noted that Solaris interrupt threads are only
used for interrupts of priority 10 and below; higher priority
interrupts (priority levels 11 to 15) are _NOT_ handled by
threads.  Level 10 is the clock interrupt.

It should also be noted that Solaris maintains a per processor
pool of interrupt threads for each of the lower priority
interrupts, with a global thread that is used for handling of
the clock interrupt.  This is _very_ different than taking an
interrupt thread, and rescheduling it on an arbitrary CPU,
and as others have pointed out, the hardware used to do the
scheduling is very different.
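
To make the split concrete, here is a minimal sketch in C of the
dispatch policy described in the two paragraphs above.  The enum
and function names are hypothetical, and the lowest level of 1 is
my assumption; only the 10 / 11-15 boundaries come from the text.
This is an illustration, not Solaris source.

```c
#include <assert.h>

/* Levels 10 and 11..15 are from the text; IPL_LOWEST = 1 is assumed. */
enum { IPL_LOWEST = 1, IPL_CLOCK = 10, IPL_HIGHEST = 15 };

/* Hypothetical labels for the three handling paths described above. */
enum intr_path {
    PATH_PER_CPU_THREAD,  /* below 10: thread from the per-CPU pool */
    PATH_CLOCK_THREAD,    /* exactly 10: the single global clock thread */
    PATH_DIRECT           /* 11..15: handled without a thread */
};

/* Map an interrupt priority level to the path that would service it. */
static enum intr_path classify_interrupt(int ipl)
{
    assert(ipl >= IPL_LOWEST && ipl <= IPL_HIGHEST);
    if (ipl > IPL_CLOCK)
        return PATH_DIRECT;
    if (ipl == IPL_CLOCK)
        return PATH_CLOCK_THREAD;
    return PATH_PER_CPU_THREAD;
}
```

The point of the per-CPU case is that the thread is chosen from
the pool of the CPU that took the interrupt, so nothing has to be
rescheduled onto an arbitrary CPU.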

In the 32 processor Sequent boxes, the actual system bus was
different, and directly supported message passing.

There is also specific hardware support for handling interrupts
via threads, which is really not applicable to x86 or even the
Alpha architectures on which FreeBSD currently runs, nor to the
IA64 architecture (port in progress).  In particular, there is
a single system wide table, introduced with the UltraSPARC, that
doesn't need to be locked to support interrupt handling.

Also, the Sun system is still an IPL system, using level based
blocking, rather than masking, and these threads can find
themselves blocked on a mutex or condition variable for a
relatively long time; if this happens, the system resumes the
previous thread _but does not drop its IPL below that of the
suspended thread_, which is basically the Dijkstra Banker's
Algorithm method of avoiding priority inversion on interrupts
(i.e. ugly).
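
A rough model of that rule, in C, may help.  I am reading
"suspended thread" as the blocked interrupt thread, and the two
function names are hypothetical; this sketches the level based
blocking idea only, and is not taken from any Solaris code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * IPL in effect after an interrupt thread blocks and the previous
 * thread is resumed: the level is not allowed to fall below that of
 * the blocked interrupt thread.
 */
static int ipl_after_block(int resumed_thread_ipl, int blocked_intr_ipl)
{
    assert(resumed_thread_ipl >= 0 && blocked_intr_ipl >= 0);
    return resumed_thread_ipl > blocked_intr_ipl
        ? resumed_thread_ipl
        : blocked_intr_ipl;
}

/* Level based blocking: only strictly higher levels get through. */
static bool deliverable(int current_ipl, int intr_ipl)
{
    return intr_ipl > current_ipl;
}
```

So if a level 5 interrupt thread blocks on a mutex, the resumed
thread runs with everything at level 5 and below held off until
the interrupt thread gets to finish.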

Finally, the Sun system "borrows" the context of the interrupted
process (thread) for interrupt handling (the LWP).  This is very
similar to the technique employed with kernel vs. user space
thread associations within the Windows kernels (this was one of
the steps I was referring to when I said that NT had dealt with
a number of scaling issues before it needed to, so that they
would not turn into problems on 8-way and higher systems).

Personally, I think that the Sun system is extremely susceptible
to receiver livelock.  Network interrupts are at 7, and disk
interrupts are at 5, which means that so long as you are getting
pounded with network interrupts for e.g. NFS read or write
requests, you're not going to service the disk interrupts that
will let you dispose of the traffic, nor will you run the user
space code for things like CGIs or Apache servers trying to
service a heavy load of requests for content.
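
The starvation argument can be shown with a toy simulation in C.
Everything here is hypothetical (one unit of work per tick,
periodic arrivals, the struct and function names); only the
ordering "network at 7 beats disk at 5 beats user space" is taken
from the paragraph above.

```c
#include <assert.h>

struct tally {
    unsigned net_done;      /* network interrupts serviced (level 7) */
    unsigned disk_done;     /* disk interrupts serviced (level 5) */
    unsigned user_ticks;    /* ticks left over for user space */
    unsigned disk_backlog;  /* disk interrupts still pending at the end */
};

/*
 * One unit of work per tick, strict priority: network first, then
 * disk, then user space.  A network interrupt arrives every
 * net_period ticks and a disk interrupt every disk_period ticks.
 */
static struct tally simulate(unsigned ticks, unsigned net_period,
                             unsigned disk_period)
{
    struct tally t = { 0, 0, 0, 0 };
    unsigned net_pending = 0, disk_pending = 0;

    assert(net_period > 0 && disk_period > 0);
    for (unsigned i = 0; i < ticks; i++) {
        if (i % net_period == 0)
            net_pending++;
        if (i % disk_period == 0)
            disk_pending++;

        if (net_pending > 0) {
            net_pending--;
            t.net_done++;
        } else if (disk_pending > 0) {
            disk_pending--;
            t.disk_done++;
        } else {
            t.user_ticks++;
        }
    }
    t.disk_backlog = disk_pending;
    return t;
}
```

With simulate(100, 1, 10), i.e. a network interrupt on every
tick, disk_done and user_ticks both stay at zero and the disk
backlog just grows; back the network load off to every fourth
tick and both disk and user space get serviced again.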

I'm also not terrifically impressed with their callout mechanism,
when applied to networking, which has a preponderance of fixed,
known interval timers, but FreeBSD's isn't really any better
when it comes to huge numbers of network connections, since it
will end up hashing 2/4/6/8/... into the same bucket, unordered,
which means traversing a large list of timers which are not
going to end up expiring (callout wheels are not a good thing to
mix with fixed interval timers of relatively long durations,
like the 2MSL timers that live in the networking code, or most
especially the TIME_WAIT timers).
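
To illustrate the bucket problem, here is a simplified model of a
hashed callout wheel in C.  The wheel size, the one-timer-per-tick
arrival pattern, and all of the names are assumptions made for
the example; it is a sketch of the general technique, not the
FreeBSD or Solaris callout code.

```c
#include <assert.h>

#define WHEEL_SIZE 256u                 /* assumed; must be a power of two */
#define WHEEL_MASK (WHEEL_SIZE - 1u)

struct scan {
    unsigned visited;   /* entries walked in the bucket for this tick */
    unsigned fired;     /* entries that actually expire on this tick */
};

/* A timer lands in the bucket selected by the low bits of its expiry. */
static unsigned bucket_of(unsigned expiry_tick)
{
    return expiry_tick & WHEEL_MASK;
}

/*
 * Model: n timers, one armed on each of ticks 0..n-1, all with the
 * same fixed interval.  Report the cost of processing tick `now`.
 */
static struct scan scan_bucket(unsigned n, unsigned interval, unsigned now)
{
    struct scan s = { 0, 0 };

    assert(interval > 0);
    for (unsigned armed = 0; armed < n; armed++) {
        unsigned expiry = armed + interval;

        if (armed > now || expiry < now)
            continue;   /* not armed yet, or already gone */
        if (bucket_of(expiry) != bucket_of(now))
            continue;   /* lives in some other bucket */
        s.visited++;
        if (expiry == now)
            s.fired++;
    }
    return s;
}
```

With these made-up numbers, scan_bucket(6000, 6000, 6000) walks
24 entries to fire 1; the other 23 are timers whose expiry is a
multiple of the wheel size further out.  Scale the arrival rate
up to a busy server's connection rate and the wasted traversal
scales with it.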

-- Terry
