Date:      Thu, 09 Aug 2001 03:13:58 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Greg Lehey <grog@FreeBSD.org>
Cc:        void <float@firedrake.org>, freebsd-hackers@FreeBSD.org
Subject:   Re: Allocate a page at interrupt time
Message-ID:  <3B726266.8026211C@mindspring.com>
References:  <200108070739.f777dmi08218@mass.dis.org> <3B6FB0AE.8D40EF5D@mindspring.com> <20010807221509.A24999@firedrake.org> <3B70E9DB.B16F409C@mindspring.com> <20010809115742.E73579@wantadilla.lemis.com>

Greg Lehey wrote:
> > Solaris hits the wall a little later, but it still hits the
> > wall.
> 
> Every SMP system experiences performance degradation at some point.
> The question is a matter of the extent.

IMO, 16 processors is not unreasonable, even with standard APIC
based SMP.  32 is out of the question, but that's mostly because
you can't have more than 32 APIC ID's in the current 32 bit
processors, and still give one or more away to an IO APIC.  8-).


> > On Intel hardware, it has historically hit it at the same 4 CPUs
> > where everyone else tends to hit it, for the same reasons;
> 
> This is a very broad statement.  You contradict it further down.

I contradict it for SPARC; I don't think I contradicted it
for Intel, but am willing to take correction...


> > Solaris claims to scale to 64 processors while maintaining SMP,
> > rather than real or virtual NUMA.  It's been my own experience that
> > this scaling claim is not entirely accurate, if what you are doing
> > is a lot of kernel processing.
> 
> I think that depends on how you interpret the claim.  It can only mean
> that adding a 64th processor can still be of benefit.

The "4 processors" Intel claim is a "point of diminishing"
returns, and is well enough known that it is almost passed
into folklore (which might not bode well for finding people
building boards with more, which would be unfortunate).  My
SPARC experience is likewise a "diminshing returns", where
it becomes cheaper to buy another box to get the performance
increment, than to stick more processors in the same box.
It's definitely anecdotal on my part.


> > On the other hand, if you are running a lot of non-intersecting user
> > space code (e.g. JVM's or CGI's), it's not as bad (and realize that
> > FreeBSD is not that bad in the same situation, either: it's just not
> > as common in practice as it is in theory).
> 
> You're just describing a fact of life about UNIX SMP support.

"Practice vs. Theory"?  Or the inevitability of UNIX SMP support
having the performance characteristics it has most places?  I don't
buy the "we must live with the performance because it's UNIX"
argument, if you meant the latter.


> > It should be noted that Solaris Interrupt threads are only
> > used for interrupts of priority 10 and below: higher priority
> > interrupts are _NOT_ handled by threads (interrupts at a
> > priority level from 11 to 15).  10 is the clock interrupt.
> 
> FreeBSD also has provision for not using interrupt threads for
> everything.  It's clearly too early to decide which interrupts should
> be left as traditional interrupts, and we've done some shifting back
> and forth to get things to work.  Note that the priority numbers are
> noise.  In this statement, they're just a convenient way to
> distinguish between threaded and non-threaded interrupts.

FreeBSD masks, Solaris IPLs.  In context, this was meant to
show why Solaris' approach is not directly translatable to
FreeBSD.

I really can't buy the idea that interrupt threads are a good
idea for anything that can flood your bus or interrupt bandwidth,
or have tiny/non-existent FIFOs, relative to the speeds they are
being pushed; right now that means "might be OK for disks; not OK
for really fast network controllers, not OK for sorta fast network
controllers without a lot of adapter RAM, not OK for serial ports
and floppies", at least in my mind.



> I think somebody else has pointed out that we're very conscious of CPU
> affinity.

I think affinity isn't enough; I've expressed this to Alfred on a
number of occasions already, when I see him in the hallway or at
lunch.  Dealing with the problem is kind of an all-or-nothing bid.


> > In the 32 processor Sequent boxes, the actual system bus was
> > different, and directly supported message passing.
> 
> Was this better or worse?

For the intent, much better.  It meant that non-intersecting
CPU/peripheral paths could run simultaneously.  The Harris
H1000 and H1200 had a similar thing (big iron real time
systems used on Navy ships and at the college where Wes and
I went to school).


> > Also, the Sun system is still an IPL system, using level based
> > blocking, rather than masking, and these threads can find
> > themselves blocked on a mutex or condition variable for a
> > relatively long time; if this happens, it resumes the previous
> > thread _but does not drop its IPL below that of the suspended
> > thread_, which is basically the Dijkstra Banker's Algorithm
> > method of avoiding priority inversion on interrupts (i.e. ugly).
> 
> So you're saying we're doing it better?

Long term priority lending is the real problem I'm noting; this
is an artifact of context "borrowing", more than anything else
(more below).

I think the FreeBSD use of masking is better than IPL'ing, and is
an obvious win in the case of multiple cards, since you can run
multiple interrupt handlers at the same time, but wonder what will
happen when/if it gets to UltraSPARC hardware.  I think the Dijkstra
algorithm, in which contended resources are prereserved based on an
anticipated need, tends to result in a lot of phantom contention:
the resource may not end up being needed in the particular execution
path (e.g. you grab the source and target directory vnode and the
vnode of the file being renamed between the two, only to find a
permission problem on a rename, etc.).  Even success cases tend to
have overly aggressive resource grabbing, using this approach.  It's
like grabbing the BGL to do a "getpid" call.
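
To make the phantom contention concrete, here is a minimal
userland sketch of the pre-reservation pattern, with pthread
mutexes standing in for vnode locks (all names invented; this is
obviously not the real rename path):

/*
 * Sketch only: locks are always taken in the same order to keep
 * the sketch deadlock-free, and the permission check is reduced
 * to a mode-bit test.
 */
#include <errno.h>
#include <pthread.h>

struct vnode_sim {
        pthread_mutex_t lock;
        int mode;                       /* simplified permission bits */
};

static int
rename_prereserved(struct vnode_sim *srcdir, struct vnode_sim *dstdir,
    struct vnode_sim *file)
{
        /* Pre-reserve everything the full path *might* need. */
        pthread_mutex_lock(&srcdir->lock);
        pthread_mutex_lock(&dstdir->lock);
        pthread_mutex_lock(&file->lock);

        if ((srcdir->mode & 0200) == 0) {
                /*
                 * Permission failure: all three locks were contended
                 * for nothing -- the "phantom contention" above.
                 */
                pthread_mutex_unlock(&file->lock);
                pthread_mutex_unlock(&dstdir->lock);
                pthread_mutex_unlock(&srcdir->lock);
                return (EACCES);
        }

        /* ... the actual rename work would go here ... */

        pthread_mutex_unlock(&file->lock);
        pthread_mutex_unlock(&dstdir->lock);
        pthread_mutex_unlock(&srcdir->lock);
        return (0);
}

int
main(void)
{
        struct vnode_sim a, b, c;

        pthread_mutex_init(&a.lock, NULL);
        pthread_mutex_init(&b.lock, NULL);
        pthread_mutex_init(&c.lock, NULL);
        a.mode = 0;                     /* source directory not writable */
        b.mode = c.mode = 0200;
        return (rename_prereserved(&a, &b, &c) == EACCES ? 0 : 1);
}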


> > Finally, the Sun system "borrows" the context of the interrupted
> > process (thread) for interrupt handling (the LWP).  This is very
> > similar to the technique employed with kernel vs. user space thread
> > associations within the Windows kernels (this was one of the steps I
> > was referring to when I said that NT had dealt with a number of
> > scaling issues before it needed to, so that they would not turn into
> > problems on 8-way and higher systems).
> 
> This is also the method we're planning to use, as I'm sure you're
> aware from previous messages on the -smp list.

It's similar, but not identical to the NT approach.  The NT
approach actually necessitates marshalling locally heap- and
stack-allocated objects between threads, rather than simply
passing a pointer in a shared address space.  This avoids the
Sun problem with contention (on Sun, this is solved by special
hardware help).  On Intel, borrowing like this is potentially
a big source of contention, unless you do what NT did; their
solution is "a solution", but is probably not the best you
could do; it introduces the need to be very careful about
timer outcalls and externalized synchronization interfaces,
since you might grab a resource, block, have a timer go off,
have it "borrow" your context to execute, and then discover that
it was permitted to grab a recursive mutex because it was the
"same" kernel thread.  We fought this problem when we ported
the UFS/FFS code, and added our own Soft Updates implementation
to Windows back in 1995, when running the syncd as a timer
event.
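
For what it's worth, the hazard is easy to demonstrate in
miniature with a POSIX recursive mutex playing the role of the
kernel lock (invented names; this is not NT or FreeBSD code):

/*
 * If a timer callout runs on the thread that owns a recursive
 * lock, the ownership check passes and the callout walks into
 * state it should never see.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t state_lock;

static void
timer_callout(void)
{
        /*
         * Because the callout "borrowed" the owning thread's
         * context, this recursive acquire succeeds instead of
         * blocking, even though the state is mid-update.
         */
        pthread_mutex_lock(&state_lock);
        printf("callout saw inconsistent state\n");
        pthread_mutex_unlock(&state_lock);
}

int
main(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&state_lock, &attr);

        pthread_mutex_lock(&state_lock);   /* begin critical update */
        timer_callout();                   /* "timer fires" on this thread */
        pthread_mutex_unlock(&state_lock);
        return (0);
}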

I know that it makes interrupt threads much less costly to
use, but it seems fraught with peril.


> > Personally, I think that the Sun system is extremely susceptible to
> > receiver livelock (Network interrupts are at 7, and disk interrupts
> > are at 5, which means that so long as you are getting pounded with
> > network interrupts for e.g. NFS read or write requests, you're not
> > going to service the disk interrupts that will let you dispose of
> > the traffic, nor will you run the user space code for things like
> > CGI's or Apache servers trying to service a heavy load of requests
> > for content).
> 
> Can you point to a system which does better, and explain why?

LRP, from Rice University, for one.  The first two approaches
in the Mogul paper, for others (interrupt threads were the third
and final approach in the Mogul paper).

I've posted a brief analysis of the Mogul paper's use of interrupt
threads as one of several potential solutions to the problem it was
trying to address, in a different message.  What it boils down to in
the context of FreeBSD on current and anticipated hardware is that
livelock can come from bus contention, not just interrupt
contention, and that the latency from using a polling thread
(an interrupt thread run after setting a "needs service" bit) in
a situation with something fast enough to cause problems is
going to be unacceptably high, and the cure for that is fixed
percentage scheduling, which will end up being non-optimal.
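
For illustration, here is a minimal userland sketch of the
"needs service" bit scheme, with pthreads standing in for the
kernel machinery (all names invented).  The handler does almost
nothing; the real work, and hence the latency, moves into a
scheduled thread:

#include <pthread.h>

static int needs_service;
static pthread_mutex_t svc_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t svc_cv = PTHREAD_COND_INITIALIZER;

/* Model of the hardware interrupt: set the bit, kick the thread. */
static void
nic_intr(void)
{
        pthread_mutex_lock(&svc_mtx);
        needs_service = 1;
        pthread_cond_signal(&svc_cv);
        pthread_mutex_unlock(&svc_mtx);
}

/* The "interrupt thread": all real work happens here, later. */
static void *
nic_service_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&svc_mtx);
        for (;;) {
                while (!needs_service)
                        pthread_cond_wait(&svc_cv, &svc_mtx);
                needs_service = 0;
                pthread_mutex_unlock(&svc_mtx);
                /*
                 * Drain FIFOs, refill the receive ring, etc.  The gap
                 * between nic_intr() and getting scheduled here is the
                 * latency a small-FIFO device cannot tolerate.
                 */
                pthread_mutex_lock(&svc_mtx);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, nic_service_thread, NULL);
        nic_intr();             /* pretend the card raised an interrupt */
        pthread_join(t, NULL);  /* never returns; the model runs forever */
        return (0);
}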


> > I'm also not terrifically impressed with their callout mechanism,
> > when applied to networking, which has a preponderance of fixed,
> > known interval timers, but FreeBSD's isn't really any better, when
> > it comes to huge numbers of network connections, since it will end
> > up hashing 2/4/6/8/... into the same bucket, unordered, which means
> > traversing a large list of timers which are not going to end up
> > expiring (callout wheels are not a good thing to mix with fixed
> > interval timers of relatively long durations, like the 2MSL timers
> > that live in the networking code, or most especially the TIME_WAIT
> > timers).
> 
> I haven't looked at this issue.  How does it differ from the original
> System V implementation?  Are you implying that we should use a
> different hash algorithm?

Their implementation is hierarchical and partitioned; if you crack
the book, it's covered in detail starting around page 50.

The main problem in FreeBSD comes when you look at the callout
wheel size necessary to handle a lot of network connection
timers at the same time.  The idea behind the doubly linked list
is predicated on the premise that timers are removed before firing
in most cases: they are generally things like retransmit timers,
etc.
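
For concreteness, here is a minimal model of the sort of callout
wheel under discussion (invented names; not the actual
kern_timeout.c code): expiry ticks hash into a bucket, entries
within a bucket are unordered, and the clock tick has to examine
every entry in the current bucket:

#include <stddef.h>

#define WHEEL_SIZE 256          /* must be a power of two */
#define WHEEL_MASK (WHEEL_SIZE - 1)

struct callout {
        struct callout *next, *prev;    /* doubly linked: O(1) removal */
        int expire_tick;                /* absolute tick of expiry */
        void (*func)(void *);
        void *arg;
};

static struct callout *wheel[WHEEL_SIZE];
static int curtick;

/* Head insertion: entries within a bucket stay unordered. */
static void
callout_insert(struct callout *c)
{
        struct callout **bp = &wheel[c->expire_tick & WHEEL_MASK];

        c->prev = NULL;
        c->next = *bp;
        if (*bp != NULL)
                (*bp)->prev = c;
        *bp = c;
}

/* One clock tick: scan the WHOLE bucket, firing only due entries. */
static void
softclock_tick(void)
{
        struct callout *c, *next;

        curtick++;
        for (c = wheel[curtick & WHEEL_MASK]; c != NULL; c = next) {
                next = c->next;
                if (c->expire_tick != curtick)
                        continue;       /* scanned, but not due: pure waste */
                if (c->prev != NULL)
                        c->prev->next = c->next;
                else
                        wheel[curtick & WHEEL_MASK] = c->next;
                if (c->next != NULL)
                        c->next->prev = c->prev;
                c->func(c->arg);
        }
}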

This falls down in a number of important places:

o	Even timers that are removed take six memory operations,
	compared to the one that was used for the previous TCP
	timer method.

o	Load that prevents other work (e.g. transmitter starvation
	or receiver livelock) exacerbates the situation, leading
	to catastrophic failure (work up to load X, then fall off
	a cliff to 0).

o	With a huge number of events, even if you are going to
	remove them later, you have to make the wheel very large,
	or the number of events in one bucket is going to get very
	large very fast, and since they are not ordered in
	increasing time-to-fire, you end up scanning all entries
	in a bucket.

o	Making the wheel bigger takes more memory.

o	Making the entry list bigger increases the clock interrupt
	latency linearly (above some fixed cost for zero entries);
	if the list gets too big, you take more time scanning than
	there is until the next clock interrupt (meaning you _MUST_
	increase the number of buckets).

o	If the number of buckets gets very large, the timer
	resolution is degraded.

o	Timeout use is non-uniform across subsystems; it is mostly
	networks that end up sourcing (or sinking) traffic, which
	can be thought of as work-to-do requests; but this also
	means that there will be other timers in the list which
	aren't so nicely removing themselves before they ever
	fire.

o	You could decide to "hint" about timers that were "expected"
	to be removed before firing, and stick them at the top of
	the buckets, but this still fails under load, and you are
	using a hash, and without knowledge of the other timer
	characteristics, you could easily put it in "the next bucket
	to be processed at the next clock interrupt".

I think this can be addressed by setting aside a small set of lists
of fixed interval timers: timers on these lists will all be the
same duration; anything else won't be on a list, it will be sent to
the callout wheel, which will remain in place.  Lists might be 1MSL,
2MSL, TIME_WAIT timeout, FIN_WAIT_2 timeout, etc.  Events will be
inserted via tail insertion...

What this does is inherently time order the lists in increasing time
to fire for a given interval.  You can traverse the list heads in
most cases very quickly, on the assumption that the timers will be
removed before they can fire, and thus the head of the lists will
not be ready to fire.  If they are ready, you only need to traverse
to the depth of the first one that is ready to fire: an N/(N+1)
"cache hit rate", for a depth of N.  If not, you get a "miss" on
the first entry, and go on to the next fixed interval list.
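
A minimal sketch of one such fixed interval list, with invented
names and my own assumptions about the tick bookkeeping: tail
insertion keeps the list sorted by expiry, so the scan stops at
the first entry that isn't due:

#include <stddef.h>

struct ivtimer {
        struct ivtimer *next;
        int expire_tick;                /* absolute tick of expiry */
        void (*func)(void *);
        void *arg;
};

struct ivlist {
        int interval;                   /* fixed: e.g. 2MSL in ticks */
        struct ivtimer *head, *tail;
};

/* Tail insertion: the list is sorted by expiry by construction. */
static void
ivlist_add(struct ivlist *l, struct ivtimer *t, int curtick)
{
        t->expire_tick = curtick + l->interval;
        t->next = NULL;
        if (l->tail != NULL)
                l->tail->next = t;
        else
                l->head = t;
        l->tail = t;
}

/*
 * Fire due timers; the common case ("head not due yet") is a
 * single comparison, and the scan never looks past the first
 * entry that is not ready to fire.
 */
static void
ivlist_run(struct ivlist *l, int curtick)
{
        struct ivtimer *t;

        while ((t = l->head) != NULL && t->expire_tick <= curtick) {
                l->head = t->next;
                if (l->head == NULL)
                        l->tail = NULL;
                t->func(t->arg);
        }
}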

This adds a tiny amount to the fixed overhead in the timer code,
while at the same time removing the network timeout processing
overhead from the general case, where you have to scan _every_
bucket entry to get at the ones which are in the process of
firing.

I haven't really explored opportunistic timers in much depth,
but it seems that they would benefit from this approach as well;
for network processing, this approach pretty much negates the
overall benefit of opportunistic timers, I think... criticisms
welcome, of course.

-- Terry
