From owner-freebsd-alpha  Fri Sep  6  5:37:59 2002
Delivered-To: freebsd-alpha@freebsd.org
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 08E4D37B400; Fri,  6 Sep 2002 05:37:53 -0700 (PDT)
Received: from harrier.mail.pas.earthlink.net (harrier.mail.pas.earthlink.net [207.217.120.12])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 454D443E6A; Fri,  6 Sep 2002 05:37:52 -0700 (PDT)
	(envelope-from tlambert2@mindspring.com)
Received: from pool0015.cvx21-bradley.dialup.earthlink.net ([209.179.192.15] helo=mindspring.com)
	by harrier.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 17nIMl-00052K-00; Fri, 06 Sep 2002 05:37:39 -0700
Message-ID: <3D78A148.F25A8F27@mindspring.com>
Date: Fri, 06 Sep 2002 05:36:24 -0700
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: ticso@cicely.de
Cc: John Baldwin <jhb@FreeBSD.ORG>,
	Andrew Gallatin <gallatin@cs.duke.edu>, freebsd-alpha@FreeBSD.ORG
Subject: Re: ithread preemption
References: <XFMail.20020905163105.jhb@FreeBSD.org> <3D78098B.CEBF13EC@mindspring.com> <20020906090517.GI13050@cicely9.cicely.de> <3D78925A.DAA13463@mindspring.com> <20020906120011.GO13050@cicely9.cicely.de>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-alpha@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-alpha.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-alpha>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-alpha>
X-Loop: FreeBSD.org

Bernd Walter wrote:
> > > Interrupts are disabled globally on alpha too.
> > > The only platform where we disable on the CPU is the PC164 as
> > > a workaround, but this system is UP.
> >
> > This misses the point.  The point is whether or not ithreads
> > run to completion on the processor to which the interrupt is
> > delivered.
> 
> What I missed is why you think this is different with APICs on i386.
> Well I have to say that I don't know much about APICs.

Ah.  Because even though Intel recommends that SMP systems run in
virtual wire mode, FreeBSD does not run in virtual wire mode.
Instead, in the older SMP code, grabbing the Giant lock grabs the
interrupt.  In the SMPng code, the interrupt routing is explicitly
managed, and it's still not an issue.


> > The problem in this case on the Alpha is that interrupts are
> > routed through the PAL code on a particular processor, and
> > so the return has to be to the PAL code on the same processor,
> > because there is a context cons'ed up for it that has to be
> > destructed on the same CPU where it was cons'ed.
> 
> That's the most logical (if not the only) theory so far.

It's the only one I've seen.  Even if it turns out not to be
the cause of the particular problem, it *could* cause problems,
according to the PAL documentation (what there is of it) online.

> > > I expect ithreads to be one of the less critical points on NUMA.
> >
> > The problem in this case is that you have to do completion
> > counting on the ISR's for any given interrupt.  For example,
> > on an SMP box running two CPUs where IRQ A is a shared interrupt
> > for two devices, then if you want to dispatch for one device to
> > the first CPU, and the second device to the second, then when
> > you reenable interrupts depends on all ISRs having run to
> > completion.  So you have to set a global count to "2", for the
> > number of ISRs that have to run, and then run each one on a CPU,
> > decrement the count, and when the count goes from 1->0, reenable
> > the interrupt.
> 
> If you want more than one ithread per intline I agree.

I think it may be a requirement in the future.  I don't think
spreading them out across intlines is going to work, unless you
can either wire them down (the PCI code allows it, I guess),
and/or have some config way of specifying an assignment preference
(e.g. "Whatever you do, don't share an IRQ between the two Gigabit
Ethernet cards", or "Whatever you do, don't share an interrupt
between the ethernet card and the disk controller", etc.).

I guess there is always card shuffling.  8-(.
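
The completion counting described in the quoted text above can be
sketched roughly as follows.  This is only an illustration using C11
atomics; struct shared_irq, irq_dispatch() and irq_isr_done() are
made-up names, not anything in the FreeBSD tree, and the "masked"
field stands in for whatever the real interrupt controller access
would be.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-IRQ state; none of these names come from FreeBSD. */
struct shared_irq {
	atomic_int  pending;	/* ISRs that still have to run to completion */
	atomic_bool masked;	/* true while the interrupt line is disabled */
};

/* Called when the shared interrupt fires: mask the line, arm the count. */
void
irq_dispatch(struct shared_irq *irq, int nhandlers)
{
	atomic_store(&irq->masked, true);
	atomic_store(&irq->pending, nhandlers);
	/* ...each device's ISR would be handed to a CPU here... */
}

/*
 * Called by each ISR when it finishes, on whichever CPU it ran.
 * Returns true for the one caller that took the count from 1 to 0;
 * that caller is the one that reenables the interrupt.
 */
bool
irq_isr_done(struct shared_irq *irq)
{
	/* atomic_fetch_sub returns the value held before the decrement. */
	if (atomic_fetch_sub(&irq->pending, 1) == 1) {
		atomic_store(&irq->masked, false);
		return (true);
	}
	return (false);
}
```

With two devices on the line, irq_dispatch(&irq, 2) arms the count,
and the line stays masked until the second irq_isr_done() call,
regardless of which CPU makes it.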


> > For NUMA, it really depends on the cluster architecture.  If you
> > have devices associated based on CPU clusters, that's one thing;
> > it's an easy call.  If you have it on the basis of adjacency, it
> > is not so easy a call, because the adjacency set for two different
> > devices can be only partially intersecting (i.e. dev 1 is associated
> > with CPU's 1,2,3,4, dev 2 with CPU's 5,6,7,8, and dev 3 with CPU's
> > 3,4,5,6).  This gets into the same issue as the Alpha, where the
> > CPU to take the interrupt has to complete the interrupt, only in
> > this case, you are talking about the associativity set.
> 
> I see.

I are a geek.  8-) 8-).  Actually, I have a real interest in seeing
FreeBSD make it onto NUMA hardware, because (IMO), it's a hop, skip,
and a jump to distributed processing, and I think the same problems
will need solving.

Of course, not everyone agrees with me, so assign a weighting
factor below 1.0 to my opinion on what's important in this respect.
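
To make the partially intersecting adjacency sets from the quoted
NUMA example concrete, here is a small sketch that treats each set
as a bitmask.  The helper names cpu_range() and may_complete() are
invented for the illustration, and CPUs are numbered from 1 only to
match the example in the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a mask with bits lo..hi set (hypothetical helper, hi < 32). */
uint32_t
cpu_range(unsigned lo, unsigned hi)
{
	uint32_t mask = 0;

	for (unsigned cpu = lo; cpu <= hi; cpu++)
		mask |= UINT32_C(1) << cpu;
	return (mask);
}

/* May this CPU complete an interrupt for a device with this set? */
bool
may_complete(uint32_t adjacency, unsigned cpu)
{
	return (((adjacency >> cpu) & 1) != 0);
}
```

With dev 1 = cpu_range(1, 4), dev 2 = cpu_range(5, 8) and
dev 3 = cpu_range(3, 6), dev 1 and dev 2 do not overlap at all, while
dev 3 overlaps each of them on two CPUs, so no single CPU grouping
covers all three devices.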


> > Yeah, most likely this won't be a problem, but then that's likely
> > the same thing that was thought when the current Alpha problem
> > was introduced.
> 
> Ack.

I didn't mean it was on purpose, just that it was probably not
something that someone really spent time thinking about before
changing the code.


> > > Currently shared interrupts also share an ithread.
> >
> > Yeah; this isn't very efficient with only 4 interrupts and a lot
> > of PCI cards.
> 
> After all you are writing to the alpha list :)
> On alphas we typically have 4 intlines per slot on the primary busses.
> Only small machines like LCA share 4 lines for all slots.
> As typical chips take only one intline we don't even share intlines
> over PCI-PCI bridges with up to 4 chips.
> But generally I see the point that when the handling for one device
> is blocked the service for devices sharing the same intline is also
> blocked - there is a good reason that blocking in device drivers has
> to be short.

I think the code is going to end up shared, even if you are
running the good DEC PCI chipsets, instead of the less able
Intel ones.

What that means to me is that when there is performance pressure,
this is the type of change that will be made for the Intel side,
and the Alpha will quit working (again).


> > Thread affinity as an explicit hard-coded attribute is probably
> > not the correct fix for the current Alpha problems.  It will make
> > it harder to do it right later (just like it's harder to fix a
> > foundation after you've built a house on it).
> 
> I can't argue on that yet, because I have understood for only a few
> hours why the return to PAL could be on a different CPU.

My problem with hard-coding is that it will leave artifacts; my
own answer to this would be to set a "don't migrate" flag, rather
than a "run only on CPU X" flag.  This will work if you have some
scheduler cooperation, and will fall out naturally without having
to change a lot of code.  It requires per CPU run queues, though
(gives you natural affinity anyway, where migration has to be done
explicitly if it's to happen).  The plus side is that you can get
rid of all the global scheduler locks, and even on migration, if
you push processes, rather than pull them, you can check for an
empty push queue without a lock, and any locking you do will end
up giving you at most a 2 CPU contention domain, instead of an N
CPU contention (minor details like a "figure of merit boost" while
in the scheduler, etc. can be handled later).
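
A rough sketch of the "don't migrate" idea with per CPU run queues
and push migration, as described above.  Everything here (struct
thr, struct runq, NCPU, the function names) is invented for the
illustration and is not the FreeBSD scheduler; locking is only
indicated in comments.

```c
#include <stdbool.h>
#include <stddef.h>

#define	NCPU	4		/* assumption for the sketch */

struct thr {
	struct thr *next;
	bool	    no_migrate;	/* the "don't migrate" flag */
};

struct runq {
	struct thr *head;	/* run queue; only the owning CPU touches it */
	struct thr *push;	/* threads pushed here by another CPU */
};

struct runq runqs[NCPU];

/*
 * Push the first migratable thread from CPU "from" to CPU "to".
 * Threads with no_migrate set are skipped and stay where they are.
 * Real code would hold the target's push-queue lock around the insert.
 */
struct thr *
runq_push_one(int from, int to)
{
	struct thr **pp = &runqs[from].head;
	struct thr *t;

	while (*pp != NULL && (*pp)->no_migrate)
		pp = &(*pp)->next;
	if ((t = *pp) == NULL)
		return (NULL);
	*pp = t->next;
	t->next = runqs[to].push;
	runqs[to].push = t;
	return (t);
}

/* The receiving CPU moves pushed threads onto its own run queue. */
void
runq_drain(int cpu)
{
	struct runq *rq = &runqs[cpu];
	struct thr *t;

	if (rq->push == NULL)	/* unlocked check for the common empty case */
		return;
	/* A lock taken here is contended by at most the pusher and the owner. */
	while ((t = rq->push) != NULL) {
		rq->push = t->next;
		t->next = rq->head;
		rq->head = t;
	}
}
```

The point of the sketch is that a flagged thread never leaves its
queue, so it returns on the CPU it started on without anyone naming
that CPU explicitly.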

If you wanted an initial "don't run on other CPU" flag, you could
get to 32 CPU's pretty fast with a "run on" bitmap, and this would
not be painful to migrate, like changing a lot of code would be,
as long as you had a CPU ID to use as a shift index (I would use
a 32 element array to get the bit value out with one add and
compare instead of a shift, but that's probably premature
optimization.  8-)).  Basically, you add an "int" to the proc
struct, and set the bit for the CPU you want to run on, and the
current scheduler leaves your process at the head and skips over
it.  Inefficient, but effective, for a proof-of-concept.  I don't
have a multiple CPU system that will run -current, or I'd send a
patch.  8-(.
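
For what it's worth, the "run on" bitmap could look something like
the sketch below.  This is not a patch against -current: struct
proc_sketch, can_run_on() and pick_next() are made-up names, and
treating an all-zero bitmap as "any CPU" is an assumption of the
sketch, not something the existing code does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the proc struct with the one added field. */
struct proc_sketch {
	struct proc_sketch *next;
	uint32_t	    run_on;	/* bit n set: may run on CPU n;
					   0 is assumed to mean "any CPU" */
};

/* cpu_id must be below 32 for the shift to be valid. */
bool
can_run_on(const struct proc_sketch *p, unsigned cpu_id)
{
	return (p->run_on == 0 ||
	    (p->run_on & (UINT32_C(1) << cpu_id)) != 0);
}

/*
 * Walk the run queue from the head and return the first process this
 * CPU may run.  Processes bound elsewhere are left in place and
 * skipped over.
 */
struct proc_sketch *
pick_next(struct proc_sketch *head, unsigned cpu_id)
{
	for (struct proc_sketch *p = head; p != NULL; p = p->next)
		if (can_run_on(p, cpu_id))
			return (p);
	return (NULL);
}
```

The 32 element lookup array mentioned above would replace the shift
in can_run_on(); the sketch keeps the shift because it is easier to
read.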

-- Terry
