From owner-freebsd-smp  Fri Apr 28 18:20:29 2000
Delivered-To: freebsd-smp@freebsd.org
Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135])
	by hub.freebsd.org (Postfix) with ESMTP id 8525437B824
	for <freebsd-smp@FreeBSD.ORG>; Fri, 28 Apr 2000 18:20:24 -0700 (PDT)
	(envelope-from tlambert@usr08.primenet.com)
Received: (from daemon@localhost)
	by smtp05.primenet.com (8.9.3/8.9.3) id SAA19968;
	Fri, 28 Apr 2000 18:20:19 -0700 (MST)
Received: from usr08.primenet.com(206.165.6.208)
 via SMTP by smtp05.primenet.com, id smtpdAAARcai_M; Fri Apr 28 18:20:12 2000
Received: (from tlambert@localhost)
	by usr08.primenet.com (8.8.5/8.8.5) id SAA11745;
	Fri, 28 Apr 2000 18:20:14 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <200004290120.SAA11745@usr08.primenet.com>
Subject: Re: hlt instructions and temperature issues
To: dillon@apollo.backplane.com (Matthew Dillon)
Date: Sat, 29 Apr 2000 01:20:13 +0000 (GMT)
Cc: tlambert@primenet.com (Terry Lambert),
	jgowdy@home.com (Jeremiah Gowdy), smp@csn.net (Steve Passe),
	jim@thehousleys.net (James Housley), freebsd-smp@FreeBSD.ORG
In-Reply-To: <200004282240.PAA14200@apollo.backplane.com> from "Matthew Dillon" at Apr 28, 2000 03:40:33 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-smp@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> :Others have complained about the "air gap" between the "sti" and
> :the "hlt".  I think that this is not really an issue, but it's
> :very easy to rectify this, if it were.  It's clearly not an issue
> :if the TPR claims are correct, and the new code merely removes
> :the "#ifdef SMP/#endif" directives.
> 
>     This is definitely an issue.  If both cpu's go idle the interrupt
>     that's supposed to wake one of them up (e.g. some event that causes
>     a process to be woken up) can get lost in the air-gap.

I said that because I think access to that code may be
serialized until the priority is dropped.


>     It's trivial to rectify, so we should just do it :-).  Besides, you
>     can't mess with the APIC stuff with interrupts enabled anyway because,
>     again, an interrupt might occur that alters the state of the system
>     just before or just after you modify the APIC priority.  The sequence
>     of events should be:  (1) mess with apic (2) sti (3) hlt.

Right, very easy to rectify.
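
To make the ordering argument concrete, here is a toy user-space
model in C. It is not kernel code. It assumes the documented x86
behaviour that "sti" does not let an interrupt in until after the
instruction that follows it, so "sti; hlt" leaves no gap. The
struct, the function names, and the injected interrupt are all
invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one CPU; every name here is hypothetical. */
struct cpu_model {
    bool intr_enabled;  /* models the IF flag */
    bool irq_pending;   /* interrupt raised but not yet delivered */
    bool work_ready;    /* set by the interrupt handler */
    bool halted;        /* CPU is sitting in hlt */
};

/* Deliver a pending interrupt, but only if interrupts are enabled. */
static void deliver(struct cpu_model *c)
{
    assert(c != NULL);
    if (c->intr_enabled && c->irq_pending) {
        c->irq_pending = false;
        c->work_ready = true;
        c->halted = false;      /* an interrupt ends hlt */
    }
}

/*
 * Racy order: check for work with interrupts enabled, then hlt.
 * irq_in_gap injects the wakeup interrupt between check and hlt.
 * Returns true if the CPU ends up halted although work is ready.
 */
bool idle_racy(bool irq_in_gap)
{
    struct cpu_model c = { .intr_enabled = true };

    if (c.work_ready)           /* check: nothing to run yet */
        return false;
    if (irq_in_gap) {           /* the interrupt lands in the gap */
        c.irq_pending = true;
        deliver(&c);            /* handled at once: work_ready set */
    }
    c.halted = true;            /* hlt */
    deliver(&c);                /* nothing pending; stays halted */
    return c.halted && c.work_ready;
}

/*
 * Suggested order: interrupts off, check (and any APIC work),
 * then sti; hlt.  The interrupt stays pending while interrupts
 * are off and is delivered once hlt has started, waking the CPU.
 */
bool idle_safe(bool irq_in_gap)
{
    struct cpu_model c = { .intr_enabled = false };  /* cli */

    if (c.work_ready)
        return false;
    if (irq_in_gap) {
        c.irq_pending = true;
        deliver(&c);            /* blocked: interrupts are off */
    }
    c.intr_enabled = true;      /* sti: effective after next insn */
    c.halted = true;            /* hlt */
    deliver(&c);                /* pending interrupt wakes the CPU */
    return c.halted && c.work_ready;
}
```

In this model idle_racy(true) reports a CPU stuck in hlt with work
waiting, while idle_safe(true) does not.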


>     These windows are small, but as we have seen an ample number of times
>     even one-instruction windows can get hit when the code in
>     question is being run thousands of times a second.

Ouch!  Still smarting from the lock stuff.  8-) 8-).


>     I like the HLT + IPI idea, but none of the patches to date
>     really cover the bases and switching performance is not going
>     to be as good as when you don't have the HLT due to the
>     overhead of sending the IPIs

This is a non-issue, I think.  The IPIs will be sent at a time
when the sending processor would otherwise be going idle, so the
cost is no worse, I think, than the hit FreeBSD already takes
from "hlt" in the non-SMP case.


>     and having to keep track of which cpu's are in a HLT'd state
>     and which are not (so you don't send IPI's to all cpu's
>     gratuitously).  

A trivial, gross approximation here would be a 32-bit bitmap,
one bit per processor, in which each processor XORs only its
own bit in memory.

The only danger here is a window in which someone (holding the
BGL) leaving the scheduler sends a spurious IPI between the
wakeup and the XOR operation.

You could fix this by having the processor set its bit when it
is going to sleep, and having the sender clear the bit when the
IPI is about to be sent.
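
A sketch of the two bookkeeping schemes in C, assuming a 32-bit
map with one bit per processor. All names are invented, and the
code is deliberately single-threaded: it shows only the bit logic,
not the cache or locking behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define MAXCPU 32

/* XOR variant: a CPU flips its own bit on sleep and again on wakeup. */
uint32_t xor_toggle(uint32_t map, unsigned cpu)
{
    assert(cpu < MAXCPU);
    return map ^ (UINT32_C(1) << cpu);
}

/* Set/clear variant, part 1: the sleeper sets its own bit. */
uint32_t mark_idle(uint32_t map, unsigned cpu)
{
    assert(cpu < MAXCPU);
    return map | (UINT32_C(1) << cpu);
}

/* Set/clear variant, part 2: the sender clears the bit before the IPI. */
uint32_t claim_for_ipi(uint32_t map, unsigned cpu)
{
    assert(cpu < MAXCPU);
    return map & ~(UINT32_C(1) << cpu);
}

/* Nonzero if the given CPU is marked as halted. */
int is_idle(uint32_t map, unsigned cpu)
{
    assert(cpu < MAXCPU);
    return (int)((map >> cpu) & 1u);
}

/* Lowest-numbered CPU marked idle, or -1 if none is. */
int first_idle(uint32_t map)
{
    for (unsigned cpu = 0; cpu < MAXCPU; cpu++)
        if (is_idle(map, cpu))
            return (int)cpu;
    return -1;
}
```

With set/clear, a sender calls first_idle(), then claim_for_ipi(),
then sends the IPI; a second sender arriving right afterwards finds
no bit set and sends nothing. Clearing an already clear bit is
harmless, whereas a second XOR would set the bit again.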


>     This is not a trivial problem because we cannot afford to
>     have N cpu's all trying to do locked bitset instructions on
>     the same memory location in order to go idle -- that alone
>     will create big latencies. 

There is a lot of current SMP code that assumes MESI cache
coherency.  Adding to this will not be an issue.  The XOR
instructions will not need to be locked, I believe, since the
cache coherency notifications should handle synchronization.

As I said, the bit will only ever be cleared with the BGL
held.

If you want to get gross, you can set the bit in the scheduler
with the BGL held on the processor that's about to go idle,
which would take care of your objection: the bitmap is only
ever manipulated with the BGL held, and the manipulation is
done opportunistically, so there is no additional locking
overhead.  You would, of course, have to undo this hack when
you went to per-CPU ready-to-run queues.  But realize that
per-CPU ready-to-run queues already magically have an IPI
call location reserved in the code which migrates processes
from one CPU's ready-to-run queue to another's.  8-).
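
A minimal sketch of the BGL-held variant in C, with the lock
modelled as a plain flag so that the "only touched with the BGL
held" rule can be checked with assertions. The struct and function
names are hypothetical and do not correspond to anything in the
FreeBSD source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sched_state {
    bool     bgl_held;   /* stand-in for the Big Giant Lock */
    uint32_t idle_map;   /* one bit per CPU, plain loads and stores */
};

void bgl_enter(struct sched_state *s)
{
    assert(!s->bgl_held);
    s->bgl_held = true;
}

void bgl_exit(struct sched_state *s)
{
    assert(s->bgl_held);
    s->bgl_held = false;
}

/* Called in the scheduler by a CPU that found nothing to run. */
void sched_going_idle(struct sched_state *s, unsigned cpu)
{
    assert(s->bgl_held && cpu < 32);
    s->idle_map |= UINT32_C(1) << cpu;
}

/*
 * Called by a CPU that is leaving the scheduler and wants to wake
 * another one.  Clears the chosen bit before the IPI would be sent.
 * Returns the CPU to send the IPI to, or -1 if none is idle.
 */
int sched_pick_wakeup(struct sched_state *s)
{
    assert(s->bgl_held);
    for (unsigned cpu = 0; cpu < 32; cpu++) {
        uint32_t bit = UINT32_C(1) << cpu;
        if (s->idle_map & bit) {
            s->idle_map &= ~bit;
            return (int)cpu;
        }
    }
    return -1;
}
```

Because both routines run only inside a section that already holds
the lock, the map needs no locked instructions of its own in this
model.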


>     We should consider testing other possible solutions, such as having a
>     really tight idle loop that stays in the same cacheline and thus does
>     not greatly exercise the cpu's circuitry, resulting in less heat without
>     having to HLT.  

I think that going off-chip is the least of the heat dissipation
worries; it strikes me that the line drivers are not where the
heat is coming from, and that running over the same cache line
would be a very bad thing.

The other problem with this idea is that you rely on a shootdown
notification for a data change in order to exit the loop, and
that is, de facto, an IPI in all but name.


>     For example, if we can remove *ALL* memory writes from the
>     best-case idle loop it should make a huge difference in heat
>     dissipation without having to resort to HLT!  Right now we
>     make a number of subroutine calls (such as to procrunnable())
>     which will result in external bus cycles.  If those can be
>     inlined it should have a noticeable effect.

You can give it a try, but I don't think it will have the effect
you think that it will.

I think the numbers for a 2 CPU system with Loqui's patch were
extremely exaggerated by the CPU stalling-until-interrupt issue,
and that the heat numbers will not be nearly so good, even with a
totally "correct" solution, because of this.  I expect your
approach would result in temperatures nearly as high as, if not
downright indistinguishable from, the measured numbers for an
unmodified system.


This is really not an issue, anyway, except for power consumption
and heat dissipation critical environments, but that said, if it's
for an SMP box going into a colocation server room rack somewhere
in a 1U case, this could be significant for some percentage of
users, so maybe it's worth still talking about.  8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

