From owner-freebsd-current  Wed Sep 20  6:59: 9 2000
Delivered-To: freebsd-current@freebsd.org
Received: from gidora.zeta.org.au (gidora.zeta.org.au [203.26.10.25])
	by hub.freebsd.org (Postfix) with SMTP id ACACF37B443
	for <current@FreeBSD.ORG>; Wed, 20 Sep 2000 06:59:00 -0700 (PDT)
Received: (qmail 23775 invoked from network); 20 Sep 2000 13:58:51 -0000
Received: from unknown (HELO bde.zeta.org.au) (203.2.228.102)
  by gidora.zeta.org.au with SMTP; 20 Sep 2000 13:58:51 -0000
Date: Thu, 21 Sep 2000 00:58:47 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-Sender: bde@besplex.bde.org
To: John Baldwin <jhb@FreeBSD.ORG>
Cc: current@FreeBSD.ORG, "Andrey A. Chernov" <ache@nagual.pp.ru>,
	smp@FreeBSD.ORG
Subject: Re: recent kernel, microuptime went backwards
In-Reply-To: <XFMail.000920004424.jhb@FreeBSD.org>
Message-ID: <Pine.BSF.4.21.0009210008050.3475-100000@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wed, 20 Sep 2000, John Baldwin wrote:

> On 19-Sep-00 Bruce Evans wrote:
> > It really does go backwards.  This is caused by the giant lock preventing
> > the clock interrupt task from running soon enough.  The giant lock can
> > also prevent the clock interrupt task from running often enough even
> > after booting.  E.g., "dd if=/dev/random of=/dev/null bs=large" does
> > several bad things.
> 
> It's not the Giant lock that is at fault.  We give up Giant during mi_switch().
>  The scheduling problem is in the way that the top-level scheduler runs.

Then the scheduler is more broken than I thought :-).

Initially there may be a locking problem as well as a scheduling problem.
Giving up Giant in the first mi_switch() is a bit late.  mi_switch() uses
microuptime(), and the clock task needs to be run before then to finish
initialization of the timecounter.

> We
> decide to schedule another process due to the timeslice ending during the clk
> interrupt thread.

How can this work?  The timeslice accounting stuff doesn't get updated until
the clock task runs.

> In the past, this was not run as a thread, so it ran, set
> the AST_* constant for needing a resched and then exited.  During doreti, we
> notice an AST is pending and call ast(). ast() calls userret() which notices
> that a resched is needed and calls mi_switch().  In the New World Order, when
> the clock interrupt occurs, we set the AST_* constant for every interrupt
> before returning from sched_ithd().  This results in the actual interrupt
> threads being scheduled from ast().

What should happen is for ast() to normally schedule the clock interrupt
(and other interrupts) immediately (unless they are blocked).  This doesn't
seem to be working, and I can't see how it can work, since there is nothing
except the giant lock to tell us whether interrupts are blocked, and the
giant lock is held most of the time in system mode.  Previously, cpl told
us, but cpl is no longer maintained.


> However, when the clk ithread finishes, it
> simply calls mi_switch() to enter the next process in ithd_loop().  The
> need_resched() that it sets isn't handled until the next call to userret()
> either via a hardware interrupt or a syscall return.  Thus, the problem isn't
> due to Giant, but rather to interrupt threads.

I think this is a different problem.  It is similar to a problem for
scheduling netisrs from non-interrupt context.  schednetisr() sets the
AST flag and some other flags.  Nothing looks at these flags until an
interrupt occurs or the process sleeps.  Previously, this was handled in
splx():

	s = splnet();		/* s == 0 in process context. */
	queue_net_output(...);
	schednetisr(...);
	splx(s);		/* Since s == 0, the netisr gets run here. */

Will this work again as soon as splx() is replaced by mtx_exit(), etc.?
We only have a few thousand spls to change :(.
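
To make the old splx() behaviour concrete, here is a minimal user-space
sketch of it.  It is only a toy model under stated assumptions, not the
kernel code: the toy_* functions, the SPL_* levels and the counters are
all hypothetical names invented for illustration, and cur_level merely
stands in for cpl.

```c
#include <assert.h>

#define SPL_NONE 0	/* hypothetical level: nothing is blocked */
#define SPL_NET  1	/* hypothetical level: netisrs are blocked */

static int cur_level = SPL_NONE;	/* toy stand-in for cpl */
static int netisr_pending;		/* set by toy_schednetisr() */
static int netisr_runs;			/* times the toy netisr has run */

/* Raise the level to block netisrs; return the previous level. */
static int
toy_splnet(void)
{
	int s = cur_level;

	if (cur_level < SPL_NET)
		cur_level = SPL_NET;
	return (s);
}

/* Only mark the netisr as pending; nothing runs here. */
static void
toy_schednetisr(void)
{
	netisr_pending = 1;
}

/* Restore the level; run the pending netisr if it is now unblocked. */
static void
toy_splx(int s)
{
	assert(s == SPL_NONE || s == SPL_NET);
	cur_level = s;
	if (cur_level < SPL_NET && netisr_pending) {
		netisr_pending = 0;
		netisr_runs++;		/* the "netisr" runs at this point */
	}
}
```

In this model the pending work runs exactly when the level drops back
to SPL_NONE, so a nested toy_splnet()/toy_splx() pair defers it to the
outermost toy_splx().  A plain lock release has no such hook unless one
is added.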

> As for the microuptime()
> messages on boot, they only occur here on a UP kernel.  On an SMP kernel I
> don't get them.  Also, they always occur during mi_switch() when an interrupt
> thread is finishing and going back to sleep.  The first such thread to be run
> to generate that error message is the irq0: clk ithread, so the clk ithread is
> running fine.

They are very timing dependent, and probably also very task-mix
dependent.  The primary cause of microuptime() going backwards is
tv_nsec overflowing if the system takes longer than 2^32 nsec (about
4.3 seconds) between the initialization of the timecounter and the
timecounter maintenance for the first clock interrupt.  On one of my
systems, the first thread to call mi_switch() is the generic thread
(proc0?) that executes run_interrupt_driven_hooks().  mi_switch() is
called for the first time when the ata hook goes to sleep.  Things
would be a little different for SMP.  Hopefully another cpu handles
the clock interrupt.
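
The arithmetic behind the overflow can be sketched as follows.  This is
a toy model, not the timecounter code: store_nsec32() and
went_backwards() are hypothetical helpers, and the only assumption is
that a nanosecond offset ends up in a 32-bit quantity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a nanosecond offset stored in a 32-bit field. */
static uint32_t
store_nsec32(uint64_t true_nsec)
{
	return ((uint32_t)true_nsec);	/* silently drops bits 32 and up */
}

/* Hypothetical check: does the stored offset appear to run backwards? */
static int
went_backwards(uint64_t earlier_nsec, uint64_t later_nsec)
{
	assert(earlier_nsec <= later_nsec);	/* real time moves forward */
	return (store_nsec32(later_nsec) < store_nsec32(earlier_nsec));
}
```

With these helpers, a true offset of 4.0 seconds (4000000000 nsec) is
stored intact, but 4.5 seconds (4500000000 nsec) wraps to 205032704
nsec, so the later reading compares as earlier than the first one.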

Bruce

