Date: Sat, 22 Mar 2003 07:58:22 +1100 (EST)
From: Bruce Evans
To: Jake Burkholder
Cc: FreeBSD current users
Subject: Re: question on profiling code
In-Reply-To: <20030217173211.G81102@locore.ca>
Message-ID: <20030322062729.C3753@gamplex.bde.org>
References: <20030217021512.O63597@locore.ca>
	<20030218075519.Y7132-100000@gamplex.bde.org>
	<20030217173211.G81102@locore.ca>

Long ago, On Mon, 17 Feb 2003, Jake Burkholder wrote:

[I quoted everything since this thread is so old.  My main point is at the
end.  It restarts an older part of this thread.]

> > Can you explain how fuswintr() and suswintr() work on sparc64's?  They
> > seem to cause traps if the user counter is not mapped, and I can't see
> > where the traps are handled.  ia64/trap.c has a comment about where these
> > traps are handled, but has dummies for fuswintr() and suswintr() so the
> > traps never occur.  Depending on traps in fast interrupt handlers is
> > a bug IMO.  It extends the scope of the fast interrupt handler to the
> > trap handler, and it is difficult to limit this scope and verify the
> > locking for it.
>
> Ok.  Sparc64 uses "lazy trap handling", similar to what I saw you'd done
> in your sys.dif.  The functions that access user space are delimited by
> labels, and in trap and trap_pfault we check to see if the pc is inside
> of the labels.  fuswintr and suswintr are bracketed by fs_nofault_intr_begin
> and fs_nofault_intr_end, which trap_pfault checks for specifically before
> doing much of anything:
>
> 	if (ctx != TLB_CTX_KERNEL) {
> 		if ((tf->tf_tstate & TSTATE_PRIV) != 0 &&
> 		    (tf->tf_tpc >= (u_long)fs_nofault_intr_begin &&
> 		     tf->tf_tpc <= (u_long)fs_nofault_intr_end)) {
> 			tf->tf_tpc = (u_long)fs_fault;
> 			tf->tf_tnpc = tf->tf_tpc + 4;
> 			return (0);
> 		}
> 		... handle fault
>
> ctx != TLB_CTX_KERNEL is akin to va < VM_MAXUSER_ADDRESS (the address spaces
> overlap on sparc64 so we can only rely on tlb context numbers).  Note that
> the range bracketed by the fs_nofault_intr_* is included in the fs_nofault
> range, which handles alignment or data access exception faults.

I see.  This is much cleaner than my version, since I use lots of equality
checks instead of the range check, and need labels for them all.

> It does take some special care in trap() and trap_pfault() not to access
> important data, or acquire any locks before this test.  Non-trivial
> interrupts are still masked here, which buys us something.  Probably the
> vmspace pointer should not be counted on in this context, but I don't think
> it will ever be invalid for the current process, especially since the
> original interrupt occurred in user mode.
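For reference, here is the same nofault-bracket idea in machine-independent
terms, as a minimal C sketch.  All names below (nofault_intr_begin/_end,
fault_fixup, sketch_trapframe) are invented for illustration; the real
fuswintr()/suswintr() and their fixup labels are written in assembler in
the ports being discussed.

extern char nofault_intr_begin[];	/* start of the bracketed user access code */
extern char nofault_intr_end[];		/* end of the bracketed user access code */
extern char fault_fixup[];		/* resuming here makes the access return -1 */

struct sketch_trapframe {
	unsigned long pc;		/* saved faulting instruction pointer */
};

/*
 * Called very early from the page fault handler, before any locks are
 * taken, while interrupts are still masked.  Returns nonzero if the fault
 * came from the bracketed code and was handled by redirecting the PC to
 * the fixup point instead of paging anything in.
 */
static int
nofault_intr_recover(struct sketch_trapframe *tf, int kernel_mode)
{
	if (kernel_mode &&
	    tf->pc >= (unsigned long)nofault_intr_begin &&
	    tf->pc <= (unsigned long)nofault_intr_end) {
		tf->pc = (unsigned long)fault_fixup;
		return (1);
	}
	return (0);
}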
The i386 trap() wants to enable interrupts early since this reduces its
complications, but this is wrong even for its existing lazy trap handling.
My version has even more complications in an attempt to mask interrupts
earlier only in the non-lazy trap handling cases.

> The only locking that's required that I can see is that PS_PROFIL not be
> set when the profiling buffer is invalid.  But all that will happen is that
> attempts to update the profiling buffer will be ignored.  Technically the
> process should get a signal but addupc_task doesn't check the return value
> of copyin/out (oops).

addupc_task still doesn't check.  It wouldn't hurt for profil() to check
the buffer up front using useracc().  But silly processes could still
unmap the buffer, so copyin/out should check too.  Strictly, we need
enough locking to prevent invalid profiling buffers being used once the
process has seen that they have been invalidated.  Even silly processes
can reasonably expect to use their profiling buffers for something else
after they have turned off profiling.

Back to an older part of this thread:

> On Mon, 17 Feb 2003, Jake Burkholder wrote:
>
> > Apparently, On Mon, Feb 17, 2003 at 05:35:09PM +1100,
> > > Now there is a stronger reason: clock interrupt handling is "fast",
> > > so it normally returns to user mode without going near ast(), and the
> > > counter is not updated until the next non-fast interrupt or syscall.
> >
> > In FreeBSD "fast" interrupts do handle asts, on i386 they return through
> > doreti
>
> Oops.

Actually, this doesn't always help.  ast() is only called on return to
user mode.  doreti doesn't check TDF_NEEDRESCHED on return to kernel mode.
Thus the following code in ithread_schedule() only works right if the
interrupt occurred in user mode:

% 	if (TD_AWAITING_INTR(td)) {
% 		CTR2(KTR_INTR, "%s: setrunqueue %d", __func__, p->p_pid);
% 		TD_CLR_IWAIT(td);
% 		setrunqueue(td);
% 		if (do_switch &&
% 		    (ctd->td_critnest == 1) ) {
% 			KASSERT((TD_IS_RUNNING(ctd)),
% 			    ("ithread_schedule: Bad state for curthread."));
% 			ctd->td_proc->p_stats->p_ru.ru_nivcsw++;
% 			if (ctd->td_kse->ke_flags & KEF_IDLEKSE)
% 				ctd->td_state = TDS_CAN_RUN; /* XXXKSE */
% 			mi_switch();
% 		} else {
% 			curthread->td_flags |= TDF_NEEDRESCHED;
  			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
% 		}
% 	} else {

There is only a problem when we didn't switch immediately because
ctd->td_critnest != 1 (apart from the 3 style bugs in the td_critnest
test).  The most common case is when a fast interrupt handler calls
swi_sched(), as happens quite often when hardclock() schedules softclock().
Fast interrupt handlers bump td_critnest (this is part of their locking)
and ithread_schedule() bumps it again.  Thus td_critnest is always >= 2,
and swi_sched() never works right within fast interrupt handlers.  It
mainly wastes a lot of time to do little more than set TDF_NEEDRESCHED
as above, and this setting doesn't work right either unless the fast
interrupt occurred in user mode -- then the switch is done in ast() on
return to user mode, but if the return is to kernel mode then doreti
just returns.

At least on i386's, the unpend() stuff at the end of Xfastintr fixes up
some of the problems accidentally: if a normal interrupt happened to occur
while td_critnest was bumped, then we call i386_unpend() to handle it, and
i386_unpend() handles switching to higher-priority runnable tasks as a
side effect of switching for the normal interrupts, although its support
for software interrupts is limited to forwarding clock interrupts in the
SMP case.
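To make the td_critnest arithmetic concrete, here is a small stand-alone
toy model (plain userland C with invented names; it models only the
counter, nothing else) of why the mi_switch() branch above can never be
taken when swi_sched() is called from a fast interrupt handler.

#include <stdio.h>

static int td_critnest;		/* models curthread->td_critnest */

static void
ithread_schedule_model(const char *caller)
{
	td_critnest++;			/* models ithread_schedule()'s own bump */
	if (td_critnest == 1)
		printf("%s: mi_switch(); softclock would run immediately\n",
		    caller);
	else
		printf("%s: td_critnest == %d; only TDF_NEEDRESCHED is set,\n"
		    "    and doreti ignores it on return to kernel mode\n",
		    caller, td_critnest);
	td_critnest--;
}

int
main(void)
{
	/* swi_sched() from ordinary kernel context: the switch happens. */
	td_critnest = 0;
	ithread_schedule_model("ordinary context");

	/*
	 * swi_sched() from a fast interrupt handler such as hardclock():
	 * the handler already bumped td_critnest as part of its locking,
	 * so the count is at least 2 inside ithread_schedule().
	 */
	td_critnest = 1;
	ithread_schedule_model("fast interrupt handler");
	return (0);
}

The second case is exactly the hardclock() -> softclock() path measured
below.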
The latencies caused by this bug are probably not easy to measure properly.
I just hacked up the following to measure the latency from hardclock() to
softclock():

%%%
Index: kern_clock.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_clock.c,v
retrieving revision 1.155
diff -u -2 -r1.155 kern_clock.c
--- kern_clock.c	4 Mar 2003 23:19:54 -0000	1.155
+++ kern_clock.c	5 Mar 2003 11:55:15 -0000
@@ -70,4 +70,7 @@
 #endif
 
+int	softclock_sched_pending;
+struct timespec softclock_sched;
+
 #ifdef DEVICE_POLLING
 extern void hardclock_device_poll(void);
@@ -225,6 +228,9 @@
 	 * callout_lock held; incorrect locking order.
 	 */
-	if (need_softclock)
+	if (need_softclock) {
+		softclock_sched_pending = 1;
+		nanotime(&softclock_sched);
 		swi_sched(softclock_ih, 0);
+	}
 }
Index: kern_timeout.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_timeout.c,v
retrieving revision 1.75
diff -u -2 -r1.75 kern_timeout.c
--- kern_timeout.c	1 Feb 2003 10:06:40 -0000	1.75
+++ kern_timeout.c	6 Mar 2003 14:33:27 -0000
@@ -47,4 +47,7 @@
 #include
 
+extern int softclock_sched_pending;
+extern struct timespec softclock_sched;
+
 /*
  * TODO:
@@ -139,4 +142,18 @@
 	static uint64_t maxdt = 18446744073709551LL;	/* 1 msec */
 #endif
+
+	mtx_lock_spin(&sched_lock);
+	if (softclock_sched_pending) {
+		struct timespec ts;
+
+		nanotime(&ts);
+		timespecsub(&ts, &softclock_sched);
+		softclock_sched_pending = 0;
+		mtx_unlock_spin(&sched_lock);
+		if (ts.tv_sec > 0 || ts.tv_nsec > 1000000)
+			printf("large softclock lag %d.%09ld\n",
+			    ts.tv_sec, ts.tv_nsec);
+	} else
+		mtx_unlock_spin(&sched_lock);
 #ifndef MAX_SOFTCLOCK_STEPS
%%%

This gives results like:

% Mar 22 06:19:00 besplex kernel: large softclock lag 0.012118505
% Mar 22 06:19:00 besplex kernel: large softclock lag 0.018373392
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.005993315
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.003627271
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.050219103
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.022892766
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.002526595
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.006122069
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.011606568
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.007590287

softclock() may be delayed legitimately by higher-priority threads running
first, but I think all of the lags shown above are caused by the bug, since
none are reported for my kernel, where I think the only significant
difference is that clock interrupt handlers are non-fast.

Possible fixes:
- doing scheduling in fast interrupt handlers is a bug IMO.  It is not
  permitted in my kernel (%fs == 0).  (mtx_lock*() is not permitted either.)
- requests for things to be done in doreti and/or at the end of X*intr*
  basically need to set a simple condition that is checked there.  In
  RELENG_4, the condition was (ipending & ~cpl).  The condition shouldn't
  involve searching queues.  (A rough sketch of this idea is below.)
- critical_exit() should run interrupts that it unblocked.  It does this
  for hardware interrupts, but not for SWIs that were scheduled by calling
  swi_sched() while in the critical region.  Hopefully this is never done.

Bruce
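A rough sketch (invented names, not proposed code) of the "simple
condition" idea in the second fix above: the scheduling path only sets a
bit in a pending mask, and the interrupt return path tests the mask
against the blocked set with a couple of instructions, along the lines of
RELENG_4's (ipending & ~cpl) test, instead of searching queues.

#include <stdint.h>

static volatile uint32_t ipending;	/* bits set when an SWI/ithread run is requested */
static volatile uint32_t cpl;		/* currently blocked ("masked") levels */

/* Called from a fast interrupt handler: cheap, no locks, no queue walks. */
static void
swi_pend(unsigned int bit)
{
	ipending |= 1u << bit;		/* a real SMP kernel would need an atomic op */
}

/* Checked on the interrupt return path (doreti / end of X*intr*);
   nonzero means "go run the pending handlers now". */
static int
unblocked_swi_pending(void)
{
	return ((ipending & ~cpl) != 0);
}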