Date: Sat, 22 Mar 2003 07:58:22 +1100 (EST)
From: Bruce Evans
To: Jake Burkholder
Cc: FreeBSD current users
Subject: Re: question on profiling code
In-Reply-To: <20030217173211.G81102@locore.ca>
Message-ID: <20030322062729.C3753@gamplex.bde.org>
References: <20030217021512.O63597@locore.ca>
	<20030218075519.Y7132-100000@gamplex.bde.org>
	<20030217173211.G81102@locore.ca>

Long ago, On Mon, 17 Feb 2003, Jake Burkholder wrote:

[I quoted everything since this thread is so old.  My main point is at the
end.  It restarts an older part of this thread.]

> > Can you explain how fuswintr() and suswintr() work on sparc64's?  They
> > seem to cause traps if the user counter is not mapped, and I can't see
> > where the traps are handled.  ia64/trap.c has a comment about where these
> > traps are handled, but has dummies for fuswintr() and suswintr() so the
> > traps never occur.  Depending on traps in fast interrupt handlers is
> > a bug IMO.  It extends the scope of the fast interrupt handler to the
> > trap handler, and it is difficult to limit this scope and verify the
> > locking for it.
>
> Ok.  Sparc64 uses "lazy trap handling", similar to what I saw you'd done
> in your sys.dif.  The functions that access user space are delimited by
> labels, and in trap and trap_pfault we check to see if the pc is inside
> of the labels.  fuswintr and suswintr are bracketed by fs_nofault_intr_begin
> and fs_nofault_intr_end, which trap_pfault checks for specifically before
> doing much of anything:
>
> 	if (ctx != TLB_CTX_KERNEL) {
> 		if ((tf->tf_tstate & TSTATE_PRIV) != 0 &&
> 		    (tf->tf_tpc >= (u_long)fs_nofault_intr_begin &&
> 		     tf->tf_tpc <= (u_long)fs_nofault_intr_end)) {
> 			tf->tf_tpc = (u_long)fs_fault;
> 			tf->tf_tnpc = tf->tf_tpc + 4;
> 			return (0);
> 		}
> 		... handle fault
>
> ctx != TLB_CTX_KERNEL is akin to va < VM_MAXUSER_ADDRESS (the address spaces
> overlap on sparc64 so we can only rely on tlb context numbers).  Note that
> the range bracketed by the fs_nofault_intr_* is included in the fs_nofault
> range, which handles alignment or data access exception faults.

I see.  This is much cleaner than my version, since I use lots of equality
checks instead of the range check, and need labels for them all.

> It does take some special care in trap() and trap_pfault() not to access
> important data, or acquire any locks before this test.  Non-trivial
> interrupts are still masked here, which buys us something.  Probably the
> vmspace pointer should not be counted on in this context, but I don't think
> it will ever be invalid for the current process, especially since the
> original interrupt occurred in user mode.
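For reference, here is the same nofault-bracket idea in machine-independent
terms, as a minimal C sketch.  All names below (nofault_intr_begin/_end,
fault_fixup, sketch_trapframe) are invented for illustration; the real
fuswintr()/suswintr() and their fixup labels are written in assembler in
the ports being discussed.

extern char nofault_intr_begin[];	/* start of the bracketed user access code */
extern char nofault_intr_end[];		/* end of the bracketed user access code */
extern char fault_fixup[];		/* resuming here makes the access return -1 */

struct sketch_trapframe {
	unsigned long pc;		/* saved faulting instruction pointer */
};

/*
 * Called very early from the page fault handler, before any locks are
 * taken, while interrupts are still masked.  Returns nonzero if the fault
 * came from the bracketed code and was handled by redirecting the PC to
 * the fixup point instead of paging anything in.
 */
static int
nofault_intr_recover(struct sketch_trapframe *tf, int kernel_mode)
{
	if (kernel_mode &&
	    tf->pc >= (unsigned long)nofault_intr_begin &&
	    tf->pc <= (unsigned long)nofault_intr_end) {
		tf->pc = (unsigned long)fault_fixup;
		return (1);
	}
	return (0);
}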
The i386 trap() wants to enable interrupts early since this reduces its
complications, but this is wrong even for its existing lazy trap handling.
My version has even more complications in an attempt to mask interrupts
earlier only in the non-lazy trap handling cases.

> The only locking that's required that I can see is that PS_PROFIL not be
> set when the profiling buffer is invalid.  But all that will happen is that
> attempts to update the profiling buffer will be ignored.  Technically the
> process should get a signal but addupc_task doesn't check the return value
> of copyin/out (oops).

addupc_task still doesn't check.  It wouldn't hurt for profil() to check
the buffer up front using useracc().  But silly processes could still
unmap the buffer, so copyin/out should check too.  Strictly, we need
enough locking to prevent invalid profiling buffers being used once the
process has seen that they have been invalidated.  Even silly processes
can reasonably expect to use their profiling buffers for something else
after they have turned off profiling.

Back to an older part of this thread:

> On Mon, 17 Feb 2003, Jake Burkholder wrote:
>
> > Apparently, On Mon, Feb 17, 2003 at 05:35:09PM +1100,
> > > Now there is a stronger reason: clock interrupt handling is "fast",
> > > so it normally returns to user mode without going near ast(), and the
> > > counter is not updated until the next non-fast interrupt or syscall.
> >
> > In FreeBSD "fast" interrupts do handle asts, on i386 they return through
> > doreti
>
> Oops.

Actually, this doesn't always help.  ast() is only called on return to
user mode.  doreti doesn't check TDF_NEEDRESCHED on return to kernel mode.
Thus the following code in ithread_schedule() only works right if the
interrupt occurred in user mode:

% 	if (TD_AWAITING_INTR(td)) {
% 		CTR2(KTR_INTR, "%s: setrunqueue %d", __func__, p->p_pid);
% 		TD_CLR_IWAIT(td);
% 		setrunqueue(td);
% 		if (do_switch &&
% 		    (ctd->td_critnest == 1) ) {
% 			KASSERT((TD_IS_RUNNING(ctd)),
% 			    ("ithread_schedule: Bad state for curthread."));
% 			ctd->td_proc->p_stats->p_ru.ru_nivcsw++;
% 			if (ctd->td_kse->ke_flags & KEF_IDLEKSE)
% 				ctd->td_state = TDS_CAN_RUN; /* XXXKSE */
% 			mi_switch();
% 		} else {
% 			curthread->td_flags |= TDF_NEEDRESCHED;
  			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
% 		}
% 	} else {

There is only a problem when we didn't switch immediately because
ctd->td_critnest != 1 (apart from the 3 style bugs in the td_critnest
test).  The most common case is when a fast interrupt handler calls
swi_sched(), as happens quite often when hardclock() schedules softclock().
Fast interrupt handlers bump td_critnest (this is part of their locking)
and ithread_schedule() bumps it again.  Thus td_critnest is always >= 2,
and swi_sched() never works right within fast interrupt handlers.  It
mainly wastes a lot of time to do little more than set TDF_NEEDRESCHED
as above, and this setting doesn't work right either unless the fast
interrupt occurred in user mode -- then the switch is done in ast() on
return to user mode, but if the return is to kernel mode then doreti
just returns.

At least on i386's, the unpend() stuff at the end of Xfastintr fixes up
some of the problems accidentally: if a normal interrupt happened to occur
while td_critnest was bumped, then we call i386_unpend() to handle it, and
i386_unpend() handles switching to higher-priority runnable tasks as a
side effect of switching for the normal interrupts, although its support
for software interrupts is limited to forwarding clock interrupts in the
SMP case.
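To make the td_critnest arithmetic concrete, here is a small stand-alone
toy model (plain userland C with invented names; it models only the
counter, nothing else) of why the mi_switch() branch above can never be
taken when swi_sched() is called from a fast interrupt handler.

#include <stdio.h>

static int td_critnest;		/* models curthread->td_critnest */

static void
ithread_schedule_model(const char *caller)
{
	td_critnest++;			/* models ithread_schedule()'s own bump */
	if (td_critnest == 1)
		printf("%s: mi_switch(); softclock would run immediately\n",
		    caller);
	else
		printf("%s: td_critnest == %d; only TDF_NEEDRESCHED is set,\n"
		    "    and doreti ignores it on return to kernel mode\n",
		    caller, td_critnest);
	td_critnest--;
}

int
main(void)
{
	/* swi_sched() from ordinary kernel context: the switch happens. */
	td_critnest = 0;
	ithread_schedule_model("ordinary context");

	/*
	 * swi_sched() from a fast interrupt handler such as hardclock():
	 * the handler already bumped td_critnest as part of its locking,
	 * so the count is at least 2 inside ithread_schedule().
	 */
	td_critnest = 1;
	ithread_schedule_model("fast interrupt handler");
	return (0);
}

The second case is exactly the hardclock() -> softclock() path measured
below.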
The latencies caused by this bug are probably not easy to measure properly.
I just hacked up the following to measure the latency from hardclock() to
softclock():

%%%
Index: kern_clock.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_clock.c,v
retrieving revision 1.155
diff -u -2 -r1.155 kern_clock.c
--- kern_clock.c	4 Mar 2003 23:19:54 -0000	1.155
+++ kern_clock.c	5 Mar 2003 11:55:15 -0000
@@ -70,4 +70,7 @@
 #endif
 
+int	softclock_sched_pending;
+struct timespec softclock_sched;
+
 #ifdef DEVICE_POLLING
 extern void hardclock_device_poll(void);
@@ -225,6 +228,9 @@
 	 * callout_lock held; incorrect locking order.
 	 */
-	if (need_softclock)
+	if (need_softclock) {
+		softclock_sched_pending = 1;
+		nanotime(&softclock_sched);
 		swi_sched(softclock_ih, 0);
+	}
 }
Index: kern_timeout.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_timeout.c,v
retrieving revision 1.75
diff -u -2 -r1.75 kern_timeout.c
--- kern_timeout.c	1 Feb 2003 10:06:40 -0000	1.75
+++ kern_timeout.c	6 Mar 2003 14:33:27 -0000
@@ -47,4 +47,7 @@
 #include
 
+extern int softclock_sched_pending;
+extern struct timespec softclock_sched;
+
 /*
  * TODO:
@@ -139,4 +142,18 @@
 	static uint64_t maxdt = 18446744073709551LL;	/* 1 msec */
 #endif
+
+	mtx_lock_spin(&sched_lock);
+	if (softclock_sched_pending) {
+		struct timespec ts;
+
+		nanotime(&ts);
+		timespecsub(&ts, &softclock_sched);
+		softclock_sched_pending = 0;
+		mtx_unlock_spin(&sched_lock);
+		if (ts.tv_sec > 0 || ts.tv_nsec > 1000000)
+			printf("large softclock lag %d.%09ld\n",
+			    ts.tv_sec, ts.tv_nsec);
+	} else
+		mtx_unlock_spin(&sched_lock);
 #ifndef MAX_SOFTCLOCK_STEPS
%%%

This gives results like:

% Mar 22 06:19:00 besplex kernel: large softclock lag 0.012118505
% Mar 22 06:19:00 besplex kernel: large softclock lag 0.018373392
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.005993315
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.003627271
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.050219103
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.022892766
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.002526595
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.006122069
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.011606568
% Mar 22 06:20:03 besplex kernel: large softclock lag 0.007590287

softclock() may be delayed legitimately by higher-priority threads running
first, but I think all of the lags shown above are caused by the bug, since
none are reported for my kernel, where I think the only significant
difference is that clock interrupt handlers are non-fast.

Possible fixes:
- doing scheduling in fast interrupt handlers is a bug IMO.  It is not
  permitted in my kernel (%fs == 0).  (mtx_lock*() is not permitted either.)
- requests for things to be done in doreti and/or at the end of X*intr*
  basically need to set a simple condition that is checked there.  In
  RELENG_4, the condition was (ipending & ~cpl).  The condition shouldn't
  involve searching queues.  (A rough sketch of this idea is below.)
- critical_exit() should run interrupts that it unblocked.  It does this
  for hardware interrupts, but not for SWIs that were scheduled by calling
  swi_sched() while in the critical region.  Hopefully this is never done.

Bruce
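A rough sketch (invented names, not proposed code) of the "simple
condition" idea in the second fix above: the scheduling path only sets a
bit in a pending mask, and the interrupt return path tests the mask
against the blocked set with a couple of instructions, along the lines of
RELENG_4's (ipending & ~cpl) test, instead of searching queues.

#include <stdint.h>

static volatile uint32_t ipending;	/* bits set when an SWI/ithread run is requested */
static volatile uint32_t cpl;		/* currently blocked ("masked") levels */

/* Called from a fast interrupt handler: cheap, no locks, no queue walks. */
static void
swi_pend(unsigned int bit)
{
	ipending |= 1u << bit;		/* a real SMP kernel would need an atomic op */
}

/* Checked on the interrupt return path (doreti / end of X*intr*);
   nonzero means "go run the pending handlers now". */
static int
unblocked_swi_pending(void)
{
	return ((ipending & ~cpl) != 0);
}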