Date: Sun, 14 Jun 1998 11:00:01 -0700 (PDT)
From: Matthew Dillon <dillon@backplane.com>
To: freebsd-bugs@FreeBSD.ORG
Subject: Re: i386/6944: bug in i386/isa/icu_ipl.s - AST gets lost, causes extreme network slowdown when cpu-bound processes present, possibly other problems
Message-ID: <199806141800.LAA11833@freefall.freebsd.org>
The following reply was made to PR i386/6944; it has been noted by GNATS.
From: Matthew Dillon <dillon@backplane.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: i386/6944: bug in i386/isa/icu_ipl.s - AST gets lost, causes extreme network slowdown when cpu-bound processes present, possibly other problems
Date: Sun, 14 Jun 1998 10:54:42 -0700 (PDT)
:> cmpl $SWI_AST,%ecx
:> je splz_nextx /* "can't happen" */
:>
:> Actually can happen. I'm not exactly sure how it happens, but the
:> result is that that AST gets cleared from ipending without being run.
:
:It "can't happen" because SWI_AST_MASK is "always" set in `cpl' until
:the kernel is about to return to user mode. Something must be clearing
:SWI_AST_MASK in `cpl' or in the cpl to be "restored". The typo spl(0)
:instead of spl0() would do this. Please look for whatever does it.
:This may be as simple as looking at the stack trace to see spl(0) and
:verifying that SWI_AST_MASK is set (you can't trust the latter since
:ddb doesn't mask interrupts).
:
:Bruce
Well, I spent 6 hours from 9 p.m. to 3 a.m. just to find this :-) I'm going
to leave the finding of the broken spl to someone else, but there ARE
several places where $0 is loaded into the cpl in the assembly, and
other places where the interrupt nesting count is manually reset to 1.
I'm not sure it's necessary to 'reset' the cpl states; the standard
interrupt context push/pop ought to do that inherently, so if things are
being left dangling there's definitely something wrong elsewhere in the
code that these manual resets are 'covering up'. It could be anywhere.
The spl0()/splz() stuff is a mess and should probably be removed entirely.
The problem is extremely reproducible... just NFS mount / and /usr from
a server to a workstation, run a for (;;); process on the server, and
try to run xterm on the workstation and, poof.
When I did this, vmstat showed the number of context switches never
exceeded 100/sec. Hmm... suspicious! Without ./x (the for (;;); process)
running, the number of context switches went to 600+/sec for two seconds
to load xterm via NFS. With ./x running the number of context switches
was around 50/sec and running xterm on the client increased it to only
100/sec, and xterm took forever to load via NFS.
With the fix and ./x running, xterm took only 2 seconds to load via
NFS and was completely unaffected by the existence of the cpu-bound
task.
-
I'd suggest changing the assembly to do a sanity check of the cpl rather
than simply save/restore it around an SWI (or normal interrupt for that
matter)... if the cpl isn't in the state it was in before the call
to the handler, printf() a warning.
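
A minimal sketch of that sanity check, written in C rather than the kernel's
assembly. Everything here is hypothetical and for illustration only: the
`cpl` variable is a stand-in for the kernel's priority mask, and
`call_handler_checked`, `cpl_mismatches`, and the two sample handlers are
made-up names, not anything from icu_ipl.s.

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's current priority level mask. */
static unsigned int cpl;

/* Number of handlers caught returning with a changed cpl. */
static unsigned int cpl_mismatches;

typedef void (*handler_t)(void);

/*
 * Call a handler and check that it returns with cpl exactly as it found
 * it. A mismatch is reported before the saved value is put back, instead
 * of being silently papered over by an unconditional restore.
 */
static void
call_handler_checked(const char *name, handler_t handler)
{
	unsigned int saved = cpl;

	handler();
	if (cpl != saved) {
		cpl_mismatches++;
		printf("warning: %s returned with cpl %#x, expected %#x\n",
		    name, cpl, saved);
		cpl = saved;
	}
}

/* Well-behaved sample handler: raises cpl, then restores it. */
static void
good_handler(void)
{
	unsigned int s = cpl;

	cpl |= 0x4u;
	cpl = s;
}

/* Misbehaving sample handler: leaves cpl zeroed, like a stray spl(0). */
static void
bad_handler(void)
{
	cpl = 0;
}
```

Calling `call_handler_checked("bad", bad_handler)` with a nonzero `cpl`
prints one warning and restores the saved value; the well-behaved handler
passes silently.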
I also noticed that the fast interrupt code doesn't save/restore the cpl
around the call to the interrupt handler, but the 'normal' interrupt
code does. I believe the code thinks this is ok because it's leaving
the cpu CLI'd through the call, but I actually think the slow interrupt
handler results in faster operation because the interrupt context doesn't
get popped & repushed through a ring change if a nested interrupt occurs.
I also submit that the fast interrupt code doesn't make the system any
more responsive... the two critical time-sensitive interrupts are the
ethernet rx and the serial rx and neither is able to keep up as it
stands... our 100BaseTX boards almost universally get knocked back into
store and forward mode due to rx overruns after the machine's been up
for a while, and anyone with a digital camera can tell you that the
serial interrupt sucks rocks in terms of being able to process exceptions
at a high rate in unhandshaked mode without overrunning. Running a
'fast' interrupt with interrupts disabled isn't a hot idea when the 'fast'
interrupt isn't the serial or network receive interrupt!
But whatever the case, the core assembly shouldn't be gratuitously
clearing the AST from ipending if it doesn't intend to run the AST
trap :-)
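
To make the failure mode concrete, here is a small C model of the behaviour
described in this PR: an unmasked pending AST bit gets cleared from ipending
by a splz-like loop that never delivers it. This is a sketch under
assumptions, not the code in icu_ipl.s. The bit numbers are illustrative
(not taken from the kernel headers), and `run_unmasked`, `return_to_user`,
`lowest_bit`, and the counters are hypothetical names.

```c
#include <stdbool.h>

/* Illustrative bit numbers only; not taken from the kernel headers. */
#define SWI_NET		28
#define SWI_AST		31
#define SWI_AST_MASK	(1u << SWI_AST)

static unsigned int cpl;	/* masked (blocked) interrupt bits */
static unsigned int ipending;	/* pending interrupt bits */
static unsigned int swi_runs;	/* ordinary soft interrupts delivered */
static unsigned int ast_runs;	/* ASTs delivered on return to user mode */

/* Index of the lowest set bit; the caller guarantees v != 0. */
static int
lowest_bit(unsigned int v)
{
	int i = 0;

	while ((v & 1u) == 0) {
		v >>= 1;
		i++;
	}
	return (i);
}

/*
 * Model of a splz-like loop: run every pending, unmasked interrupt
 * except the AST, which this path never delivers. With
 * clear_ast_when_skipping set, the AST bit is cleared from ipending
 * anyway (the lost-AST behaviour); otherwise it is left pending.
 */
static void
run_unmasked(bool clear_ast_when_skipping)
{
	unsigned int ready;

	while ((ready = ipending & ~cpl) != 0) {
		int bit = lowest_bit(ready);

		if (bit == SWI_AST) {
			if (clear_ast_when_skipping)
				ipending &= ~SWI_AST_MASK;
			break;	/* AST is the highest bit, nothing left */
		}
		ipending &= ~(1u << bit);
		swi_runs++;
	}
}

/* Model of the return-to-user path, the only place an AST is run. */
static void
return_to_user(void)
{
	if (ipending & SWI_AST_MASK) {
		ipending &= ~SWI_AST_MASK;
		ast_runs++;
	}
}
```

With `cpl` set to 0 and both bits pending, `run_unmasked(true)` followed by
`return_to_user()` delivers the network soft interrupt but never the AST.
With `run_unmasked(false)` the AST survives and is run on the way out. When
SWI_AST_MASK is set in `cpl`, as Bruce says it "always" should be, the loop
never reaches the AST bit at all.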
-Matt
Matthew Dillon Engineering, BEST Internet Communications, Inc.
<dillon@backplane.com>
[always include a portion of the original email in any response!]
