From owner-freebsd-current@FreeBSD.ORG  Mon Nov 17 18:24:55 2003
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 74D0A16A4CE
	for <freebsd-current@freebsd.org>;
	Mon, 17 Nov 2003 18:24:55 -0800 (PST)
Received: from earl-grey.cloud9.net (earl-grey.cloud9.net [168.100.1.1])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 94FD143FB1
	for <freebsd-current@freebsd.org>;
	Mon, 17 Nov 2003 18:24:54 -0800 (PST)
	(envelope-from zhaoc@cloud9.net)
Received: by earl-grey.cloud9.net (Postfix, from userid 15177)
	id 6E01F2AA22; Mon, 17 Nov 2003 21:24:53 -0500 (EST)
Date: Mon, 17 Nov 2003 21:24:53 -0500
From: fbsd-lists@nixwiz.com
To: freebsd-current@freebsd.org
Message-ID: <20031117212453.A98400@earl-grey.cloud9.net>
Mail-Followup-To: freebsd-current@freebsd.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Subject: psm (pr kern/59067) and irq 16 rate, some observations
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 18 Nov 2003 02:24:55 -0000

Sorry to start a new thread on this; I didn't keep any of the previous 
messages related to these issues to reply to.  I'm hoping these
observations can benefit someone and aid them in tracking this down.
Please bear with the length.

I have noticed the following after rebooting into today's -current:

> vmstat -i
interrupt                          total       rate
irq1: atkbd0                        1733          0
irq6: fdc0                             4          0
irq8: rtc                        1101524        127
irq12: psm0                        66164          7
irq13: npx0                            1          0
stray irq13                            1          0
irq14: ata0                           11          0
irq15: ata1                           30          0
irq16: uhci0                   763728185      88733
irq23: ehci0                      141466         16
irq24: em0                         17154          1
irq50: mpt0                       110516         12
irq0: clk                         860593         99
Total                          766027382      89000

It's clear that I am seeing the same out of control irq16 that some
others have seen in the past few days.  This is on a Dell Precision
650 with dual Xeon 3.06 HTT processors, 1gig ram, bios A00.
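
For anyone who wants to put a number on "out of control", here is a
minimal Python sketch that parses the vmstat -i output above and prints
each source's share of the total.  It is not part of any FreeBSD tool;
the names parse_vmstat_i and shares are made up for illustration, and
it assumes the three-column layout shown above (name, total, rate).

```python
import re

# The vmstat -i output quoted above, copied verbatim.
SAMPLE = """\
interrupt                          total       rate
irq1: atkbd0                        1733          0
irq6: fdc0                             4          0
irq8: rtc                        1101524        127
irq12: psm0                        66164          7
irq13: npx0                            1          0
stray irq13                            1          0
irq14: ata0                           11          0
irq15: ata1                           30          0
irq16: uhci0                   763728185      88733
irq23: ehci0                      141466         16
irq24: em0                         17154          1
irq50: mpt0                       110516         12
irq0: clk                         860593         99
Total                          766027382      89000
"""

# Assumed layout: free-form name, then two integer columns.
LINE_RE = re.compile(r"^(?P<name>.+?)\s+(?P<total>\d+)\s+(?P<rate>\d+)\s*$")


def parse_vmstat_i(text):
    """Return {name: (total, rate)}, skipping the header and the Total row."""
    rows = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # header or blank line: no numeric columns
        name = m.group("name").strip()
        if name == "Total":
            continue
        rows[name] = (int(m.group("total")), int(m.group("rate")))
    return rows


def shares(rows):
    """Return {name: fraction of all interrupts} from the parsed rows."""
    grand = sum(total for total, _ in rows.values())
    return {name: total / grand for name, (total, _) in rows.items()}


if __name__ == "__main__":
    parsed = parse_vmstat_i(SAMPLE)
    # Largest contributor first.
    for name, frac in sorted(shares(parsed).items(), key=lambda kv: -kv[1]):
        print(f"{name:<20} {frac:8.3%}")
```

On the figures above, irq16 accounts for well over 99% of all
interrupts counted since boot.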

I was in fact seeing this from -current a few days ago, but I got around
it by taking the usb options out of my kernel config and ensuring that
usbd did not start upon system boot.  I put the ps/2 adapter on my M$
trackball, and all seemed well.  With no usb devices, the usb kernel
module didn't load, and nothing showed up on irq16.

The cursor was behaving a little erratically (just a few times, and
almost imperceptibly in the few days I was running since the last cvsup), 
so I wrote it off to the difference between ps/2 and usb.

I didn't realize until today, when I did another cvsup and build world
while running two instances of dnetc, that under load the mouse was
almost unusable, with the kernel spitting out the infamous
'psmintr: out of sync (0000 != 0008).' errors to the console.

Not only was the mouse unusable, the build world (make -j8) was bombing
out at random points.  I eventually got everything built by not moving
the mouse at all and leaving the system alone until the process finished
(of course also removing the -j option).

After rebooting I tried the trackball on usb again, and found that the
irq16 problem had not been fixed, so I went back to using the ps/2 port,
which left me with the out of sync problem and random mouse events.
(It seemed to get worse, as it was noticeable with just the dnetc running.)

After some research, I found the patch in pr 59067, and gave that a shot.
In the first five minutes it seemed to fix my problems, until I really
pushed the system by doing the usual make -j 8, loading multiple pages in
mozilla, and rolling the trackball around wildly.  Then the cursor froze
along with my keyboard, so I had to ssh in, rebuild the kernel, and reboot.
As a bit of feedback to the author of the patch: thanks, but it didn't
work for me.

What I did find was that if I ran usbd and forced the irq16 problem to
surface, my trackball worked fine whether on psm0 or ums0, and whether X
was set to use sysmouse or the device directly.  This isn't a scientific
test, just an observation that seems to be true for me.  I also found
that the irq rate doesn't increase as quickly if I have the trackball
actually attached to a usb port.

I should clarify the last point in case you haven't actually observed this.
When I reboot the system, with no usb probing, irq16 doesn't appear
in the vmstat output.  If I start usbd or reboot with usb probed,
whether the trackball is connected to the usb port or not, irq16 doesn't 
seem to appear (I could swear on this, but I could be wrong).  I can move
the mouse around in console mode with moused running and it wouldn't 
make a difference: no irq16.

It's only when I start X that uhci0 becomes active and the rate starts 
in the low thousands.  As time passes, the rate steadily (quickly) 
increases.  This increase does not appear to be related to mouse activity.
The rate appears to increase much faster if there are no usb devices
connected to the ports.  If the trackball is connected, the rate
appears to increase at a much slower pace.

After some point, the rate slows down a bit and sometimes goes backwards
by a few tens or hundreds at a time.  However, system activity and the
mere act of running the vmstat may change this behavior so I mostly
see the number going up and don't often see it go down.  It seems to
hover at around 91000-92000, give or take a few hundred.
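
One caveat on reading these numbers: as far as I know, the rate column
in vmstat -i is the total divided by uptime, i.e. an average since
boot, not an instantaneous rate.  If that is right, a source that
starts firing at a constant rate some time after boot would show
exactly this pattern: a rate that climbs quickly, then levels off
toward the true value.  The Python sketch below illustrates both
points.  The function names and all the numbers in the demo are
hypothetical; the only input it needs is two snapshots of the totals
taken a known number of seconds apart.

```python
def interval_rates(before, after, seconds):
    """Interrupts per second for each source between two snapshots.

    before, after: dicts mapping source name -> cumulative total.
    seconds: wall-clock time between the two snapshots.
    A source missing from `before` is treated as starting at zero.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        name: (after[name] - before.get(name, 0)) / seconds
        for name in after
    }


def average_since_boot(rate, start, now):
    """Average that a total/uptime counter would show at uptime `now`
    if a source fired at a constant `rate` from uptime `start` onward."""
    if now <= start:
        return 0.0
    return rate * (now - start) / now


if __name__ == "__main__":
    # Hypothetical snapshots taken 10 seconds apart.
    first = {"irq16: uhci0": 1_000_000}
    second = {"irq16: uhci0": 1_920_000}
    print(interval_rates(first, second, 10))

    # Hypothetical storm: constant 92000/s starting 600 s after boot.
    for uptime in (660, 1200, 3600, 8600):
        print(uptime, round(average_since_boot(92000, 600, uptime)))
```

Comparing the interval rate with the since-boot average would show
whether the interrupt rate itself is really growing, or whether the
average is just catching up with a steady storm.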

I will make a final note that I was originally using the ULE scheduler, 
but after the second reboot with today's cvsup (the first was into single 
user to installworld and mergemaster), starting up dnetc and a buildworld
hung the system hard.  No mouse, keyboard, or ping response.  After
powercycling and a non-stressful kernel build with the 4BSD scheduler,
I have not had any lockups (when the mouse and keyboard hung after I
tried the patch, I was able to ssh in).

I have also seen the following strange behavior, but I only mention
it in passing because I think it has more to do with Dell's hardware
or the A00 bios than -current, and also because I didn't bother to
take down the messages.

When I first put -current on this machine, either 5.0 or 5.1-release,
a soft reboot (shutdown -r) sometimes would not bring back all the
hyperthreaded processors when the system rebooted.  I would get just
the two physical processors.  A power cycle would bring them back.

I thought something in the bios wasn't cleared and then wasn't probed
correctly in a soft reboot, so it didn't bother me.  Today, after
rebooting several times, I realized that all the HTT processors were 
being recognized after the reboots (maybe this has something to do
with the interrupt routing changes).  

However, in one instance, the system probed the cpus and came back
with a message and a question, which I didn't think to write down.
It said something to the effect of 'Can't find AP #2, panic? [y/n]?'  
I didn't bother answering, powercycled, and the system came back fine.  
I haven't seen the message since.  

This is not important, as it's easily resolved, but I was just wondering 
if maybe someone who has a Dell Precision 650 knows whether the A03 bios 
will fix the problem.  I don't reboot much normally, and dislike bios 
flashing even if Dell makes it easy.  There's always the chance that 
something more important will break.

Hope this long message was clear enough to understand.  I'm just trying
to get some observations down in the hope that it helps someone narrow
down the issues.  Hopefully this doesn't confuse people more.  Thanks
for a great OS.