From: fbsd-lists@nixwiz.com
To: freebsd-current@freebsd.org
Date: Mon, 17 Nov 2003 21:24:53 -0500
Subject: psm (pr kern/59067) and irq 16 rate, some observations
Message-ID: <20031117212453.A98400@earl-grey.cloud9.net>

Sorry to start a new thread on this; I didn't keep any of the previous messages related to these issues to reply to. I'm hoping these observations can benefit someone and aid them in tracking this down. Please bear with the length.

I have noticed the following after rebooting into today's -current:

> vmstat -i
interrupt          total       rate
irq1: atkbd0        1733          0
irq6: fdc0             4          0
irq8: rtc        1101524        127
irq12: psm0        66164          7
irq13: npx0            1          0
stray irq13            1          0
irq14: ata0           11          0
irq15: ata1           30          0
irq16: uhci0   763728185      88733
irq23: ehci0      141466         16
irq24: em0         17154          1
irq50: mpt0       110516         12
irq0: clk         860593         99
Total          766027382      89000

It's clear that I am seeing the same out of control irq16 that some others have seen in the past few days.
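For anyone who wants to watch for this on their own machine, here is a minimal sketch that parses `vmstat -i` style output and flags sources with a very high rate. It is only an illustration: the function names, the abbreviated sample text, and the 10000/s threshold are all hypothetical choices, not anything taken from the system or the PR. The `interval_rate` helper rests on the assumption that the rate column is an average since boot, so the difference between two samples of the total column gives a better picture of the current rate.

```python
import re

# Abbreviated copy of the vmstat -i output quoted in this message.
SAMPLE = """\
interrupt          total       rate
irq1: atkbd0        1733          0
irq8: rtc        1101524        127
irq12: psm0        66164          7
stray irq13            1          0
irq16: uhci0   763728185      88733
irq0: clk         860593         99
Total          766027382      89000
"""

# A source name (may contain spaces), then two integer columns.
LINE_RE = re.compile(r"^(?P<name>.+?)\s+(?P<total>\d+)\s+(?P<rate>\d+)\s*$")


def parse_vmstat_i(text):
    """Return {source name: (total, rate)} for each interrupt line."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None or m.group("name") == "Total":
            continue  # skips the header row and the summary row
        counts[m.group("name")] = (int(m.group("total")), int(m.group("rate")))
    return counts


def noisy_sources(counts, threshold=10000):
    """Names whose reported rate exceeds threshold (a hypothetical cutoff)."""
    return sorted(name for name, (_, rate) in counts.items() if rate > threshold)


def interval_rate(total_before, total_after, seconds):
    """Interrupts per second between two samples taken `seconds` apart."""
    return (total_after - total_before) / seconds


counts = parse_vmstat_i(SAMPLE)
print(noisy_sources(counts))
```

On a live system the same parser could be fed the text of two `vmstat -i` runs taken a known number of seconds apart, with the two totals for a given source passed to `interval_rate`.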
This is on a Dell Precision 650 with dual Xeon 3.06 HTT processors, 1gig ram, bios A00.

I was in fact seeing this from -current a few days ago, but I got around it by taking the usb options out of my kernel config and ensuring that usbd did not start upon system boot. I put the ps/2 adapter on my M$ trackball, and all seemed well. With no usb devices, the usb kernel module didn't load, and nothing showed up on irq16. The cursor was behaving a little erratically (just a few times, and almost imperceptibly, in the few days I was running since the last cvsup), so I wrote it off to the difference between ps/2 and usb.

I didn't realize until today, when I did another cvsup and buildworld while running two instances of dnetc, that under load the mouse was almost unusable, with the kernel spitting out the infamous 'psmintr: out of sync (0000 != 0008).' errors to the console. Not only was the mouse not usable, the buildworld (make -j8) was bombing out at random points. I eventually got everything built by not moving the mouse at all and leaving the system alone until the process finished (and of course also removing the -j option).

After rebooting I tried the trackball on usb again, and found that the irq16 problem had not been fixed, so I went back to using the ps/2 port, which left me with the out of sync problem and random mouse events. (It seemed to get worse, as it was noticeable with just the dnetc running.)

After some research, I found the patch in pr 59067, and gave that a shot. In the first five minutes it seemed to fix my problems, until I really pushed the system by doing the usual make -j 8, loading multiple pages in mozilla, and rolling the trackball around wildly. Then the cursor froze along with my keyboard, so I had to ssh in, rebuild the kernel, and reboot. As a bit of feedback to the author of the patch: thanks, but it didn't work for me.

What I did find was that if I ran usbd and forced the irq16 problem to surface, my trackball worked fine whether on psm0 or ums0, and whether X was set to use sysmouse or the device directly. This isn't a scientific test, just an observation that seems to be true for me. I also found that the irq rate doesn't increase as quickly if I have the trackball actually attached to a usb port.

I should clarify the last point in case you haven't actually observed this. When I reboot the system with no usb probing, irq16 doesn't appear in the vmstat output. If I start usbd or reboot with usb probed, whether the trackball is connected to the usb port or not, irq16 doesn't seem to appear (I could swear on this, but I could be wrong). I can move the mouse around in console mode with moused running and it doesn't make a difference: no irq16. It's only when I start X that uhci0 becomes active and the rate starts in the low thousands.

As time passes, the rate steadily (quickly) increases. This increase does not appear to be related to mouse activity. The rate appears to increase much faster if there are no usb devices connected to the ports. If the trackball is connected, the rate appears to increase at a much slower pace. After some point, the rate slows down a bit and sometimes goes backwards by a few tens or hundreds at a time. However, system activity and the mere act of running vmstat may change this behavior, so I mostly see the number going up and don't often see it go down. It seems to hover at around 91000-92000, give or take a few hundred.

I will make a final note that I was originally using the ULE scheduler, but after the second reboot with today's cvsup (the first was into single user to installworld and mergemaster), starting up dnetc and a buildworld hung the system hard. No mouse, keyboard, or ping response.
After powercycling and a non-stressful kernel build with the 4BSD scheduler, I have not had any lockups (when the mouse and keyboard hung after I tried the patch, I was able to ssh in).

I have also seen the following strange behavior, but I only mention it in passing because I think it has more to do with Dell's hardware or the A00 bios than -current, and also because I didn't bother to take down the messages. When I first put -current on this machine, either 5.0 or 5.1-release, a soft reboot (shutdown -r) sometimes would not bring back all the hyperthreaded processors when the system rebooted. I would get just the two physical processors. A power cycle would bring them back. I thought something in the bios wasn't cleared and then wasn't probed correctly in a soft reboot, so it didn't bother me.

Today, after rebooting several times, I realized that all the HTT processors were being recognized after the reboots (maybe this has something to do with the interrupt routing changes). However, in one instance, the system probed the cpus and came back with a message and a question, which I didn't think to write down. It said something to the effect of 'Can't find AP #2, panic? [y/n]?' I didn't bother answering, powercycled, and the system came back fine. I haven't seen the message since.

This is not important, as it's easily resolved, but I was just wondering if maybe someone who has a Dell Precision 650 knows whether the A03 bios will fix the problem. I don't reboot much normally, and dislike bios flashing even if Dell makes it easy. There's always the chance that something more important will break.

Hope this long message was clear enough to understand. I'm just trying to get some observations down in the hope that it helps someone narrow down the issues. Hopefully this doesn't confuse people more.

Thanks for a great OS.