Date: Sun, 22 Mar 1998 05:33:44 +0800
From: Peter Wemm <peter@netplex.com.au>
To: smp@FreeBSD.ORG
Subject: is it just me, or...
Message-ID: <199803212133.FAA03833@spinner.netplex.com.au>
Is it just me, or has the smp kernel taken a nose dive over the last month or so?

Some of the problems I see on a dual ppro@200 (pci/isa, no cards except pci vga, and onboard fxp, 64MB ram, 200MB swap, very little swap activity):

- when running cpu intensive processes (eg: rc564 or rc5des), I get particularly lousy interactive response (no keyboard/screen, this machine is accessed via serial console and ssh only). Quite often it can be up to 20 or 30 seconds before I get character echo. If I run a pair of rc5 processes, the problem pretty much goes away.

- rc5* fails very often... I can only run it for a few minutes before it corrupts itself and its checkpoint files, losing the current key. Once it's failed, it stops its periodic checkpointing. Explicitly killing it still causes the checkpoint to be updated, but recovering the key always causes a SEGV at the 100% complete mark.

- when running a single rc5* process, I regularly get fxp0 and ahc0 device timeouts, corresponding with the response hangs.

Problems I see on a dual p5-90 system (pci/eisa/isa, with pci vga, pci de0, eisa ahc2742T, and misc isa cards, 48MB ram, 210MB swap space - often 75% full):

- I get regular sig-11 core dumps, although nowhere near as many as I used to get a few weeks ago. In particular, large memory image processes are hardest hit.. the ones that die most often are netscape and exmh2 (ie: wish8.0p2). My early netscape communicator aborts were due to a low data size limit, but this has been well and truly fixed. wish easily gets up to about 10MB of ram (I have some very large mail folders :-] ).

- 95% of the time at bootup I get a pair of device timeouts, resets, aborts and just about every other error message for sd0 and sd1 when "fsck -p" starts. eg:

    SMP: AP CPU #1 Launched!
    WARNING: / was not properly dismounted.
    de0: enabling 10baseT port
    sd0: SCB 0x1 - timed out in message in phase, SCSISIGI == 0xe6
    SEQADDR = 0x168 SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x3
    sd0: abort message in message buffer
    ahc0:A:0: Missed busfree.
    sd0: SCB 0x1 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0xb6
    SEQADDR = 0x4 SCSISEQ = 0x5a SSTAT0 = 0x7 SSTAT1 = 0x13
    sd0: no longer in timeout
    ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
    sd1: SCB 0x0 - timed out in message out phase, SCSISIGI == 0xb6
    SEQADDR = 0xab SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x3
    Ordered Tag queued
    sd1: no longer in timeout
    Ordered Tag sent
    sd0: SCB 0x1 - timed out in message in phase, SCSISIGI == 0xe6
    SEQADDR = 0x168 SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x3
    sd0: abort message in message buffer
    ahc0:A:0: Missed busfree.
    sd0: SCB 0x1 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0xb6
    SEQADDR = 0x5 SCSISEQ = 0x5a SSTAT0 = 0x7 SSTAT1 = 0x13
    sd0: no longer in timeout
    ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
    sd1: SCB 0x2 - timed out in message out phase, SCSISIGI == 0xb6
    SEQADDR = 0xa2 SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x3
    Ordered Tag queued
    sd0: UNIT ATTE sd1: no longer in timeout NTION asc:29,0
    sd0: Power on, reset, or bus device reset occurred field replaceable unit: 14 , retries:2
    Ordered Tag sent

This *only* happens when smp is active. If I compile the same source, same config etc but with SMP disabled, it *never* happens. It also doesn't happen when I use Justin's CAM scsi code (last time I checked a few months ago). This has been a 6 month+ problem though, not something new. I'd have changed to Justin's CAM stuff but maintaining an extra set of diffs was just too much (and besides, the fxp driver has the same problem on the ppro system).

What worries me is this particular one near the end:

    sd0: UNIT ATTE sd1: no longer in timeout NTION asc:29,0

.. this reminds me of reentrancy problems back in the early SMP days where we'd end up with both processors executing in the kernel at the same time.
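To make the interleaving concrete, here is a minimal sketch (plain Python, not kernel code) of how the garbled console line could arise if one message were interrupted partway through by another. The function name `interleave`, the split point, and the exact whitespace around the second message are assumptions made for illustration; the archived text does not preserve the original line breaks.

```python
def interleave(first: str, second: str, cut: int) -> str:
    """Emit first[:cut], then all of second, then the remainder of first.

    This models one writer being interrupted mid-message by another writer
    on a shared, unlocked console. It is a deterministic illustration only.
    """
    return first[:cut] + second + first[cut:]


# Hypothetical reconstruction of the two complete messages.
msg_a = "sd0: UNIT ATTENTION asc:29,0"
# Surrounding spaces are an assumption (the original may have had newlines).
msg_b = " sd1: no longer in timeout "

# Assume the first message was cut right after "sd0: UNIT ATTE".
garbled = interleave(msg_a, msg_b, len("sd0: UNIT ATTE"))
print(garbled)
```

Under these assumptions the sketch reproduces the line quoted above, which fits the reading that two contexts were writing to the console at once. It does not establish which code paths were involved.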
Is anybody else seeing this sort of thing, or is it just me? Both of these systems have been highly modified (the p5-90 has around 400 modified files in a 'cvs update' listing, while the ppro system is a pure elf machine (with a mostly clean kernel right at the moment)) - the systems do not have any changes in common (that I know of :-). I have not had enough time to closely track the changes over the last few months..

Oh, one other odd thing.. the p5-90 machine happily runs rc5des all day without a single response problem, while the ppro just about dies with a single rc5des running (it needs both running to maintain reasonable response. Yes, they are in different directories and not conflicting with each other's checkpoint and key buffer files)..

Hmm, the only major things I can think of that are different between kernels in p5 and p6 mode are the bcopy (p5 uses fpu, p6 uses cpu), and that the p6 uses PG_G.

Cheers,
-Peter

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message