Date:      Sun, 22 Mar 1998 05:33:44 +0800
From:      Peter Wemm <peter@netplex.com.au>
To:        smp@FreeBSD.ORG
Subject:   is it just me, or...
Message-ID:  <199803212133.FAA03833@spinner.netplex.com.au>

Is it just me or has the smp kernel taken a nose dive over the last month 
or so?

Some of the problems I see on a dual ppro@200 (pci/isa, no cards except 
pci vga, and onboard fxp, 64MB ram, 200MB swap, very little swap activity):

- when running cpu intensive processes (eg: rc564 or rc5des), I get
particularly lousy interactive response (no keyboard/screen, this machine
is accessed via serial console and ssh only).  Quite often it takes 20 or
30 seconds before I get character echo.  If I run a pair of rc5
processes, the problem pretty much goes away.

- rc5* fails very often...  I can only run it for a few minutes before it
corrupts itself and its checkpoint files, losing the current key. Once
it has failed, it stops its periodic checkpointing.  Explicitly killing it
still causes the checkpoint to be updated, but recovering the key
always causes a SEGV at the 100% complete mark.

- when running a single rc5* process, I regularly get fxp0 and ahc0 device 
timeouts, corresponding with the response hangs.

Problems I see on a dual p5-90 system (pci/eisa/isa, with pci vga, pci de0, 
eisa ahc2742T, and misc isa cards, 48MB ram, 210MB swap space - often 75% 
full):

- I get regular sig-11 core dumps, although nowhere near as many as I used
to get a few weeks ago.  In particular, large memory image processes are
hardest hit.. the ones that die most often are netscape and exmh2 (ie:
wish8.0p2).  My early netscape communicator aborts were due to a low data
size limit, but this has been well and truly fixed.  wish easily gets up
to about 10MB of ram (I have some very large mail folders :-] ).

- 95% of the time at bootup I get a pair of device timeouts, resets, 
aborts and just about every other error message for sd0 and sd1 when "fsck 
-p" starts.  eg:

SMP: AP CPU #1 Launched!
WARNING: / was not properly dismounted.
de0: enabling 10baseT port
sd0: SCB 0x1 - timed out in message in phase, SCSISIGI == 0xe6
SEQADDR = 0x168 SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x3
sd0: abort message in message buffer
ahc0:A:0: Missed busfree.
sd0: SCB 0x1 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0xb6
SEQADDR = 0x4 SCSISEQ = 0x5a SSTAT0 = 0x7 SSTAT1 = 0x13
sd0: no longer in timeout
ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
sd1: SCB 0x0 - timed out in message out phase, SCSISIGI == 0xb6
SEQADDR = 0xab SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x3
Ordered Tag queued
sd1: no longer in timeout
Ordered Tag sent
sd0: SCB 0x1 - timed out in message in phase, SCSISIGI == 0xe6
SEQADDR = 0x168 SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x3
sd0: abort message in message buffer
ahc0:A:0: Missed busfree.
sd0: SCB 0x1 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0xb6
SEQADDR = 0x5 SCSISEQ = 0x5a SSTAT0 = 0x7 SSTAT1 = 0x13
sd0: no longer in timeout
ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
sd1: SCB 0x2 - timed out in message out phase, SCSISIGI == 0xb6
SEQADDR = 0xa2 SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x3
Ordered Tag queued
sd0: UNIT ATTE
sd1: no longer in timeout
NTION asc:29,0
sd0:  Power on, reset, or bus device reset occurred field replaceable unit: 14
, retries:2
Ordered Tag sent

This *only* happens when smp is active.  If I compile the same source, 
same config etc but with SMP disabled, it *never* happens.  It also 
doesn't happen when I use Justin's CAM scsi code (last time I checked a 
few months ago).  This has been a 6 month+ problem though, not something 
new.  I'd have changed to Justin's CAM stuff but maintaining an extra set 
of diffs was just too much (and besides, the fxp driver has the same 
problem on the ppro system).

What worries me is this particular one near the end:
sd0: UNIT ATTE
sd1: no longer in timeout
NTION asc:29,0
..  this reminds me of reentrancy problems back in the early SMP days 
where we'd end up with both processors executing in the kernel at the same 
time.
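To make the concern concrete, here is a toy model of the torn console line quoted above. It is only a sketch, not the kernel's console path: the names torn_lines, LockedConsole and run_writers are hypothetical, the full sd0 message is reassembled from the two fragments in the log, and the split offset of 14 is chosen simply to match where the log line breaks. The first function reproduces the observed pattern; the class shows that writers which hold a lock for a whole line cannot cut into each other.

```python
import threading


def torn_lines(first, second, split_at):
    """Toy model of the log: `first` is cut off after `split_at` characters,
    `second` is printed whole, then the rest of `first` appears."""
    return [first[:split_at], second, first[split_at:]]


class LockedConsole:
    """Console model in which a writer holds a lock for one whole line."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = []   # shared buffer for the line being written
        self.lines = []      # completed lines, in the order they finished

    def write_line(self, text):
        with self._lock:
            # Characters go into the shared buffer one at a time, but no
            # other writer can get in until this line has been flushed.
            for ch in text:
                self._current.append(ch)
            self.lines.append("".join(self._current))
            self._current.clear()


def run_writers(console, messages):
    """Start one thread per message and wait for all of them to finish."""
    threads = [threading.Thread(target=console.write_line, args=(m,))
               for m in messages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return console.lines


if __name__ == "__main__":
    msgs = ["sd0: UNIT ATTENTION asc:29,0", "sd1: no longer in timeout"]
    print("\n".join(torn_lines(msgs[0], msgs[1], 14)))
    print("\n".join(run_writers(LockedConsole(), msgs)))
```

With the lock in place the two lines may come out in either order, but each one is always intact. A line split in the middle like the one in the log suggests that nothing of the sort was protecting the output at that moment.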

Is anybody else seeing this sort of thing, or is it just me?  Both of 
these systems have been highly modified (the p5-90 has around 400 modified 
files in a 'cvs update' listing, while the ppro system is a pure elf 
machine (with a mostly clean kernel right at the moment)) - the systems do 
not have any changes in common (that I know of :-).

I have not had enough time to closely track the changes over the last few 
months..

Oh, one other odd thing..  the p5-90 machine happily runs rc5des all day
without a single response problem, while the ppro just about dies with a
single rc5des running (it needs both running to maintain reasonable 
response.  Yes, they are in different directories and not conflicting with 
each other's checkpoint and key buffer files)..  Hmm, the only major things
I can think of that are that different between kernels in p5 and p6 mode are 
the bcopy (p5 uses fpu, p6 uses cpu), and that the p6 uses PG_G.


Cheers,
-Peter



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message


