Date:      Thu, 30 Jan 1997 12:34:19 -0700
From:      Steve Passe <smp@csn.net>
To:        bag@sinbin.demos.su (Alex G. Bulushev)
Cc:        mishania@demos.su, freebsd-smp@freebsd.org
Subject:   Re: troubles with smp kernel 
Message-ID:  <199701301934.MAA18162@clem.systemsix.com>
In-Reply-To: Your message of "Thu, 30 Jan 1997 21:00:48 +0300." <199701301800.VAA18189@sinbin.demos.su>

Hi,

( note that I am answering 3 consecutive mailings in this response. I think
I have identified the problem, but don't get to that till the end, so
read on... )

---
>> > first, you should be using a kernel with options APIC_IO and options
>> > SMP_INVLTBL, although I doubt that is the cause of your problem. 
>	    ^^^ TLB
sorry...

>It works! but reboots once a hour :(

does it reboot EXACTLY once every hour, or just approximately that often?
What I am asking is whether it might be associated with some job being run by
cron, etc.

---
>static electricity killed mishania and now there is no PARITY ERROR !!
>why ?

in america we have a saying: don't look a gift horse in the mouth.

I guess the translation is, it works now, so don't complain!

Seriously speaking, this all points to hardware problems of some sort.
Might be something as simple as a loose SIMM.  I would power down the machine,
reseat the SIMMs, pull and re-insert all cards, including the CPU card and
CPUs.  Then do something about static control: spray static control
on the carpet, or whatever.  Static can destroy a machine!!!  You might try
another location, perhaps there's a bad electrical socket there (bad
ground leg, electrical noise, etc.)  Make sure the box is on a surge/noise
outlet strip or UPS.

---
>Fine Manual says we should leave JP5 to handle PIIX3 SMI, never turning APIC ON.
>We turned it on, of course, and it works only from then.

manuals for these things are often misleading or incorrect.  Unfortunately
you often have to "read between the lines" or even disbelieve everything they
say and just experiment.  It looks like you have found the right combination.

---
>Seems like it was my letter, but I didn't include mptable output then, here we
>all have it. But, I see it lies, - I _have_ APIC_IO uncommented ...

I'm not sure I understand.  If you mean that you ran mptable with a kernel
that has APIC_IO enabled, but got mptable output that was missing the INT
section, this is explainable.  You need to understand that the information
reported by mptable comes straight from what the BIOS provides; it has
nothing to do with which kernel is running.  You can run mptable from a
non-SMP kernel and get the same results.  What affects it is the position of
motherboard jumpers and BIOS settings.  Think of mptable as a tool for
getting all these things set up properly.

---
>MPTable, version 2.0.4
> ...
>Processors:   APIC ID Version State           Family  Model   Step    Flags
>               1       0x11    BSP, usable     6       1       6       0xfbff
>               0       0x11    AP, usable      6       1       7       0xfbff
                                                                ^

you had said earlier that you added an "identical" processor from another
machine, but this shows that they are a different stepping.  This may or may
not be a problem (one being stepping 6, the other being stepping 7).  The
safest thing would be to try to find 2 of the same stepping, but don't worry
too much if you can't....

the rest of the table looks good on first glance...

---
>options SMP_INVLTBL    # Steven.
                ^^^
this is my fault, proper spelling is:

options         SMP_INVLTLB

I would suggest you grab the latest mptable from the web page (2.0.6 I think);
it will have these newer options listed in its output.
---
>> is this area really missing or did you truncate the output?  there should be
>> a long list of INTerrupt associations here!!!
>
>this is a real output with JP5 default setings (PIIX3 SMI)
>
>now mptable output for JP5 in APIC SMI position:
> ...

obviously the manual is WRONG!

---
note that the following lines are grabbed from several of the previous
mailings, re-sorted to explain the issue:

>Bus:            Bus ID  Type
>                 0       PCI
>                 1       PCI
>                 2       PCI
>                 3       ISA

this shows the PCI bus on the motherboard (Bus 0) and the PCI busses
created by the PCI bridge chips on each of the 3940s (Bus 1 & Bus 2).
This is correctly done, by the way, and many SMP motherboards blow
it entirely.

>I/O Ints:       Type    Polarity    Trigger     Bus ID   IRQ    APIC ID INT#
>                INT     active-lo       level        1   4:A          2   19
>                INT     active-lo       level        1   5:A          2   16
>                INT     active-lo       level        0  10:A          2   18
>                INT     active-lo       level        2   4:A          2   16
>                INT     active-lo       level        2   5:A          2   17

>ahc0 <Adaptec 3940 Ultra SCSI host adapter> rev 0 int a irq 19 on pci1:4
>ahc1 <Adaptec 3940 Ultra SCSI host adapter> rev 0 int a irq 16 on pci1:5
>ahc2 <Adaptec 3940 Ultra SCSI host adapter> rev 0 int a irq 19 on pci2:4
>ahc3 <Adaptec 3940 Ultra SCSI host adapter> rev 0 int a irq 16 on pci2:5
                                                             ^^
                                                             ||
here is your major problem: ahc2 and ahc3 are getting the wrong INTs
assigned to them.  According to the I/O Ints table above (Bus ID 2,
devices 4 and 5), ahc2 should get IRQ16, and ahc3 should get IRQ17.

A little history to explain why the current code is failing:

The original MP spec 1.1 didn't take PCI bridge cards into account
and thus couldn't handle them.  Intel then added appendix D.2/3 to the
spec which attempted to clear this up, but many MBs didn't get
it right.  Beyond that it was unclear to me from the spec exactly
how the code should deal with it till I had a chance to work it thru
with several people who actually had this type of hardware.
As a result the current code ignores the Bus ID when assigning these
INTs.
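
To make the failure concrete, here is a small standalone C sketch of the
lookup, run against the five PCI entries from the I/O Ints table quoted
above.  It is only an illustration: struct io_int, pack_irq(), find_pin()
and quoted_table() are hypothetical names made up for this example, not
kernel code, and the sketch assumes the search simply returns the first
matching record.  It also leaves out the interrupt-type and bus-type checks
that the real routine performs.  The bit layout (device in bits 2-6, INT
line in bits 0-1) is the one the SRCBUSDEVICE and SRCBUSLINE macros decode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one I/O interrupt record of the MP table. */
struct io_int {
    int src_bus_id;   /* "Bus ID" column */
    int src_bus_irq;  /* bits 2-6: PCI device, bits 0-1: INT line (A=0 .. D=3) */
    int apic_pin;     /* "INT#" column */
};

/* Pack device and INT line the way SRCBUSDEVICE/SRCBUSLINE unpack them. */
static int pack_irq(int device, int line)
{
    assert(device >= 0 && device <= 0x1f);
    assert(line >= 0 && line <= 0x03);
    return (device << 2) | line;
}

static int src_device(const struct io_int *r) { return (r->src_bus_irq >> 2) & 0x1f; }
static int src_line(const struct io_int *r)   { return r->src_bus_irq & 0x03; }

/*
 * Return the APIC pin of the first record matching device and line, or -1.
 * check_bus == 0 ignores the Bus ID (the behaviour described above);
 * check_bus != 0 also requires the record's Bus ID to equal 'bus'.
 */
static int find_pin(const struct io_int *tbl, size_t n,
                    int bus, int device, int line, int check_bus)
{
    for (size_t i = 0; i < n; ++i) {
        if (check_bus && tbl[i].src_bus_id != bus)
            continue;
        if (src_device(&tbl[i]) == device && src_line(&tbl[i]) == line)
            return tbl[i].apic_pin;
    }
    return -1;
}

/* Fill 'out' with the five PCI entries from the quoted mptable output. */
static size_t quoted_table(struct io_int out[5])
{
    out[0] = (struct io_int){ 1, pack_irq(4, 0),  19 };  /* bus 1,  4:A */
    out[1] = (struct io_int){ 1, pack_irq(5, 0),  16 };  /* bus 1,  5:A */
    out[2] = (struct io_int){ 0, pack_irq(10, 0), 18 };  /* bus 0, 10:A */
    out[3] = (struct io_int){ 2, pack_irq(4, 0),  16 };  /* bus 2,  4:A */
    out[4] = (struct io_int){ 2, pack_irq(5, 0),  17 };  /* bus 2,  5:A */
    return 5;
}
```

With check_bus set to 0, find_pin() for bus 2 returns 19 for device 4 and
16 for device 5, which are the assignments ahc2 and ahc3 got in the boot
messages.  With check_bus set to 1 the same calls return 16 and 17, the
values the table actually lists for Bus ID 2.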

The simple solution here would be to run without the 2nd 3940.  The first one
is being properly assigned.  However, since your MB (ASUS) does the mp table
correctly I suggest the better alternative:

You could attempt to fix the code in sys/i386/i386/mp_machdep.c.  The following
patch hopefully will work, but I don't have an SMP machine right now so I
could not test it...  let me know if it works.

-------------------------------------- cut ---------------------------------
*** mp_machdep.c.old	Thu Dec 12 01:43:52 1996
--- mp_machdep.c	Thu Jan 30 12:07:38 1997
***************
*** 917,926 ****
  /*
   * determine which APIC pin a PCI INT is attached to.
   */
  #define SRCBUSDEVICE(I)	((ioApicINTs[(I)].srcBusIRQ >> 2) & 0x1f)
  #define SRCBUSLINE(I)	(ioApicINTs[(I)].srcBusIRQ & 0x03)
  int
! get_pci_apic_irq( int pciBus __attribute__ ((unused)),
  		  int pciDevice, int pciInt )
  {
      /**
--- 917,927 ----
  /*
   * determine which APIC pin a PCI INT is attached to.
   */
+ #define SRCBUSID(I)	(ioApicINTs[(I)].srcBusID)
  #define SRCBUSDEVICE(I)	((ioApicINTs[(I)].srcBusIRQ >> 2) & 0x1f)
  #define SRCBUSLINE(I)	(ioApicINTs[(I)].srcBusIRQ & 0x03)
  int
! get_pci_apic_irq( int pciBus,
  		  int pciDevice, int pciInt )
  {
      /**
***************
*** 932,937 ****
--- 933,939 ----
  
      for ( intr = 0; intr < nintrs; ++intr )	/* search each record */
  	if ( (INTTYPE( intr ) == 0)
+ 	     && (SRCBUSID( intr ) == pciBus)
  	     && (SRCBUSDEVICE( intr ) == pciDevice)
  	     && (SRCBUSLINE( intr ) == pciInt) )	/* a candidate IRQ */
  	    if ( apicIntIsBusType( intr, PCI ) )	/* check bus match */
***************
*** 941,946 ****
--- 943,949 ----
  }
  #undef SRCBUSLINE
  #undef SRCBUSDEVICE
+ #undef SRCBUSID
  
  #undef INTPIN
  #undef INTTYPE
-------------------------------------- cut ---------------------------------

I expect the above to make things much better, assuming you were using devices
on the 2nd 3940.  Note that the above patch will actually cause many
motherboards to STOP working because they don't do the mp table stuff
correctly!  This is why I haven't submitted such a change to the code.  The
real fix is going to involve analyzing the mp table, then making a CORRECTED
in-core copy when the kernel boots.  It ain't gonna be pretty, and it ain't
gonna be easy to get right, so I have been avoiding it!!!

--
Steve Passe	| powered by
smp@csn.net	|            FreeBSD

-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: 2.6.2

mQCNAzHe7tEAAAEEAM274wAEEdP+grIrV6UtBt54FB5ufifFRA5ujzflrvlF8aoE
04it5BsUPFi3jJLfvOQeydbegexspPXL6kUejYt2OeptHuroIVW5+y2M2naTwqtX
WVGeBP6s2q/fPPAS+g+sNZCpVBTbuinKa/C4Q6HJ++M9AyzIq5EuvO0a8Rr9AAUR
tBlTdGV2ZSBQYXNzZSA8c21wQGNzbi5uZXQ+
=ds99
-----END PGP PUBLIC KEY BLOCK-----



