From owner-freebsd-current  Tue Sep 17 20:15:35 1996
Return-Path: owner-current
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id UAA07116
          for current-outgoing; Tue, 17 Sep 1996 20:15:35 -0700 (PDT)
Received: from root.com (implode.root.com [198.145.90.17])
          by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id UAA07083;
          Tue, 17 Sep 1996 20:15:25 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1]) by root.com (8.7.5/8.6.5) with SMTP id UAA11084; Tue, 17 Sep 1996 20:16:07 -0700 (PDT)
Message-Id: <199609180316.UAA11084@root.com>
X-Authentication-Warning: implode.root.com: Host localhost [127.0.0.1] didn't use HELO protocol
To: asami@FreeBSD.org (Satoshi Asami)
cc: current@FreeBSD.org, haertel@ichips.intel.com, erich@uruk.org
Subject: Re: RAM parity error 
In-reply-to: Your message of "Tue, 17 Sep 1996 18:37:45 PDT."
             <199609180137.SAA09571@silvia.HIP.Berkeley.EDU> 
From: David Greenman <dg@root.com>
Reply-To: dg@root.com
Date: Tue, 17 Sep 1996 20:16:07 -0700
Sender: owner-current@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

   ["parity" errors on P6 machines during heavy I/O]

>Is there any reason why the above would happen when it is NOT the
>hardware that's broken?  I've seen it on a couple of P6 boxes around
>here, with or without ccd, when I try to push a lot of stuff through
>the SCSI system (like parallel iozone's on multiple non-ccd
>filesystems).

   *Very* interesting...

>#4  0xf01b4b61 in calltrap ()
>#5  0xf0194652 in scsi_scsi_cmd ()
>#6  0xf0198269 in sdstart ()

   I'll bet that the real traceback has a "#4.5" that is ahc_scsi_cmd(). gdb
often doesn't decode the traceback correctly since it doesn't handle
trapframes properly. I'm seeing *exactly* the same behavior on wcarchive (B0
Orion, Stepping 1 of the P6). During heavy disk I/O, I occasionally see "RAM
parity errors" during the outsl instruction. I've additionally seen *weird*
traps - reserved traps that are outside the 0-18 range. These also happen
during this _same_ outsl instruction. I believe that whatever is causing this
is also the cause of the machine hangs that I'm seeing, sometimes multiple
times a day. The weird traps look like this:

instruction pointer     = 0x8:0xe01a4557
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 21 (fsck)
interrupt mask          = bio 
kernel: type 28 trap, code=0
Stopped at      _ahc_scsi_cmd+0x3ff:    repe outsl      (%esi),%dx
db> tr
_ahc_scsi_cmd(e8be6500,eeb4d950,5399c,2,dfbffd94) at _ahc_scsi_cmd+0x3ff
_scsi_scsi_cmd(e8b31980,dfbffd88,a,f034c000,400,4,2710,eeb4d950,400) at _scsi_scsi_cmd+0x164
_sdstart(14,0,e8b32e14,eeb4d950) at _sdstart+0xf3
_sd_strategy(eeb4d950,e8b31980,eeb4d950,eeb4d950,dfbfff1c) at _sd_strategy+0x7b
_scsi_strategy(eeb4d950,e01ad080,dfbffe24,e010afea,eeb4d950) at _scsi_strategy+0x84
_sdstrategy(eeb4d950,eeb4d950) at _sdstrategy+0x10
_physio(e016811c,0,da0,1,e010b100) at _physio+0x1ca
_rawread(da0,dfbfff1c,0,e8be6a00,e8b31580) at _rawread+0x2f
_spec_read(dfbffed0,dfbffeec,e012a2ae,dfbffed0,dfbfd3c4) at _spec_read+0x80
_ufsspec_read(dfbffed0,dfbfd3c4,400,dfbfff94,e8be6a00) at _ufsspec_read+0x21
_vn_read(e8be8e00,dfbfff1c,e8b31580,dfbfd3c4,e01a9ce8) at _vn_read+0x86
_read(e8bedc00,dfbfff94,dfbfff8c,a733800,0) at _read+0xa7
_syscall(dfbf0027,f3e0027,6c7000,0,dfbfd400) at _syscall+0x147
_Xsyscall() at _Xsyscall+0x2b
--- syscall 3, eip = 0x255a5, ebp = 0xdfbfd400 ---

   Note that "trap 28" simply indicates that the trap is not within the 0-18
range of supported traps (28 is the internal trap number for T_RESERVED, which
all unhandled traps translate to).
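
   As a rough illustration of that mapping (a sketch only, not the actual
kernel code): the value 28 for T_RESERVED and the 0-18 range come from the
note above, while LAST_HANDLED_VECTOR, vector_is_handled(),
fallback_trap_type() and the -1 return value are hypothetical names made up
for this example.

```c
#include <stdbool.h>

/* From the note above: the internal number reported for any trap that
 * has no handler of its own. */
#define T_RESERVED 28

/* Assumption: vectors 0 through 18 are the ones with dedicated handlers,
 * per the "0-18" range mentioned above. The macro name is hypothetical. */
#define LAST_HANDLED_VECTOR 18

/* True if the vector falls inside the supported range. */
bool vector_is_handled(int vector)
{
    return vector >= 0 && vector <= LAST_HANDLED_VECTOR;
}

/* Returns T_RESERVED for any vector outside the supported range.
 * Returns -1 (a sentinel invented for this sketch) for vectors inside
 * the range, meaning "a dedicated handler decides the type". */
int fallback_trap_type(int vector)
{
    return vector_is_handled(vector) ? -1 : T_RESERVED;
}
```

   The practical consequence is that "type 28" in the output above does not
say which vector actually fired, only that it was one of the unsupported
ones.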

-DG

David Greenman
Core-team/Principal Architect, The FreeBSD Project