From owner-freebsd-current Tue Sep 17 20:15:35 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id UAA07116 for current-outgoing; Tue, 17 Sep 1996 20:15:35 -0700 (PDT)
Received: from root.com (implode.root.com [198.145.90.17]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id UAA07083; Tue, 17 Sep 1996 20:15:25 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1]) by root.com (8.7.5/8.6.5) with SMTP id UAA11084; Tue, 17 Sep 1996 20:16:07 -0700 (PDT)
Message-Id: <199609180316.UAA11084@root.com>
X-Authentication-Warning: implode.root.com: Host localhost [127.0.0.1] didn't use HELO protocol
To: asami@FreeBSD.org (Satoshi Asami)
cc: current@FreeBSD.org, haertel@ichips.intel.com, erich@uruk.org
Subject: Re: RAM parity error
In-reply-to: Your message of "Tue, 17 Sep 1996 18:37:45 PDT." <199609180137.SAA09571@silvia.HIP.Berkeley.EDU>
From: David Greenman
Reply-To: dg@root.com
Date: Tue, 17 Sep 1996 20:16:07 -0700
Sender: owner-current@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

["parity" errors on P6 machines during heavy I/O]

>Is there any reason why the above would happen when it is NOT the
>hardware that's broken? I've seen it on a couple of P6 boxes around
>here, with or without ccd, when I try to push a lot of stuff through
>the SCSI system (like parallel iozone's on multiple non-ccd
>filesystems).

   *Very* interesting...

>#4 0xf01b4b61 in calltrap ()
>#5 0xf0194652 in scsi_scsi_cmd ()
>#6 0xf0198269 in sdstart ()

   I'll bet that the real traceback has a "#4.5" that is ahc_scsi_cmd().
gdb often doesn't decode the traceback correctly since it doesn't deal
with trapframes correctly.
   I'm seeing *exactly* the same behavior on wcarchive (B0 Orion,
Stepping 1 of the P6). During heavy disk I/O, I occasionally see "RAM
parity errors" during the outsl instruction. I've additionally seen
*weird* traps - reserved traps that are outside the 0-18 range. These
also happen during this _same_ outsl instruction.
I believe that whatever is causing this is also the cause of the machine
hangs that I'm seeing, sometimes multiple times a day. The weird traps
look like this:

instruction pointer     = 0x8:0xe01a4557
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 21 (fsck)
interrupt mask          = bio
kernel: type 28 trap, code=0
Stopped at      _ahc_scsi_cmd+0x3ff:    repe outsl      (%esi),%dx
db> tr
_ahc_scsi_cmd(e8be6500,eeb4d950,5399c,2,dfbffd94) at _ahc_scsi_cmd+0x3ff
_scsi_scsi_cmd(e8b31980,dfbffd88,a,f034c000,400,4,2710,eeb4d950,400) at _scsi_scsi_cmd+0x164
_sdstart(14,0,e8b32e14,eeb4d950) at _sdstart+0xf3
_sd_strategy(eeb4d950,e8b31980,eeb4d950,eeb4d950,dfbfff1c) at _sd_strategy+0x7b
_scsi_strategy(eeb4d950,e01ad080,dfbffe24,e010afea,eeb4d950) at _scsi_strategy+0x84
_sdstrategy(eeb4d950,eeb4d950) at _sdstrategy+0x10
_physio(e016811c,0,da0,1,e010b100) at _physio+0x1ca
_rawread(da0,dfbfff1c,0,e8be6a00,e8b31580) at _rawread+0x2f
_spec_read(dfbffed0,dfbffeec,e012a2ae,dfbffed0,dfbfd3c4) at _spec_read+0x80
_ufsspec_read(dfbffed0,dfbfd3c4,400,dfbfff94,e8be6a00) at _ufsspec_read+0x21
_vn_read(e8be8e00,dfbfff1c,e8b31580,dfbfd3c4,e01a9ce8) at _vn_read+0x86
_read(e8bedc00,dfbfff94,dfbfff8c,a733800,0) at _read+0xa7
_syscall(dfbf0027,f3e0027,6c7000,0,dfbfd400) at _syscall+0x147
_Xsyscall() at _Xsyscall+0x2b
--- syscall 3, eip = 0x255a5, ebp = 0xdfbfd400 ---

   Note that "trap 28" simply indicates that the trap is not within the
0-18 range of supported traps (28 is the internal trap number for
T_RESERVED, which all unhandled traps translate to).

-DG

David Greenman
Core-team/Principal Architect, The FreeBSD Project
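
The note about "trap 28" can be illustrated with a rough sketch in C. This is not the kernel's code: the helper names, the `handled_type` parameter, and the `LAST_HANDLED_VECTOR` constant are made up for illustration, and the only facts taken from the message are that traps 0-18 are supported and that everything else is reported as T_RESERVED (28).

```c
#include <assert.h>
#include <stdio.h>

/* Assumptions taken from the message, not from kernel headers:
 * vectors 0-18 are handled individually; anything else is reported
 * under a single catch-all type numbered 28. */
#define LAST_HANDLED_VECTOR 18u
#define T_RESERVED          28

/* Hypothetical helper: the type number the trap message would report.
 * `handled_type` stands in for whatever code a supported vector maps to;
 * that mapping is deliberately not modeled here. */
static int reported_trap_type(unsigned vector, int handled_type)
{
    return (vector <= LAST_HANDLED_VECTOR) ? handled_type : T_RESERVED;
}

/* Hypothetical helper: build a console line shaped like the one in the
 * dump ("kernel: type 28 trap, code=0"). Returns snprintf's result. */
static int format_trap_line(char *buf, size_t len, unsigned vector,
                            int handled_type, int code)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "kernel: type %d trap, code=%d",
                    reported_trap_type(vector, handled_type), code);
}
```

Under these assumptions, the "28" in the console output carries no information about which vector actually fired; any vector above 18 would produce the same line.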