From owner-freebsd-current  Sun Apr 28 21: 3:48 2002
Delivered-To: freebsd-current@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP
	id AFBF937B41D; Sun, 28 Apr 2002 21:03:36 -0700 (PDT)
Received: from fledge.watson.org (fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.6) with SMTP id g3T43Ow29942;
	Mon, 29 Apr 2002 00:03:24 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Date: Mon, 29 Apr 2002 00:03:23 -0400 (EDT)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: current@FreeBSD.org
Cc: jeff@FreeBSD.org
Subject: Re: page fault in _mtx_lock_flags
In-Reply-To: <Pine.NEB.3.96L.1020428171658.64976Q-100000@fledge.watson.org>
Message-ID: <Pine.NEB.3.96L.1020428235948.64976V-200000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-260287369-1020053003=:64976"
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-current.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-current>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-current>
X-Loop: FreeBSD.ORG

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.
  Send mail to mime@docserver.cac.washington.edu for more info.

--0-260287369-1020053003=:64976
Content-Type: TEXT/PLAIN; charset=US-ASCII


If I apply the attached diff to kern_malloc.c, backing out a portion of
kern_malloc.c:1.99, the rate of panics plummets.  Previously, I could
have a box panic within five minutes of getting the crash boxes spinning.
Now I've been going for about 40 minutes without any perceived failures
(i.e., no panics).  I have no idea why this fixes the problem, but David
Wolfskill pointed me at that particular revision as being a source of
related problems for him.  I'm going to leave the boxes running overnight
and see what I bump into.  It would be nice to know whether this is
masking the problem or fixing it, and in either case, why.
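
For anyone reading without a MIME decoder: the attached diff raises
KMEM_ZMAX from 8192 to 65536 and adds 32768- and 65536-byte entries to
the zone table.  A minimal userland sketch of what that changes follows.
It assumes that a request no larger than KMEM_ZMAX is rounded up to a
multiple of KMEM_ZBASE and served from the smallest fixed-size zone that
fits, and that anything larger takes a separate large-allocation path.
The table entries below 2048, the linear lookup, and the names zone_for,
zones_small and zones_big are illustrative assumptions, not the kernel's
code.

```c
#include <stddef.h>

#define KMEM_ZBASE	16
#define KMEM_ZMASK	(KMEM_ZBASE - 1)

/* Zone sizes without the diff (entries below 2048 are assumed). */
static const size_t zones_small[] = {
	16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 0
};

/* Zone sizes with the diff applied: two larger entries appended. */
static const size_t zones_big[] = {
	16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 32768, 65536, 0
};

/*
 * Hypothetical lookup: return the size of the zone that would serve a
 * request, or 0 if the request exceeds zmax and would therefore take
 * the large-allocation path instead.
 */
static size_t
zone_for(const size_t *zones, size_t zmax, size_t size)
{
	size_t i;

	if (size > zmax)
		return (0);

	/* Round the request up to a multiple of KMEM_ZBASE. */
	if (size & KMEM_ZMASK)
		size = (size & ~(size_t)KMEM_ZMASK) + KMEM_ZBASE;

	/* Pick the first (smallest) zone large enough to hold it. */
	for (i = 0; zones[i] != 0; i++)
		if (size <= zones[i])
			return (zones[i]);
	return (0);
}
```

Under these assumptions, zone_for(zones_small, 8192, 12000) returns 0
(large path), while zone_for(zones_big, 65536, 12000) returns 32768, so
the diff would move mid-sized allocations back into fixed-size zones.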

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

On Sun, 28 Apr 2002, Robert Watson wrote:

> I also get an almost identical fault on crash1 involving mdconfig as
> opposed to sh:
> 
> ray irq 10
> NFS ROOT: 192.168.50.1:/cboss/devel/nfsroot/crash1.cboss.tislabs.com
> 8.50.10 BroadcasP-Address 192.16
> t 192.168.50.255
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 1; lapic.id = 01000000
> fault virtual address   = 0x6b73697c
> fault code              = supervisor write, page not present
> instruction pointer     = 0x8:0xc02449b6
> stack pointer           = 0x10:0xc93d8a14
> frame pointer           = 0x10:0xc93d8a20
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 44 (mdconfig)
> kernel: type 12 trap, code=0
> Stopped at      _mtx_lock_flags+0x42:   lock cmpxchgl   %ecx,0x18(%ebx)
> db> trace
> _mtx_lock_flags(6b736964,0,c03cb862,e3) at _mtx_lock_flags+0x42
> lockmgr(c93a8228,1000001,0,c8f27100) at lockmgr+0x42
> vfs_busy(c93a8200,0,0,c8f27100) at vfs_busy+0x58
> lookup(c93d8c28,0,c93b9c34,c93d8d20,c8f27100) at lookup+0x3a2
> namei(c93d8c28,0,c93b9c34,c93d8d20,0) at namei+0x1c8
> vn_open_cred(c93d8c28,c93d8bf4,0,c3f80c80,c93d8ce8) at vn_open_cred+0x23b
> vn_open(c93d8c28,c93d8bf4,0,c8f271dc,c8f27000) at vn_open+0x18
> open(c8f27100,c93d8d20,0,0,0) at open+0x158
> syscall(2f,2f,2f,0,0) at syscall+0x223
> syscall_with_err_pushed() at syscall_with_err_pushed+0x1b
> --- syscall (5, FreeBSD ELF, open), eip = 0x804950b, esp = 0xbfbffd14, ebp = 0xbfbffd50 ---
> db> Context switches not allowed in the debugger.
> db> 
> 
> Still not clear what the origin of this is -- possibly memory corruption
> of the mutex..?
> 
> 
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
> robert@fledge.watson.org      NAI Labs, Safeport Network Services
> 
> On Sun, 28 Apr 2002, Robert Watson wrote:
> 
> > 
> > As usual, GENERIC -CURRENT head from last night, from the main tree. 
> > Dual-proc SMP box netbooted using PXE.  System usually boots, does a
> > buildkernel -j 8 over NFS, then reboots and repeats.  This time it didn't. 
> > 
> > I actually have two boxes doing this, which does seem to double the rate
> > of panics I get.
> > 
> > APIC_IO: Testing 8254 interrupt delivery
> > APIC_IO: Broken MP table detected: 8254 is not connected to IOAPIC #0 intpin 2
> > APIC_IO: routing 8254 via 8259 and IOAPIC #0 intpin 0
> > ad0: 19458MB <ST320420A> [39535/16/63] at ata0-master UDMA33
> > acd0: CDROM <MATSHITA CR-176> at ata1-master PIO4
> > doSuMnPt:i nAgP  rCoPoUt  #f1r oLma unnfcsh:etsray irq 10
> > NFS ROOT: 192.168.50.1:/cboss/devel/nfsroot/crash1.cboss.tislabs.com
> > 
> > 
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 0; lapic.id = 00000000
> > fault virtual address   = 0x7974748b
> > fault code              = supervisor write, page not present
> > instruction pointer     = 0x8:0xc02449b6
> > stack pointer           = 0x10:0xc93dea14
> > frame pointer           = 0x10:0xc93dea20
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, def32 1, gran 1
> > processor eflags        = interrupt enabled, resume, IOPL = 0
> > current process         = 41 (sh)
> > kernel: type 12 trap, code=0
> > Stopped at      _mtx_lock_flags+0x42:   lock cmpxchgl   %ecx,0x18(%ebx)
> > db> trace
> > _mtx_lock_flags(79747473,0,c03cb862,e3) at _mtx_lock_flags+0x42
> > lockmgr(c93a8228,1000001,0,c8f27100) at lockmgr+0x42
> > vfs_busy(c93a8200,0,0,c8f27100) at vfs_busy+0x58
> > lookup(c93dec28,1a4,c8f03034,c93ded20,c8f27100) at lookup+0x3a2
> > namei(c93dec28,1a4,c8f03034,c93ded20,0) at namei+0x1c8
> > vn_open_cred(c93dec28,c93debf4,1a4,c3f80c80,c93dece8) at vn_open_cred+0x67
> > vn_open(c93dec28,c93debf4,1a4,c8f271dc,c8f27000) at vn_open+0x18
> > open(c8f27100,c93ded20,8125005,0,0) at open+0x158
> > syscall(2f,2f,2f,0,0) at syscall+0x223
> > syscall_with_err_pushed() at syscall_with_err_pushed+0x1b
> > --- syscall (5, FreeBSD ELF, open), eip = 0x808969b, esp = 0xbfbff8f0, ebp = 0xbfbff91c ---
> > db> 
> > 
> > (kgdb) l *_mtx_lock_flags+0x42
> > 0xc02449b6 is in _mtx_lock_flags (machine/atomic.h:139).
> > 134     static __inline int
> > 135     atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src)
> > 136     {
> > 137             int res = exp;
> > 138
> > 139             __asm __volatile (
> > 140             "       " __XSTRING(MPLOCKED) " "
> > 141             "       cmpxchgl %1,%2 ;        "
> > 142             "       setz    %%al ;          "
> > 143             "       movzbl  %%al,%0 ;       "
> > (gdb) l *lockmgr+0x42
> > 0xc0242376 is in lockmgr (../../../kern/kern_lock.c:228).
> > 223                     pid = LK_KERNPROC;
> > 224             else
> > 225                     pid = td->td_proc->p_pid;
> > 226
> > 227             mtx_lock(lkp->lk_interlock);
> > 228             if (flags & LK_INTERLOCK) {
> > 229                     mtx_assert(interlkp, MA_OWNED | MA_NOTRECURSED);
> > 230                     mtx_unlock(interlkp);
> > 231             }
> > 232
> > 
> > Attempts to get into serial gdb failed:
> > 
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 1; lapic.id = 01000000
> > fault virtual address   = 0x6aa
> > fault code              = supervisor read, page not present
> > instruction pointer     = 0x8:0xc93debf4
> > stack pointer           = 0x10:0xc93debd4
> > frame pointer           = 0x10:0xc93dec28
> > tokdke nselg trnatp             1=2 waith  0ixn0terlruptts  0dxisfablfed
> > cpan ic: bblo   ck      a=b leP Lsle,epp rlosc k1 ,(sdleefep2  m1ut egx)a
> > pro
> > ssroclescsor  e../a.g./ .=. /ii38e6/iu386 /etnraapl.cd:,7 11e
> > pcmeu, I O=P L0 ;=  l0
> > ccu.rrde =t 0p00o0000s0
> > "Deb1u g(gsehr)(
> > $T0b08:f4eb3dc9;05:28ec3dc9;04:d4eb3dc9;#01~
> > 
> > I'm guessing that I'm dealing with an SMP/locking issue there, but
> > unfortunately I didn't get much further:
> > 
> > (kgdb) target remote /dev/cuaa0
> > Remote debugging using /dev/cuaa0
> > 0xc93debf4 in ?? ()
> > (kgdb) bt
> > #0  0xc93debf4 in ?? ()
> > #1  0x0 in ?? ()
> > 
> > Normally getting into serial gdb works OK; perhaps there's an interaction
> > with the mutex code.
> > 
> > Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
> > robert@fledge.watson.org      NAI Labs, Safeport Network Services
> > 
> > 
> > To Unsubscribe: send mail to majordomo@FreeBSD.org
> > with "unsubscribe freebsd-current" in the body of the message
> > 
> 
> 
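
On the memory-corruption theory in the quoted text above: in both traces
the first argument to _mtx_lock_flags (0x6b736964 and 0x79747473) is made
up entirely of printable ASCII bytes when read in little-endian order,
and the fault address is that value plus 0x18, matching the 0x18(%ebx)
operand of the faulting cmpxchgl.  A minimal userland sketch of that
check follows; the helper names ptr_to_ascii and fault_address are made
up for illustration and nothing here is taken from the kernel source.

```c
#include <ctype.h>
#include <stdint.h>

/* Displacement from "lock cmpxchgl %ecx,0x18(%ebx)" in the traces. */
#define MTX_LOCK_DISP	0x18u

/*
 * Hypothetical helper: render a 32-bit value as the four bytes it would
 * occupy in memory on a little-endian machine such as i386.  Bytes that
 * are not printable ASCII are shown as '.'.
 */
static void
ptr_to_ascii(uint32_t value, char out[5])
{
	int i;

	for (i = 0; i < 4; i++) {
		unsigned char byte = (value >> (8 * i)) & 0xff;

		out[i] = isprint(byte) ? (char)byte : '.';
	}
	out[4] = '\0';
}

/* Hypothetical helper: address touched when locking a mutex at mtx. */
static uint32_t
fault_address(uint32_t mtx)
{
	return (mtx + MTX_LOCK_DISP);
}
```

Feeding in the two values from the traces gives "disk" and "stty", with
fault addresses 0x6b73697c and 0x7974748b, the same addresses the two
trap messages report.  That is consistent with string data having been
written over the lock's lk_interlock pointer, though it does not say
what wrote it.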

--0-260287369-1020053003=:64976
Content-Type: TEXT/PLAIN; charset=US-ASCII; name="kern_malloc.c.diff"
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.NEB.3.96L.1020429000323.64976W@fledge.watson.org>
Content-Description: kern_malloc.c.diff

SW5kZXg6IGtlcm5fbWFsbG9jLmMNCj09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0N
ClJDUyBmaWxlOiAvaG9tZS9uY3ZzL3NyYy9zeXMva2Vybi9rZXJuX21hbGxv
Yy5jLHYNCnJldHJpZXZpbmcgcmV2aXNpb24gMS4xMDANCmRpZmYgLXUgLXIx
LjEwMCBrZXJuX21hbGxvYy5jDQotLS0ga2Vybl9tYWxsb2MuYwkyMyBBcHIg
MjAwMiAxODo1MDoyNSAtMDAwMAkxLjEwMA0KKysrIGtlcm5fbWFsbG9jLmMJ
MjkgQXByIDIwMDIgMDM6NTg6MDkgLTAwMDANCkBAIC05MCw3ICs5MCw3IEBA
DQogI2RlZmluZSBLTUVNX1pCQVNFCTE2DQogI2RlZmluZSBLTUVNX1pNQVNL
CShLTUVNX1pCQVNFIC0gMSkNCiANCi0jZGVmaW5lIEtNRU1fWk1BWAk4MTky
DQorI2RlZmluZSBLTUVNX1pNQVgJNjU1MzYNCiAjZGVmaW5lIEtNRU1fWlNJ
WkUJKEtNRU1fWk1BWCA+PiBLTUVNX1pTSElGVCkNCiBzdGF0aWMgdV9pbnQ4
X3Qga21lbXNpemVbS01FTV9aU0laRSArIDFdOw0KIA0KQEAgLTExMCw2ICsx
MTAsOCBAQA0KIAl7MjA0OCwgIjIwNDgiLCBOVUxMfSwNCiAJezQwOTYsICI0
MDk2IiwgTlVMTH0sDQogCXs4MTkyLCAiODE5MiIsIE5VTEx9LA0KKwl7MzI3
NjgsICIzMjc2OCIsIE5VTEx9LA0KKwl7NjU1MzYsICI2NTUzNiIsIE5VTEx9
LA0KIAl7MCwgTlVMTH0sDQogfTsNCiANCg==
--0-260287369-1020053003=:64976--
