From: Julian Elischer
To: freebsd-current@freebsd.org
Date: Tue, 24 Jul 2001 13:01:34 -0700 (PDT)
Subject: This look familiar to anyone? (bug in 4.11 maybe)

I know this is not a -current problem, but if it was fixed by someone, they are likely to be reading here and not in -stable.

We have a hybrid (4.11+patches) kernel that sometimes crashes. The crash always has the same symptoms, and I'm hoping that they look familiar to someone. The message is below, followed by analysis.

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0xe6b95cc8
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc01846d9
stack pointer = 0x10:0xc954de64
frame pointer = 0x10:0xc954de84
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 10326 (qftListener)
interrupt mask = none
trap number = 12

In a VFS operation, %ecx gets corrupted (maybe by an interrupt?) between the instruction where it is loaded with a constant and the instruction where it is used.
It's always the same instruction, though often in DIFFERENT VFS operations (fsync and bwrite so far).

The trap frame usually looks like:

#4 0xc0251813 in trap (frame={tf_fs = 0x10, tf_es = 0x10, tf_ds = 0x10,
tf_edi = 0x0, tf_esi = 0x1, tf_ebp = 0xc954de84, tf_isp = 0xc954de50,
tf_ebx = 0xc27d6d80, tf_edx = 0xc1344600, tf_ecx = 0xc96145b2,
tf_eax = 0xc954de78, tf_trapno = 0xc, tf_err = 0x0, tf_eip = 0xc01846d9,
tf_cs = 0x8, tf_eflags = 0x10286, tf_esp = 0xc954de78, tf_ss = 0xc27d6d80})
at /usr/src/sys/i386/i386/trap.c:443
#5 0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
#6 0xc0189be2 in vop_stdbwrite (ap=0xc954deb4) at /usr/src/sys/kern/vfs_default.c:319

The code there looks like:

(kgdb) up 5
#5 0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
923 rc = VCALL(vp, VOFFSET(vop_strategy), &a);
(kgdb) list
918 struct vop_strategy_args a;
919 int rc;
920 a.a_desc = VDESC(vop_strategy);
921 a.a_vp = vp;
922 a.a_bp = bp;
923 rc = VCALL(vp, VOFFSET(vop_strategy), &a); <-------here
924 return (rc);
925 }
926 struct vop_print_args {
927 struct vnodeop_desc *a_desc;

In assembler:

0xc01846cc: mov 0xc029dcc0,%ecx
0xc01846d2: mov 0x18(%eax),%edx
0xc01846d5: lea 0xfffffff4(%ebp),%eax
0xc01846d8: push %eax
0xc01846d9: mov (%edx,%ecx,4),%eax <<<<< **POW**
0xc01846dc: call *%eax
0xc01846de: add $0x4,%esp
0xc01846e1: mov 0xfffffff0(%ebp),%eax

Looking at the registers, %edx = 0xc1344600 and %ecx = 0xc96145b2, and

0xC1344600 + (4 * 0xC96145B2) = 0x3E6B95CC8

the lower 32 bits of which (0xE6B95CC8) are the same as the fault address. But in the code above we see that %ecx was just loaded from location 0xc029dcc0, which contains:

(kgdb) x/x 0xc029dcc0
0xc029dcc0: 0x12

0x12 is the correct offset for a strategy call, so %ecx got corrupted between the instruction at 0xc01846cc and the one at 0xc01846d9.

Note that the contents of %ecx (0xc96145b2) is an address somewhat higher than the kernel stack at the time in question.
A dump of RAM in that area shows:

(kgdb) x/64xw 0xc96145a0
0xc96145a0: 0xc954e900 0xc9709c00 0x00000000 0xc96145a8
0xc96145b0: [0xc9580660] 0xc95c7370 0xc04d7504 0xc04d47d4
0xc96145c0: 0x0000aa26 0x00000020 0x00000000 0x00000000
0xc96145d0: 0xfc812c38 0x00000002 0x00040010 0x00000020
0xc96145e0: 0x00000000 0x00000000 0x00000000 0x00000000
0xc96145f0: 0x00000000 0xc9636a40 0x0001fc93 0x00000000
0xc9614600: 0xc02ed7c0 0xc95b4120 0x00000000 0xc9614608
0xc9614610: 0x00000000 0xc9555548 0x00000000 0xc9614618
0xc9614620: 0x00003f5b 0x00000003 0x00000000 0x00000000
0xc9614630: 0xfe37c115 0x21880000 0x0000000e 0x00000000
0xc9614640: 0x00000000 0x00000000 0x00000000 0x00000000
0xc9614650: 0x00000000 0x00000000 0x00000000 0x00000000
0xc9614660: 0xc9722ae0 0xc961c600 0x00000000 0xc9614668
0xc9614670: 0xc9690660 0xc97091f0 0x00000000 0xc9614678
0xc9614680: 0x0000cabf 0x00000012 0x00000000 0x00000000
0xc9614690: 0xfc8189f2 0x00000002 0x0000001d 0x00000000

This is obviously SOMETHING, but what? And why does %ecx point HALFWAY THROUGH an obvious 32-bit pointer (the bracketed word at 0xc96145b0)?

Thoughts of hardware problems do come to mind... but..
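
For anyone who wants to re-check the address arithmetic in this report, here is a minimal sketch in C. It is not part of the kernel or of the original analysis. The function names (effective_address, offset_in_word, containing_word, check_reported_values) are hypothetical and exist only for illustration. The register values are the ones quoted from the trap frame. The sketch assumes the usual i386 rule that base + index*4 wraps at 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective address of `mov (%base,%index,4),%eax` on i386.
 * The sum is truncated to 32 bits, as the hardware does.
 */
uint32_t effective_address(uint32_t base, uint32_t index)
{
    return (uint32_t)(base + (uint32_t)(index * 4u));
}

/* Byte offset of an address inside its aligned 32-bit word (0..3). */
uint32_t offset_in_word(uint32_t addr)
{
    return addr & 3u;
}

/* Address of the aligned 32-bit word that contains addr. */
uint32_t containing_word(uint32_t addr)
{
    return addr & ~3u;
}

/* Re-checks the figures quoted in the message; aborts on a mismatch. */
void check_reported_values(void)
{
    const uint32_t edx = 0xc1344600u;      /* tf_edx from the trap frame */
    const uint32_t ecx_bad = 0xc96145b2u;  /* tf_ecx from the trap frame */
    const uint32_t ecx_good = 0x12u;       /* value stored at 0xc029dcc0 */

    /* Untruncated sum, as written out by hand in the message. */
    const uint64_t full = (uint64_t)edx + (uint64_t)ecx_bad * 4u;
    assert(full == 0x3e6b95cc8ULL);

    /* Low 32 bits match the reported fault virtual address. */
    assert(effective_address(edx, ecx_bad) == 0xe6b95cc8u);

    /* With the intended index, the load would stay near the base. */
    assert(effective_address(edx, ecx_good) == 0xc1344648u);

    /* The bad %ecx value lands two bytes into the word at 0xc96145b0. */
    assert(containing_word(ecx_bad) == 0xc96145b0u);
    assert(offset_in_word(ecx_bad) == 2u);
}
```

Calling check_reported_values() from a small test program confirms only that the numbers in the report are mutually consistent. It says nothing about what overwrote %ecx.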