Date:      Tue, 24 Jul 2001 13:01:34 -0700 (PDT)
From:      Julian Elischer <julian@elischer.org>
To:        freebsd-current@freebsd.org
Subject:   This look familiar to anyone? (bug in 4.11 maybe)
Message-ID:  <Pine.BSF.4.21.0107241231260.19434-100000@InterJet.elischer.org>


I know this is not a -current problem, but if someone has fixed it, they
are more likely to be reading here than in -stable.


We have a hybrid (4.11+patches) kernel that sometimes crashes.
The crash always has the same symptoms and I'm hoping that
they look familiar to someone...

The message is below, followed by analysis.

Fatal trap 12: page fault while in kernel mode
fault virtual address	= 0xe6b95cc8
fault code		= supervisor read, page not present
instruction pointer	= 0x8:0xc01846d9
stack pointer	        = 0x10:0xc954de64
frame pointer	        = 0x10:0xc954de84
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, def32 1, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 10326 (qftListener)
interrupt mask		= none
trap number		= 12
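
As a cross-check on the "fault code" line, here is a minimal sketch of how the low three bits of the i386 page-fault error code decode. The macro names, `struct pf_info` and `decode_pf_error` are made up for illustration and are not the kernel's own definitions; the value 0x0 is the tf_err shown in the trap frame further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low three bits of the i386 page-fault error code (illustrative names). */
#define PF_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2u  /* 0: read access,      1: write access */
#define PF_USER  0x4u  /* 0: supervisor mode,  1: user mode */

struct pf_info {       /* hypothetical helper type, not a kernel structure */
    bool present;
    bool write;
    bool user;
};

/* Split an error code into the three flags the trap message prints. */
struct pf_info decode_pf_error(uint32_t err)
{
    struct pf_info info;

    info.present = (err & PF_PROT) != 0;
    info.write = (err & PF_WRITE) != 0;
    info.user = (err & PF_USER) != 0;
    return info;
}

/* tf_err = 0x0 should mean: supervisor, read, page not present. */
void check_trap_frame_err(void)
{
    struct pf_info info = decode_pf_error(0x0);

    assert(!info.present && !info.write && !info.user);
}
```

An error code of 0 therefore agrees with "supervisor read, page not present".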


In a VFS operation, %ecx gets corrupted (maybe by an interrupt?)
between the instruction where it's loaded with a constant
and the instruction where it's used...  It's always the same instruction,
though often in DIFFERENT VFS operations (fsync, bwrite so far)

The trap frame usually looks like:

#4  0xc0251813 in trap (frame={tf_fs = 0x10, tf_es = 0x10, tf_ds = 0x10,
      tf_edi = 0x0, tf_esi = 0x1, tf_ebp = 0xc954de84,
      tf_isp = 0xc954de50, tf_ebx = 0xc27d6d80, tf_edx = 0xc1344600,
      tf_ecx = 0xc96145b2, tf_eax = 0xc954de78, tf_trapno = 0xc,
      tf_err = 0x0, tf_eip = 0xc01846d9, tf_cs = 0x8, tf_eflags = 0x10286,
      tf_esp = 0xc954de78, tf_ss = 0xc27d6d80})
    at /usr/src/sys/i386/i386/trap.c:443
#5  0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
#6  0xc0189be2 in vop_stdbwrite (ap=0xc954deb4)
    at /usr/src/sys/kern/vfs_default.c:319


the code there looks like:

(kgdb) up 5
#5  0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
923		rc = VCALL(vp, VOFFSET(vop_strategy), &a);
(kgdb) list
918		struct vop_strategy_args a;
919		int rc;
920		a.a_desc = VDESC(vop_strategy);
921		a.a_vp = vp;
922		a.a_bp = bp;
923		rc = VCALL(vp, VOFFSET(vop_strategy), &a); <-------here
924		return (rc);
925	}
926	struct vop_print_args {
927		struct vnodeop_desc *a_desc;

In Assembler:

0xc01846cc <bwrite+460>:	mov    0xc029dcc0,%ecx
0xc01846d2 <bwrite+466>:	mov    0x18(%eax),%edx
0xc01846d5 <bwrite+469>:	lea    0xfffffff4(%ebp),%eax
0xc01846d8 <bwrite+472>:	push   %eax
0xc01846d9 <bwrite+473>:	mov    (%edx,%ecx,4),%eax <<<<< **POW**
0xc01846dc <bwrite+476>:	call   *%eax
0xc01846de <bwrite+478>:	add    $0x4,%esp
0xc01846e1 <bwrite+481>:	mov    0xfffffff0(%ebp),%eax

Looking at the registers:
%edx = 0xc1344600,
%ecx = 0xc96145b2,
and
0xC1344600 + (4 * 0xC96145B2) = 0x3E6B95CC8,
the lower 32 bits of which (0xE6B95CC8) match the fault address.
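
The arithmetic above can be re-checked with a small sketch that models the address calculation of the faulting `mov (%edx,%ecx,4),%eax`. The function names are hypothetical; the constants are the register values from the trap frame.

```c
#include <assert.h>
#include <stdint.h>

/* Full-width sum base + index * scale, before any truncation. */
uint64_t wide_address(uint32_t base, uint32_t index, uint32_t scale)
{
    return (uint64_t)base + (uint64_t)index * scale;
}

/* Address actually accessed on i386: the sum wrapped to 32 bits. */
uint32_t effective_address(uint32_t base, uint32_t index, uint32_t scale)
{
    return (uint32_t)wide_address(base, index, scale);
}

/* Re-checks the numbers quoted from the trap frame. */
void check_fault_address(void)
{
    const uint32_t edx = 0xc1344600u;
    const uint32_t ecx = 0xc96145b2u;

    assert(wide_address(edx, ecx, 4) == 0x3e6b95cc8ull);
    assert(effective_address(edx, ecx, 4) == 0xe6b95cc8u);
}
```

With the expected index of 0x12 instead, the same calculation gives 0xc1344648, i.e. a slot just past the start of the table that %edx points to.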

But in the code above we see that %ecx was just loaded from
location 0xc029dcc0, which contains:
(kgdb) x/x 0xc029dcc0     
0xc029dcc0 <vop_strategy_desc>:	0x12

0x12 is the correct offset for a strategy call.

So %ecx got corrupted between the instruction at 0xc01846cc
and the one at 0xc01846d9.
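
For readers not familiar with the VCALL dispatch, here is a rough, self-contained model of what the four instructions appear to be doing: fetch the operation's offset from its descriptor, fetch the vnode's operations vector, index the vector, and call through the result. All types and names (`op_desc`, `node`, `dispatch`, `demo_strategy`) are simplified stand-ins invented for this sketch, not the kernel's definitions, and the comments mapping lines to instructions are my reading of the disassembly.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*vop_fn)(void *args);

struct op_desc { int offset; };   /* stand-in for the descriptor's offset field */
struct node { vop_fn *ops; };     /* stand-in for the vnode's operations vector */

/* Models VCALL(vp, VOFFSET(op), &a): index the table, then call through it. */
int dispatch(const struct node *n, const struct op_desc *d, void *args)
{
    int index = d->offset;      /* like: mov 0xc029dcc0,%ecx     */
    vop_fn *table = n->ops;     /* like: mov 0x18(%eax),%edx     */
    vop_fn fn = table[index];   /* like: mov (%edx,%ecx,4),%eax  */

    assert(fn != NULL);
    return fn(args);            /* like: call *%eax              */
}

/* Toy operation so the dispatch has something to call. */
static int demo_strategy(void *args)
{
    (void)args;
    return 42;
}

/* Builds a toy table with the strategy slot at 0x12 and dispatches through it. */
int demo_dispatch(void)
{
    vop_fn table[0x13] = { NULL };
    struct op_desc strategy_desc = { 0x12 };
    struct node n = { table };

    table[strategy_desc.offset] = demo_strategy;
    return dispatch(&n, &strategy_desc, NULL);
}
```

In this model, a bad `index` at the third step reads far outside the table, which is the kind of access that faulted here.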

Note that the contents of %ecx (0xc96145b2) are an address
somewhat higher than the kernel stack at the time in question.
A dump of RAM in that area shows:
(kgdb) x/64xw 0xc96145a0
0xc96145a0:	0xc954e900	0xc9709c00	0x00000000	0xc96145a8
0xc96145b0:    [0xc9580660]	0xc95c7370	0xc04d7504	0xc04d47d4
0xc96145c0:	0x0000aa26	0x00000020	0x00000000	0x00000000
0xc96145d0:	0xfc812c38	0x00000002	0x00040010	0x00000020
0xc96145e0:	0x00000000	0x00000000	0x00000000	0x00000000
0xc96145f0:	0x00000000	0xc9636a40	0x0001fc93	0x00000000
0xc9614600:	0xc02ed7c0	0xc95b4120	0x00000000	0xc9614608
0xc9614610:	0x00000000	0xc9555548	0x00000000	0xc9614618
0xc9614620:	0x00003f5b	0x00000003	0x00000000	0x00000000
0xc9614630:	0xfe37c115	0x21880000	0x0000000e	0x00000000
0xc9614640:	0x00000000	0x00000000	0x00000000	0x00000000
0xc9614650:	0x00000000	0x00000000	0x00000000	0x00000000
0xc9614660:	0xc9722ae0	0xc961c600	0x00000000	0xc9614668
0xc9614670:	0xc9690660	0xc97091f0	0x00000000	0xc9614678
0xc9614680:	0x0000cabf	0x00000012	0x00000000	0x00000000
0xc9614690:	0xfc8189f2	0x00000002	0x0000001d	0x00000000

This is obviously SOMETHING, but what? And why does %ecx point HALF WAY
THROUGH an obvious 32-bit pointer?
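
The "half way through" observation is just arithmetic on the low address bits, sketched below with hypothetical helper names: 0xc96145b2 sits 2 bytes into the aligned word at 0xc96145b0, which is the bracketed 0xc9580660 in the dump.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of addr within its naturally aligned 32-bit word (0..3). */
uint32_t offset_in_word(uint32_t addr)
{
    return addr & 0x3u;
}

/* Start address of the aligned 32-bit word that contains addr. */
uint32_t containing_word(uint32_t addr)
{
    return addr & ~UINT32_C(0x3);
}

/* Applies both helpers to the corrupted %ecx value from the trap frame. */
void check_ecx_alignment(void)
{
    const uint32_t ecx = 0xc96145b2u;

    assert(offset_in_word(ecx) == 2);
    assert(containing_word(ecx) == 0xc96145b0u);
}
```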

Thoughts of hardware problems do come to mind... but..

