Date:      Tue, 24 Jul 2001 16:21:07 -0700 (PDT)
From:      John Baldwin <jhb@FreeBSD.org>
To:        Julian Elischer <julian@elischer.org>
Cc:        freebsd-current@FreeBSD.org
Subject:   RE: This look familiar to anyone? (bug in 4.11 maybe)
Message-ID:  <XFMail.010724162107.jhb@FreeBSD.org>
In-Reply-To: <Pine.BSF.4.21.0107241231260.19434-100000@InterJet.elischer.org>

On 24-Jul-01 Julian Elischer wrote:
> 
> In a VFS operation, %ecx gets corrupted (maybe from an interrupt?)
> between the instruction where it's loaded with a constant,
> and the instruction where it's used...  It's always the same instruction,
> though often in DIFFERENT VFS operations (fsync, bwrite so far)
> 
> the trap frame  usually looks like:
> 
>#4  0xc0251813 in trap (frame={tf_fs = 0x10, tf_es = 0x10, tf_ds = 0x10,
> tf_edi = 0x0, tf_esi = 0x1, tf_ebp = 0xc954de84, 
>       tf_isp = 0xc954de50, tf_ebx = 0xc27d6d80, tf_edx = 0xc1344600,
> tf_ecx = 0xc96145b2, tf_eax = 0xc954de78, tf_trapno = 0xc, 
>       tf_err = 0x0, tf_eip = 0xc01846d9, tf_cs = 0x8, tf_eflags = 0x10286,
> tf_esp = 0xc954de78, tf_ss = 0xc27d6d80})
>     at /usr/src/sys/i386/i386/trap.c:443
>#5  0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
>#6  0xc0189be2 in vop_stdbwrite (ap=0xc954deb4) at
> /usr/src/sys/kern/vfs_default.c:319
> 
> 
> the code there looks like:
> 
> (kgdb) up 5
>#5  0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
> 923           rc = VCALL(vp, VOFFSET(vop_strategy), &a);
> (kgdb) list
> 918           struct vop_strategy_args a;
> 919           int rc;
> 920           a.a_desc = VDESC(vop_strategy);
> 921           a.a_vp = vp;
> 922           a.a_bp = bp;
> 923           rc = VCALL(vp, VOFFSET(vop_strategy), &a); <-------here
> 924           return (rc);
> 925   }
> 926   struct vop_print_args {
> 927           struct vnodeop_desc *a_desc;
> 
> In Assembler:
> 
> 0xc01846cc <bwrite+460>:      mov    0xc029dcc0,%ecx
> 0xc01846d2 <bwrite+466>:      mov    0x18(%eax),%edx
> 0xc01846d5 <bwrite+469>:      lea    0xfffffff4(%ebp),%eax
> 0xc01846d8 <bwrite+472>:      push   %eax
> 0xc01846d9 <bwrite+473>:      mov    (%edx,%ecx,4),%eax <<<<< **POW**
> 0xc01846dc <bwrite+476>:      call   *%eax
> 0xc01846de <bwrite+478>:      add    $0x4,%esp
> 0xc01846e1 <bwrite+481>:      mov    0xfffffff0(%ebp),%eax
> 
> looking at the regs,
> dx = 0xc1344600,
> cx = 0xc96145b2,
> and 
> C1344600+(4*C96145B2) = 3E6B95CC8
> the lower 32 bits of which is the same as the fault address
> 
> but in the  code above we see that %cx was just loaded from 
> location 0xc029dcc0 which contains:
> (kgdb) x/x 0xc029dcc0     
> 0xc029dcc0 <vop_strategy_desc>:       0x12
> 
> 0x12 is the correct offset for a strategy call.
> 
> so cx got corrupted between the instruction at 0xc01846cc
> and that at 0xc01846d9.

Very weird.  Note that traps and interrupts save %ecx in the trapframe and
restore it on return, so they won't corrupt it unless we somehow clobber
%ecx after popping the frame (or before pushing it).
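
As a sanity check on the arithmetic in the quoted text, here is a minimal C
sketch of how the i386 forms the address for "mov (%edx,%ecx,4),%eax": base
plus index times four, truncated to 32 bits.  The function name
effective_address is made up for illustration and is not anything in the
kernel.

```c
#include <stdint.h>

/*
 * Effective address of "mov (%edx,%ecx,4),%eax" on i386:
 * base + index * 4, truncated to 32 bits.
 * Unsigned 32-bit arithmetic wraps modulo 2^32, which models
 * the truncation the CPU performs.
 * (effective_address is a hypothetical helper, not kernel code.)
 */
static uint32_t
effective_address(uint32_t base, uint32_t index)
{
	return (uint32_t)(base + index * 4u);
}
```

With the trap frame values (base 0xc1344600, index 0xc96145b2) this gives
0xe6b95cc8, the low 32 bits of 3E6B95CC8.  With the intended index 0x12 it
would have given 0xc1344648.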

> Note that the contents of cx (0xc96145b2) is an address
> somewhat higher than the kernel stack at the time in question.

Could be the stack of some other thread.  All the 0xc9XXXXXX addresses are
pointers to automatic variables.  The 0xc0[2-4]XXXXX are return addresses.
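
A rough way to apply that rule of thumb to a dump is sketched below.  The
ranges come only from the two address patterns mentioned above and are
assumptions about this particular kernel's layout; the names word_kind and
classify_word are hypothetical.

```c
#include <stdint.h>

/* Heuristic labels for a 32-bit word found in a memory dump. */
enum word_kind {
	WORD_OTHER,	/* data, counters, anything else */
	WORD_STACK_PTR,	/* 0xc9...... : likely pointer to an automatic variable */
	WORD_TEXT_ADDR	/* 0xc02..... to 0xc04..... : likely return address */
};

/*
 * Classify a word using only the two address patterns from the
 * message.  This is a guess based on value ranges, not a proof of
 * what the word actually is.
 */
static enum word_kind
classify_word(uint32_t w)
{
	if ((w >> 24) == 0xc9)
		return WORD_STACK_PTR;
	if (w >= 0xc0200000u && w <= 0xc04fffffu)
		return WORD_TEXT_ADDR;
	return WORD_OTHER;
}
```

Run over the row at 0xc96145b0 in the dump, this labels 0xc9580660 and
0xc95c7370 as stack pointers and 0xc04d7504 and 0xc04d47d4 as text addresses.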

> a dump of ram in that area shows:
> (kgdb) x/64xw 0xc96145a0
> 0xc96145a0:   0xc954e900      0xc9709c00      0x00000000      0xc96145a8
> 0xc96145b0:    [0xc9580660]   0xc95c7370      0xc04d7504      0xc04d47d4
> 0xc96145c0:   0x0000aa26      0x00000020      0x00000000      0x00000000
> 0xc96145d0:   0xfc812c38      0x00000002      0x00040010      0x00000020
> 0xc96145e0:   0x00000000      0x00000000      0x00000000      0x00000000
> 0xc96145f0:   0x00000000      0xc9636a40      0x0001fc93      0x00000000
> 0xc9614600:   0xc02ed7c0      0xc95b4120      0x00000000      0xc9614608
> 0xc9614610:   0x00000000      0xc9555548      0x00000000      0xc9614618
> 0xc9614620:   0x00003f5b      0x00000003      0x00000000      0x00000000
> 0xc9614630:   0xfe37c115      0x21880000      0x0000000e      0x00000000
> 0xc9614640:   0x00000000      0x00000000      0x00000000      0x00000000
> 0xc9614650:   0x00000000      0x00000000      0x00000000      0x00000000
> 0xc9614660:   0xc9722ae0      0xc961c600      0x00000000      0xc9614668
> 0xc9614670:   0xc9690660      0xc97091f0      0x00000000      0xc9614678
> 0xc9614680:   0x0000cabf      0x00000012      0x00000000      0x00000000
> 0xc9614690:   0xfc8189f2      0x00000002      0x0000001d      0x00000000
> 
> This is obviously SOMETHING, but what? And why does %cx point HALFWAY
> THROUGH an obvious 32-bit pointer?
> 
> Thoughts of hardware problems do come to mind... but..

Is it just one machine that does this reliably?
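
On the "half way through a pointer" observation quoted above: assuming
4-byte-aligned words as in the dump, the low two bits of an address give its
byte offset within the word.  A small sketch, with hypothetical helper names:

```c
#include <stdint.h>

/* Byte offset (0..3) of addr within its enclosing 4-byte-aligned word. */
static unsigned
offset_in_word(uint32_t addr)
{
	return addr & 3u;
}

/* Address of the 4-byte-aligned word that contains addr. */
static uint32_t
word_base(uint32_t addr)
{
	return addr & ~(uint32_t)3;
}
```

For 0xc96145b2 the offset is 2 and the enclosing word starts at 0xc96145b0,
which is the bracketed entry in the dump.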

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/
