Date:      Sun, 24 Jun 2007 11:13:24 -0400
From:      Adam McDougall <mcdouga9@egr.msu.edu>
To:        stable@freebsd.org, Kai <kai@xs4all.nl>
Subject:   Re: [vfs_bio] Re: Fatal trap 12: page fault while in kernel mode (with potential cause)
Message-ID:  <20070624151324.GF31122@egr.msu.edu>
In-Reply-To: <20070624043020.GC31122@egr.msu.edu>
References:  <20070411105332.GC7847@xs4all.nl> <20070419123329.GA10189@xs4all.nl> <20070423153547.GD20155@xs4all.nl> <20070423155552.GB1006@xor.obsecurity.org> <20070624043020.GC31122@egr.msu.edu>

On Sun, Jun 24, 2007 at 12:30:20AM -0400, Adam McDougall wrote:

  On Mon, Apr 23, 2007 at 11:55:52AM -0400, Kris Kennaway wrote:
  
    On Mon, Apr 23, 2007 at 05:35:47PM +0200, Kai wrote:
    > On Thu, Apr 19, 2007 at 02:33:29PM +0200, Kai wrote:
    > > On Wed, Apr 11, 2007 at 12:53:32PM +0200, Kai wrote:
    > > > 
    > > > Hello all,
    > > > 
    > > > We're running into regular panics on our webserver after upgrading
    > > > from 4.x to 6.2-stable:
    > > 
    > 
    > Hi all,
    > 
    > To continue this story, a colleague wrote a small program in C that launches
    > 40 threads to randomly append and write to 10 files on an NFS mounted
    > filesystem. 
    > 
    > If I keep removing the files on one of the other machines in a while loop,
    > the first system panics:
    > 
    > Fatal trap 12: page fault while in kernel mode
    > cpuid = 1; apic id = 01
    > fault virtual address   = 0x34
    > fault code              = supervisor read, page not present
    > instruction pointer     = 0x20:0xc06bdefa
    > stack pointer           = 0x28:0xeb9f69b8
    > frame pointer           = 0x28:0xeb9f69c4
    > code segment            = base 0x0, limit 0xfffff, type 0x1b
    >                         = DPL 0, pres 1, def32 1, gran 1
    > processor eflags        = interrupt enabled, resume, IOPL = 0
    > current process         = 73626 (nfscrash)
    > trap number             = 12
    > panic: page fault
    > cpuid = 1
    > Uptime: 3h2m14s
    > 
    > Sounds like a nice denial of service problem. I can hand the program to
    > developers on request.
    
    Please send it to me.  Panics are always much easier to get fixed if
    they come with a test case that a developer can use to reproduce them.
    
    Kris
  
  I have been working on this problem all weekend, and at this point I have a strong hunch
  that it is a result of revision 1.424 of sys/kern/vfs_bio.c, which was committed between
  FreeBSD 5.1 and 5.2.  I am currently verifying this hunch on a system that was cvsupped to
  code just before 1.424; it has been running about 7 times longer than the usual time
  required to crash.  I am also attempting to craft a patch for 6.2 that essentially
  backs out the change to see if that works, but if this information can help send a
  FreeBSD developer down the right trail to a proper fix, great.  I will follow up with
  more detailed findings and results tonight or soon.
  
  links:
  http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.423;r2=1.424
  related to 1.424:
  http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.420&r2=1.421
  
  Commit emails:
  http://docs.freebsd.org/cgi/mid.cgi?200311150845.hAF8jawU027349
  http://docs.freebsd.org/cgi/mid.cgi?200311110445.hAB4jbYw093253

If I turn on INVARIANTS, I get the following panic instead. It occurs much more
quickly, and it happens at least as far back as 5.0-RELEASE:

panic: bundirty: buffer 0xffffffff8e2e95f8 still on queue 1
cpuid = 1
Uptime: 35s
Dumping 511 MB (2 chunks)
  chunk 0: 1MB (153 pages) ... ok
  chunk 1: 511MB (130816 pages) 496 480 464 448 432 416 400 384 368 352 336 320 304 288 272 256 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16

#0  doadump () at pcpu.h:172
172     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:172
#1  0xffffffff8028d699 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2  0xffffffff8028d12b in panic (fmt=0xffffffff80443458 "bundirty: buffer %p still on queue %d")
    at /usr/src/sys/kern/kern_shutdown.c:565
#3  0xffffffff802e1e78 in bundirty (bp=0xffffffff8e2e95f8) at /usr/src/sys/kern/vfs_bio.c:1055
#4  0xffffffff802e3eb1 in brelse (bp=0xffffffff8e2e95f8) at /usr/src/sys/kern/vfs_bio.c:1370
#5  0xffffffff803550e8 in nfs_writebp (bp=0xffffffff8e2e95f8, force=0, td=0x0)
    at /usr/src/sys/nfsclient/nfs_vnops.c:3005
#6  0xffffffff802e5197 in getblk (vp=0xffffff000c23e5d0, blkno=0, size=14400, slpflag=256, slptimeo=0, flags=0)
    at buf.h:412
#7  0xffffffff80344f13 in nfs_getcacheblk (vp=0xffffff000c23e5d0, bn=0, size=14400, td=0xffffff0015b274c0)
    at /usr/src/sys/nfsclient/nfs_bio.c:1252
#8  0xffffffff8034616c in nfs_write (ap=0x0) at /usr/src/sys/nfsclient/nfs_bio.c:1068
#9  0xffffffff80405ee4 in VOP_WRITE_APV (vop=0xffffffff805a0260, a=0xffffffff976bfa10) at vnode_if.c:698
#10 0xffffffff80303d2c in vn_write (fp=0xffffff000f524000, uio=0xffffffff976bfb50, active_cred=0x0, flags=0, 
    td=0xffffff0015b274c0) at vnode_if.h:372
#11 0xffffffff802ba2e5 in dofilewrite (td=0xffffff0015b274c0, fd=3, fp=0xffffff000f524000, auio=0xffffffff976bfb50, 
    offset=0, flags=0) at file.h:253
#12 0xffffffff802ba5e1 in kern_writev (td=0xffffff0015b274c0, fd=3, auio=0xffffffff976bfb50)
    at /usr/src/sys/kern/sys_generic.c:402
#13 0xffffffff802ba6da in write (td=0x0, uap=0x0) at /usr/src/sys/kern/sys_generic.c:326
#14 0xffffffff803c6db2 in syscall (frame=
      {tf_rdi = 3, tf_rsi = 140737488344336, tf_rdx = 3254, tf_rcx = 34367099100, tf_r8 = -2142916576,
       tf_r9 = 140737488344328, tf_rax = 4, tf_rbx = 3254, tf_rbp = 3, tf_r10 = 140737488348256,
       tf_r11 = 140737488347968, tf_r12 = 14100, tf_r13 = 140737488344336, tf_r14 = 0,
       tf_r15 = 140737488348496, tf_trapno = 12, tf_addr = 5248656, tf_flags = 12, tf_err = 2,
       tf_rip = 34367099100, tf_cs = 43, tf_rflags = 643, tf_rsp = 140737488344328, tf_ss = 35})
    at /usr/src/sys/amd64/amd64/trap.c:803
#15 0xffffffff803aeed8 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:270
#16 0x00000008007050dc in ?? ()
Previous frame inner to this frame (corrupt stack?)
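
For anyone following along without the test program: the reproducer described
earlier in the thread (40 threads randomly appending and writing to 10 files on
an NFS-mounted filesystem, while another machine keeps removing the files) can
be approximated with something like the sketch below. This is NOT the original
nfscrash program, which was not posted to the list. The names run_stress,
worker, next_rand, MAX_WRITE and NFSCRASH_STANDALONE, the file name pattern,
the write sizes and the iteration count are all assumptions of mine, chosen
only to illustrate the access pattern.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_WRITE 16384   /* assumed upper bound on a single write */

struct worker_arg {
    const char *dir;      /* directory on the NFS mount */
    int nfiles;           /* number of files shared by all threads */
    int iterations;       /* writes attempted per thread (assumed bound) */
    unsigned int seed;    /* per-thread PRNG state */
    long writes_ok;       /* count of complete write() calls */
};

/* Small per-thread LCG so threads do not share rand() state. */
static unsigned int next_rand(unsigned int *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fff;
}

static void *worker(void *p)
{
    struct worker_arg *a = p;
    char path[1024];
    char buf[MAX_WRITE];

    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < a->iterations; i++) {
        int idx = (int)(next_rand(&a->seed) % (unsigned int)a->nfiles);
        size_t len = 1 + next_rand(&a->seed) % MAX_WRITE;
        int flags = O_WRONLY | O_CREAT;

        /* Randomly choose between appending and writing from offset 0. */
        if (next_rand(&a->seed) & 1)
            flags |= O_APPEND;

        snprintf(path, sizeof(path), "%s/nfscrash.%d", a->dir, idx);
        int fd = open(path, flags, 0644);
        if (fd < 0)
            continue;     /* the file may have just been removed elsewhere */
        if (write(fd, buf, len) == (ssize_t)len)
            a->writes_ok++;
        close(fd);
    }
    return NULL;
}

/* Returns the total number of complete writes, or -1 on setup failure. */
long run_stress(const char *dir, int nthreads, int nfiles, int iterations)
{
    assert(nthreads > 0 && nfiles > 0);

    pthread_t *tids = calloc((size_t)nthreads, sizeof(*tids));
    struct worker_arg *args = calloc((size_t)nthreads, sizeof(*args));
    long total = 0;
    int started = 0;

    if (tids == NULL || args == NULL) {
        free(tids);
        free(args);
        return -1;
    }
    for (int i = 0; i < nthreads; i++) {
        args[i] = (struct worker_arg){ dir, nfiles, iterations,
                                       1000u + (unsigned int)i, 0 };
        if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        total += args[i].writes_ok;
    }
    free(tids);
    free(args);
    return started == nthreads ? total : -1;
}

#ifdef NFSCRASH_STANDALONE
int main(int argc, char **argv)
{
    /* 40 threads and 10 files match the description in the thread. */
    long n = run_stress(argc > 1 ? argv[1] : ".", 40, 10, 100000);
    printf("%ld complete writes\n", n);
    return n < 0;
}
#endif
```

Built with something like "cc -pthread -DNFSCRASH_STANDALONE nfscrash.c" and
pointed at a directory on the NFS mount, this would be run on the client while
a second machine removes the nfscrash.* files in a loop. I have not confirmed
that this particular sketch triggers the panic; it only mirrors the workload
as described.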



