From owner-freebsd-current@FreeBSD.ORG  Mon Jan 16 22:11:09 2006
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
X-Original-To: current@FreeBSD.org
Delivered-To: freebsd-current@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id EA61316A41F
	for <current@FreeBSD.org>; Mon, 16 Jan 2006 22:11:08 +0000 (GMT)
	(envelope-from sean@gigave.com)
Received: from mailhost.gigave.com (mailhost.gigave.com [38.113.228.14])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B039B43D46
	for <current@FreeBSD.org>; Mon, 16 Jan 2006 22:11:08 +0000 (GMT)
	(envelope-from sean@gigave.com)
Resent-From: Sean Chittenden <sean@gigave.com>
Resent-Date: Mon, 16 Jan 2006 14:11:07 -0800
Resent-Message-ID: <20060116221107.GO1683@arrowstrike.sj1.gigave.com>
Resent-To: current@FreeBSD.org
Date: Mon, 16 Jan 2006 13:27:44 -0800
From: Sean Chittenden <sean@gigave.com>
To: current@FreeBSD.org
Message-ID: <20060116212744.GJ1683@mailhost.gigave.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Cc: 
Subject: amd64 crash in pmap_remove_pages(): page fault
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 16 Jan 2006 22:11:09 -0000

Howdy.  I've got a "diskless" PXE boot client that is crashing about
once a week or so with the following backtrace and info:

#1  0x0000000000000004 in ?? ()
#2  0xffffffff80257623 in boot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:399
#3  0xffffffff80257c26 in panic (fmt=0xffffff006808f720 "")
    at /usr/src/sys/kern/kern_shutdown.c:555
#4  0xffffffff8039da92 in trap_fatal (frame=0xffffff006808f720,
    eva=18446742975955402752) at /usr/src/sys/amd64/amd64/trap.c:660
#5  0xffffffff8039ddaf in trap_pfault (frame=0xffffffffc72b19e0, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:573
#6  0xffffffff8039e063 in trap (frame=
      {tf_rdi = -36495808960, tf_rsi = -1097607851600, tf_rdx = 0, tf_rcx = -1097607851600, tf_r8 = 0, tf_r9 = -1097427386368, tf_rax = 0, tf_rbx = -2134343104, tf_rbp = -2138286592, tf_r10 = 6447135, tf_r11 = 429809, tf_r12 = -1097607851680, tf_r13 = -1097375809288, tf_r14 = 140737488355328, tf_r15 = 0, tf_trapno = 12, tf_addr = -36495808960, tf_flags = -2135524400, tf_err = 2, tf_rip = -2143710185, tf_cs = 8, tf_rflags = 66118, tf_rsp = -953476448, tf_ss = 16})
    at /u.c:352
#7  0xffffffff8038d30b in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:168
#8  0xffffffff80399417 in pmap_remove_pages (pmap=0xffffff0071795160, sva=0, 
    eva=140737488355328) at /usr/src/sys/amd64/amd64/pmap.c:2590
#9  0xffffffff8023b947 in exit1 (td=0xffffff006808f720, rv=256) at vm_map.h:252
#10 0xffffffff8023bc5e in sys_exit (td=0xfffffff780ae2640, 
    uap=0xffffff00717951b0) at /usr/src/sys/kern/kern_exit.c:97
#11 0xffffffff8039e8a1 in syscall (frame=
      {tf_rdi = 1, tf_rsi = 34365342200, tf_rdx = 0, tf_rcx = 4, tf_r8 = 0, tf_r9 = 59, tf_rax = 1, tf_rbx = 1, tf_rbp = 0, tf_r10 = 140737488349248, tf_r11 = 2, tf_r12 = 0, tf_r13 = 5520656, tf_r14 = 14400, tf_r15 = 1, tf_trapno = 12, tf_addr = 34368227008, tf_flags = 0, tf_err = 2, tf_rip = 34368000696, tf_cs = 43, tf_rflags = 518, tf_rsp = 140737488349560, tf_ss = 35})
    at /usr/src/sys/amd64/amd64/trap.c:792
#12 0xffffffff8038d4a8 in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:270
#13 0x00000008007e12b8 in ?? ()
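
gdb prints the trap frame registers in frame #6 as signed decimals, which
makes them hard to compare with the addresses in the rest of the trace.
A minimal sketch of decoding them follows. The helper names
(reg_to_addr, describe_fault) and the PF_* macro names are made up for
this example; only the bit meanings of the x86 page-fault error code are
architectural.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* x86 page-fault error code bits (names are local to this sketch). */
#define PF_PRESENT 0x1u /* 0: page not present, 1: protection violation */
#define PF_WRITE   0x2u /* 0: read access,      1: write access */
#define PF_USER    0x4u /* 0: kernel mode,      1: user mode */

/* Reinterpret a register that gdb printed as a signed decimal. */
static uint64_t
reg_to_addr(int64_t reg)
{
	return ((uint64_t)reg);
}

/* Write a one-line summary of a page-fault trap frame into buf. */
static int
describe_fault(char *buf, size_t len, int64_t tf_addr, int64_t tf_rip,
    unsigned tf_err)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len,
	    "%s-mode %s, %s page, addr 0x%016" PRIx64 ", rip 0x%016" PRIx64,
	    (tf_err & PF_USER) ? "user" : "kernel",
	    (tf_err & PF_WRITE) ? "write" : "read",
	    (tf_err & PF_PRESENT) ? "present" : "non-present",
	    reg_to_addr(tf_addr), reg_to_addr(tf_rip)));
}
```

With the frame #6 values (tf_addr = -36495808960, tf_rip = -2143710185,
tf_err = 2), the faulting address is 0xfffffff780ae2640 and the
instruction pointer is 0xffffffff80399417, the same address gdb reports
for pmap_remove_pages() in frame #8.  tf_err = 2 decodes to a kernel-mode
write to a non-present page.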

Dump header from device /dev/ad4s1b
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 2147024896B (2047 MB)
  Blocksize: 512
  Dumptime: Sun Jan 15 20:54:41 2006
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 6.0-STABLE #1: Wed Jan  4 22:40:57 PST 2006
    sean@host.example.org:/usr/obj/usr/src/sys/WEBHEAD
  Panic String: page fault
  Dump Parity: 1702696226
  Bounds: 42
  Dump Status: good

Any pearls of wisdom as to what's causing this?  My NFS options are
rw,tcp,nfsv3,-r=32768,-w=32768 and my kernel config is included below.
I have the cores around if anyone's interested or I'm missing
something, but if I'm looking at this correctly, it seems like a race
condition that is causing a problem with one of the TAILQ macros.

amd64/amd64/pmap.c:2590
    TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
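
For context on why a page fault can surface at that line: TAILQ_REMOVE()
stores through the removed element's own link pointers (the successor's
back pointer, or the head's tail pointer, and then the location named by
the element's tqe_prev).  If an entry's links are stale, those stores go
to whatever address the links hold.  Below is a userland sketch of the
well-behaved case using <sys/queue.h>; struct demo_pv and demo_remove()
are hypothetical stand-ins for illustration, not the kernel's pv_entry
code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a pv entry; not the kernel's struct. */
struct demo_pv {
	int id;
	TAILQ_ENTRY(demo_pv) pv_list;
};
TAILQ_HEAD(demo_pv_head, demo_pv);

/* Unlink pv from head and return how many entries remain. */
static int
demo_remove(struct demo_pv_head *head, struct demo_pv *pv)
{
	struct demo_pv *it;
	int n = 0;

	assert(head != NULL && pv != NULL);
	/*
	 * Writes through pv->pv_list.tqe_next and pv->pv_list.tqe_prev.
	 * Only safe while pv is still linked on this list.
	 */
	TAILQ_REMOVE(head, pv, pv_list);
	TAILQ_FOREACH(it, head, pv_list)
		n++;
	return (n);
}
```

Removing the middle entry of a three-entry list leaves the first entry
pointing directly at the third.  Removing an entry that was already
unlinked, or whose neighbours were freed, is the kind of misuse that
would send those stores to a bad address.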

I've been digging around on -stable and -current, and following your
recent work on HEAD, but haven't seen anything that touches this area
of code.  As a VM rookie, it seems to me that this bug is being
tripped in pmap_remove_pages() but isn't caused by a bug there: the
TAILQ_*() usage in this function looks correct.  With NFS diskless
root, zero copy sockets, and sendfile(2) in use on these machines,
there are a number of places for potential problems and I'm at a loss
as to a fix.  Any ideas?  -sc

-- 
Sean Chittenden