From: Sean Chittenden
To: current@FreeBSD.org
Date: Mon, 16 Jan 2006 13:27:44 -0800
Subject: amd64 crash in pmap_remove_pages(): page fault
Message-ID: <20060116212744.GJ1683@mailhost.gigave.com>
List-Id: Discussions about the use of FreeBSD-current

Howdy.  I've got a "diskless" PXE boot client that crashes about once
a week with the following backtrace and dump info:

#1  0x0000000000000004 in ?? ()
#2  0xffffffff80257623 in boot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:399
#3  0xffffffff80257c26 in panic (fmt=0xffffff006808f720 "")
    at /usr/src/sys/kern/kern_shutdown.c:555
#4  0xffffffff8039da92 in trap_fatal (frame=0xffffff006808f720,
    eva=18446742975955402752) at /usr/src/sys/amd64/amd64/trap.c:660
#5  0xffffffff8039ddaf in trap_pfault (frame=0xffffffffc72b19e0, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:573
#6  0xffffffff8039e063 in trap (frame=
      {tf_rdi = -36495808960, tf_rsi = -1097607851600, tf_rdx = 0,
       tf_rcx = -1097607851600, tf_r8 = 0, tf_r9 = -1097427386368,
       tf_rax = 0, tf_rbx = -2134343104, tf_rbp = -2138286592,
       tf_r10 = 6447135, tf_r11 = 429809, tf_r12 = -1097607851680,
       tf_r13 = -1097375809288, tf_r14 = 140737488355328, tf_r15 = 0,
       tf_trapno = 12, tf_addr = -36495808960, tf_flags = -2135524400,
       tf_err = 2, tf_rip = -2143710185, tf_cs = 8, tf_rflags = 66118,
       tf_rsp = -953476448, tf_ss = 16}) at /u at /u at /u at /u.c:352
#7  0xffffffff8038d30b in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:168
#8  0xffffffff80399417 in pmap_remove_pages (pmap=0xffffff0071795160, sva=0,
    eva=140737488355328) at /usr/src/sys/amd64/amd64/pmap.c:2590
#9  0xffffffff8023b947 in exit1 (td=0xffffff006808f720, rv=256)
    at vm_map.h:252
#10 0xffffffff8023bc5e in sys_exit (td=0xfffffff780ae2640,
    uap=0xffffff00717951b0) at /usr/src/sys/kern/kern_exit.c:97
#11 0xffffffff8039e8a1 in syscall (frame=
      {tf_rdi = 1, tf_rsi = 34365342200, tf_rdx = 0, tf_rcx = 4,
       tf_r8 = 0, tf_r9 = 59, tf_rax = 1, tf_rbx = 1, tf_rbp = 0,
       tf_r10 = 140737488349248, tf_r11 = 2, tf_r12 = 0,
       tf_r13 = 5520656, tf_r14 = 14400, tf_r15 = 1, tf_trapno = 12,
       tf_addr = 34368227008, tf_flags = 0, tf_err = 2,
       tf_rip = 34368000696, tf_cs = 43, tf_rflags = 518,
       tf_rsp = 140737488349560, tf_ss = 35})
    at /usr/src/sys/amd64/amd64/trap.c:792
#12 0xffffffff8038d4a8 in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:270
#13 0x00000008007e12b8 in ?? ()

Dump header from device /dev/ad4s1b
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 2147024896B (2047 MB)
  Blocksize: 512
  Dumptime: Sun Jan 15 20:54:41 2006
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 6.0-STABLE #1: Wed Jan 4 22:40:57 PST 2006
    sean@host.example.org:/usr/obj/usr/src/sys/WEBHEAD
  Panic String: page fault
  Dump Parity: 1702696226
  Bounds: 42
  Dump Status: good

Any pearls of wisdom as to what's causing this?  My NFS options are:

  rw,tcp,nfsv3,-r=32768,-w=32768

and my kernel config is included below.  I have the cores around if
anyone's interested or if I'm missing something, but if I'm reading
this correctly, it looks like a race condition that is corrupting the
list handled by one of the TAILQ macros:

  amd64/amd64/pmap.c:2590
      TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);

I've been digging around on -stable and -current, and following your
recent work on HEAD, but haven't seen anything that touches this area
of code.  Being a VM rookie, it seems to me that this bug is being
tripped in pmap_remove_pages() but isn't caused by a bug there: the
use of TAILQ_*() in this function looks correct.  With an NFS diskless
root, zero copy sockets, and sendfile(2) all in use on these machines,
there are a number of places where the problem could originate, and
I'm at a loss as to a fix.  Any ideas?

-sc

-- 
Sean Chittenden
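
For anyone reading the trap frame in frame #6: gdb prints the registers as signed decimals, which hides the addresses. The sketch below only does the conversion and decodes the low bits of tf_err according to the x86 page-fault error code layout. The helper names and the PGEX_* macros are defined locally for illustration and are not taken from the kernel sources.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Low bits of the x86 page-fault error code (defined locally for this sketch). */
#define PGEX_P 0x01 /* set: protection violation; clear: page not present */
#define PGEX_W 0x02 /* set: the faulting access was a write */
#define PGEX_U 0x04 /* set: the fault was taken in user mode */

/* Reinterpret a signed 64-bit value, as printed by gdb, as an address. */
static uint64_t
reg_to_addr(int64_t v)
{
	return (uint64_t)v;
}

static int fault_is_write(unsigned err)       { return (err & PGEX_W) != 0; }
static int fault_is_not_present(unsigned err) { return (err & PGEX_P) == 0; }
static int fault_is_kernel(unsigned err)      { return (err & PGEX_U) == 0; }

/* Print one trap frame summary; the arguments are the raw gdb values. */
static void
print_fault(int64_t tf_addr, int64_t tf_rip, unsigned tf_err)
{
	printf("fault addr 0x%016" PRIx64 ", rip 0x%016" PRIx64 ": %s %s, %s mode\n",
	    reg_to_addr(tf_addr), reg_to_addr(tf_rip),
	    fault_is_write(tf_err) ? "write to" : "read from",
	    fault_is_not_present(tf_err) ? "unmapped page" : "protected page",
	    fault_is_kernel(tf_err) ? "kernel" : "user");
}
```

Calling print_fault(-36495808960LL, -2143710185LL, 2) with the values from frame #6 gives a fault address of 0xfffffff780ae2640 and an rip of 0xffffffff80399417, i.e. a kernel-mode write to an unmapped page at the return address shown for pmap_remove_pages() in frame #8. That is at least consistent with a store through a bad list pointer, though it does not by itself say how the pointer went bad.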
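
To illustrate why a page fault can surface at TAILQ_REMOVE() even when the caller uses the macro correctly: the macro stores through the element's own tqe_next/tqe_prev pointers, so an entry whose links were clobbered elsewhere makes the removal write to a bogus address. The userland sketch below uses a hypothetical stand-in for pv_entry (only the list linkage is modelled) and checks each entry's links against its neighbours before unlinking it. It is a debugging aid under those assumptions, not the code in pmap.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-in for the kernel's pv_entry; only the linkage matters. */
struct pv_entry {
	int id;
	TAILQ_ENTRY(pv_entry) pv_list;
};
TAILQ_HEAD(pv_head, pv_entry);

/* Return 1 if pv's links agree with its neighbours (or the head), else 0. */
static int
pv_links_ok(struct pv_head *head, struct pv_entry *pv)
{
	struct pv_entry *next = TAILQ_NEXT(pv, pv_list);

	if (*pv->pv_list.tqe_prev != pv)	/* predecessor must point at us */
		return 0;
	if (next != NULL)			/* successor must point back */
		return next->pv_list.tqe_prev == &pv->pv_list.tqe_next;
	return head->tqh_last == &pv->pv_list.tqe_next;
}

/* Append n freshly allocated entries; return 0 on success, -1 on failure. */
static int
pv_fill(struct pv_head *head, int n)
{
	assert(head != NULL);
	for (int i = 0; i < n; i++) {
		struct pv_entry *pv = malloc(sizeof(*pv));
		if (pv == NULL)
			return -1;
		pv->id = i;
		TAILQ_INSERT_TAIL(head, pv, pv_list);
	}
	return 0;
}

/*
 * Remove and free every entry, fetching the successor before unlinking.
 * Returns the number removed, or -1 as soon as an inconsistent link is
 * seen (instead of letting TAILQ_REMOVE write through it).
 */
static int
pv_drain(struct pv_head *head)
{
	struct pv_entry *pv, *npv;
	int removed = 0;

	for (pv = TAILQ_FIRST(head); pv != NULL; pv = npv) {
		npv = TAILQ_NEXT(pv, pv_list);
		if (!pv_links_ok(head, pv))
			return -1;
		TAILQ_REMOVE(head, pv, pv_list);
		free(pv);
		removed++;
	}
	return removed;
}
```

After TAILQ_INIT(&head) and pv_fill(&head, 4), pv_drain(&head) returns 4 and leaves the list empty. If some other code path overwrites an entry's tqe_prev in between, pv_drain() returns -1 at the damaged link rather than faulting. A check of this kind in a debug kernel would show whether the list is already inconsistent on entry to pmap_remove_pages(), which is what the race theory predicts.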