From: Peter Schuller
To: freebsd-current@freebsd.org
Date: Wed, 22 Jul 2009 19:17:41 +0200
Subject: vm_page_remove() crash on sys_exit() (possibly ZFS related)
Message-ID: <20090722171741.GB17684@hyperion.scode.org>

Hello,

so I finally got my crash dump. I'll include some more history further
down. First off:

   http://distfiles.scode.org/mlref/crashdump_20090722/core.txt.0
   http://distfiles.scode.org/mlref/crashdump_20090722/backtrace.txt

Inline version of backtrace appears below[1] (after background).

So this is a general protection fault in vm_page_remove called
indirectly from sys_exit().
Worth noting is that at least once (the previous crash, without a
dump) I got a "logic" panic rather than a memory error; I'm pretty
sure the panic message was related to page *inserts*. Grepping the
source indicates:

   vm_page.c: panic("vm_page_insert: page already inserted");
   vm_page.c: panic("vm_page_insert: offset already allocated");

However, I could not say for sure whether one of these was indeed the
exact panic I got, and I neither have a crash dump nor was able to see
a stack trace at the time.

Some further background and speculation:

This system is root-on-ZFS where I have been tracking CURRENT for
several months. I updated every month or so, in part to test
improvements to ZFS; specifically the fixes that have gone in for
deadlock/hang issues.

My "test case" is to run a bulk build of all my ports (the port list is
a semi-typical desktop; about 700 or so packages in total). It would
very often hang (before) or crash (now) at least once during such a
build; the firefox build in particular was heavily over-represented,
at least now that I see the crash symptom.

Going back to my tracking of CURRENT: at some point, I think roughly a
couple of months ago by now, I stopped experiencing deadlocks/hangs (or
at least have not seen one yet), but instead began seeing panics. No
longer seeing hangs was expected, because the reason I updated that
particular time, if I recall correctly, was specifically that I
believed that all the work-in-progress ZFS fixes had gone in. However,
I am not 100% sure of the timing.

Since then I've updated a couple more times, most recently to BETA1,
but am still seeing this crash.

Wannabe speculation based on insufficient understanding of the VM
system:

vm_page_remove() requires, according to comments, that the object and
page must be locked.
The actual crash in this case happens when checking m->oflags:

   if (m->oflags & VPO_BUSY) {
           m->oflags &= ~VPO_BUSY;
           vm_page_flash(m);
   }

The "m->oflags & VPO_BUSY" evaluation is the culprit, if line numbers
can be trusted.

If I recall correctly, at least one of the deadlock/hang fixes for ZFS
did involve a change to locking, so I'm thinking the introduction of
the crash may in fact be related to the ZFS fix itself. However, now
that I think about it, perhaps the only locking changes were to vnodes
rather than to vm objects/pages?

Also, interestingly, reading m->object right before succeeds, and so
does the lock assert on the object. Is it possible the vm page was NOT
locked even though m->object was locked?

[1] Inline backtrace:

#0  doadump () at pcpu.h:223
#1  0xffffffff801d248c in db_fncall (dummy1=Variable "dummy1" is not available.
) at /usr/src/sys/ddb/db_command.c:548
#2  0xffffffff801d27c1 in db_command (last_cmdp=0xffffffff80b667a0, cmd_table=Variable "cmd_table" is not available.
) at /usr/src/sys/ddb/db_command.c:445
#3  0xffffffff801d2a10 in db_command_loop () at /usr/src/sys/ddb/db_command.c:498
#4  0xffffffff801d49a9 in db_trap (type=Variable "type" is not available.
) at /usr/src/sys/ddb/db_main.c:229
#5  0xffffffff805b5f25 in kdb_trap (type=9, code=0, tf=0xffffff805b9608d0) at /usr/src/sys/kern/subr_kdb.c:534
#6  0xffffffff80812efd in trap_fatal (frame=0xffffff805b9608d0, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:847
#7  0xffffffff80813a1d in trap (frame=0xffffff805b9608d0) at /usr/src/sys/amd64/amd64/trap.c:639
#8  0xffffffff807f9793 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
#9  0xffffffff807d941f in vm_page_remove (m=0xffffff00bebe7f90) at /usr/src/sys/vm/vm_page.c:730
#10 0xffffffff807d957d in vm_page_free_toq (m=0xffffff00bebe7f90) at /usr/src/sys/vm/vm_page.c:1394
#11 0xffffffff807d7c6b in vm_object_terminate (object=0xffffff0066392948) at /usr/src/sys/vm/vm_object.c:694
#12 0xffffffff807d821c in vm_object_deallocate (object=0xffffff0066392948) at /usr/src/sys/vm/vm_object.c:592
#13 0xffffffff807cfad0 in _vm_map_unlock (map=0xffffff0004811310, file=Variable "file" is not available.
) at /usr/src/sys/vm/vm_map.c:480
#14 0xffffffff807cff8f in vm_map_remove (map=0xffffff0004811310, start=Variable "start" is not available.
) at /usr/src/sys/vm/vm_map.c:2765
#15 0xffffffff807d2e44 in vmspace_exit (td=0xffffff004eb78ab0) at /usr/src/sys/vm/vm_map.c:329
#16 0xffffffff8055a33e in exit1 (td=0xffffff004eb78ab0, rv=0) at /usr/src/sys/kern/kern_exit.c:299
#17 0xffffffff8055b43e in sys_exit (td=Variable "td" is not available.
) at /usr/src/sys/kern/kern_exit.c:110
#18 0xffffffff80813546 in syscall (frame=0xffffff805b960c90) at /usr/src/sys/amd64/amd64/trap.c:984
#19 0xffffffff807f9a20 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
#20 0x000000000047f63c in ?? ()
Previous frame inner to this frame (corrupt stack?)
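
To make the locking question above a bit more concrete, here is a
minimal userland sketch of the precondition as I understand it from the
comment on vm_page_remove(): both the object and the page have to be
locked before the page is detached. This is only a model under that
assumption, not the kernel code. Every name in it (model_object,
model_page, model_page_remove, MODEL_VPO_BUSY) is hypothetical, the
locks are plain booleans rather than real mutexes, and the wakeup that
vm_page_flash() would do is left out.

```c
/* Userland model only; none of these names are FreeBSD kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_VPO_BUSY 0x0001u      /* hypothetical stand-in for VPO_BUSY */

struct model_object {
    bool locked;                    /* stand-in for the object lock */
    int  resident_count;            /* pages currently attached */
};

struct model_page {
    struct model_object *object;    /* owning object, NULL if detached */
    bool     locked;                /* stand-in for whatever lock covers the page */
    unsigned oflags;
};

/*
 * Detach a page from its object.
 * Returns true if the page was detached, false if the page had no
 * object or if either lock was not held by the caller.
 */
static bool
model_page_remove(struct model_page *m)
{
    struct model_object *obj;

    if (m == NULL || (obj = m->object) == NULL)
        return false;

    /* The case I am asking about: object locked, page not locked. */
    if (!obj->locked || !m->locked)
        return false;

    if (m->oflags & MODEL_VPO_BUSY)
        m->oflags &= ~MODEL_VPO_BUSY;   /* real code would also wake waiters */

    obj->resident_count--;
    m->object = NULL;
    return true;
}
```

In this model, calling model_page_remove() with the object locked but
the page unlocked is refused and leaves the page attached. The kernel
has no such soft failure, so if that situation can really occur there,
a concurrent modification of the page would be possible.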

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller '
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org