From owner-freebsd-current@FreeBSD.ORG  Wed Jul 22 17:17:43 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E257E1065694
	for <freebsd-current@freebsd.org>; Wed, 22 Jul 2009 17:17:43 +0000 (UTC)
	(envelope-from peter.schuller@infidyne.com)
Received: from hyperion.scode.org (cl-1361.ams-04.nl.sixxs.net
	[IPv6:2001:960:2:550::2])
	by mx1.freebsd.org (Postfix) with ESMTP id 4F9C98FC26
	for <freebsd-current@freebsd.org>; Wed, 22 Jul 2009 17:17:43 +0000 (UTC)
	(envelope-from peter.schuller@infidyne.com)
Received: from hyperion.scode.org (hyperion.scode.org [85.17.42.115])
	by hyperion.scode.org (Postfix) with ESMTPS id 92FCD23C45D
	for <freebsd-current@freebsd.org>;
	Wed, 22 Jul 2009 19:17:42 +0200 (CEST)
Date: Wed, 22 Jul 2009 19:17:41 +0200
From: Peter Schuller <peter.schuller@infidyne.com>
To: freebsd-current@freebsd.org
Message-ID: <20090722171741.GB17684@hyperion.scode.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="s2ZSL+KKDSLx8OML"
Content-Disposition: inline
User-Agent: Mutt/1.5.20 (2009-06-14)
Subject: vm_page_remove() crash on sys_exit() (possibly ZFS related)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 22 Jul 2009 17:17:44 -0000


--s2ZSL+KKDSLx8OML
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hello,

so I finally got my crash dump. I'll include some more history further
down. First off:

   http://distfiles.scode.org/mlref/crashdump_20090722/core.txt.0
   http://distfiles.scode.org/mlref/crashdump_20090722/backtrace.txt

Inline version of backtrace appears below[1] (after background).

So this is a general protection fault in vm_page_remove called
indirectly from sys_exit(). Worth noting is that at least once (the
previous crash, without a dump) I got a "logic" panic rather than a
memory error; I'm pretty sure the panic message was related to page
*inserts*. Grepping the source indicates:

  vm_page.c:              panic("vm_page_insert: page already inserted");
  vm_page.c:                      panic("vm_page_insert: offset already allocated");

However I could not say for sure whether one of these was indeed the
exact panic I got; I have no crash dump from that occasion and was not
able to see a stack trace at the time.

Some further background and speculation:

This system is root-on-ZFS where I have been tracking CURRENT for
several months. I updated every month or so in part to test
improvements to ZFS; specifically the fixes that have gone in for
deadlock/hang issues.

My "test case" is to run bulk building of all my ports (the port list
is a semi-typical desktop; about 700 or so packages in total). It
would very often hang (before) or crash (now) at least once during
such a build; the building of firefox was in particular extremely
over-represented, at least now that I see the crash symptom.

Going back to my tracking of current, at some point, I think roughly a
couple of months ago by now, I stopped experiencing deadlocks/hangs
(or at least have not seen any yet), but instead began seeing
panics. No longer seeing hangs was expected because the reason I
updated that particular time, if I recall correctly, was specifically
that I believed that all the work-in-progress ZFS fixes had gone
in. However I am not 100% sure of the timing.

Since then I've updated a couple of times more, most recently to
BETA1, but am still seeing this crash.

Wannabe speculation based on insufficient understanding of the VM
system:

vm_page_remove() requires, according to comments, that the object and
page must be locked. The actual crash in this case happens when
checking m->oflags:

        if (m->oflags & VPO_BUSY) {
                m->oflags &= ~VPO_BUSY;
                vm_page_flash(m);
        }

The "m->oflags & VPO_BUSY" evaluation is the culprit, if line numbers
can be trusted.
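
To make the sequence in question concrete, here is a minimal user-space
sketch of the accesses at the top of vm_page_remove(): read m->object,
assert the object lock, then test and clear VPO_BUSY. It is only a model
under stated assumptions, not the kernel code: every name ending in
_model is a hypothetical stand-in, the VPO_BUSY value is assumed, and the
object lock is reduced to a plain flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; NOT the kernel's real definitions. */
#define VPO_BUSY 0x0001u	/* assumed value, for illustration only */

struct vm_object_model {
	int locked;		/* 1 while the modelled object lock is held */
};

struct vm_page_model {
	struct vm_object_model *object;
	unsigned short oflags;
	int wakeups;		/* counts modelled vm_page_flash() calls */
};

/* Stand-in for vm_page_flash(): only records that waiters would be woken. */
static void
page_flash_model(struct vm_page_model *m)
{
	m->wakeups++;
}

/*
 * Models the order of accesses discussed in this mail:
 *   1. read m->object (the NULL check is part of this model)
 *   2. assert that the object lock is held
 *   3. test and clear VPO_BUSY in m->oflags
 */
static void
page_remove_prologue_model(struct vm_page_model *m)
{
	struct vm_object_model *object;

	if ((object = m->object) == NULL)
		return;
	assert(object->locked);
	if (m->oflags & VPO_BUSY) {
		m->oflags &= ~VPO_BUSY;
		page_flash_model(m);
	}
}
```

In this model step 3 touches the same structure that step 1 already read
from, which is why a fault at exactly that point looks odd to me.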

If I recall correctly, at least one of the deadlock/hang fixes for ZFS
did involve a change to locking, so I'm thinking the introduction of
the crashing may in fact be related to the ZFS fix itself. However now
that I think about it perhaps the only locking changes were vnode ones
rather than vm objects/pages? Also, interestingly, the read of m->object
right before that succeeds, and so does the lock assert on the object.

Is it possible the vm page was NOT locked even though m->object was
locked?

[1] Inline backtrace:

#0  doadump () at pcpu.h:223
#1  0xffffffff801d248c in db_fncall (dummy1=Variable "dummy1" is not available.
) at /usr/src/sys/ddb/db_command.c:548
#2  0xffffffff801d27c1 in db_command (last_cmdp=0xffffffff80b667a0, cmd_table=Variable "cmd_table" is not available.
) at /usr/src/sys/ddb/db_command.c:445
#3  0xffffffff801d2a10 in db_command_loop () at /usr/src/sys/ddb/db_command.c:498
#4  0xffffffff801d49a9 in db_trap (type=Variable "type" is not available.
) at /usr/src/sys/ddb/db_main.c:229
#5  0xffffffff805b5f25 in kdb_trap (type=9, code=0, tf=0xffffff805b9608d0) at /usr/src/sys/kern/subr_kdb.c:534
#6  0xffffffff80812efd in trap_fatal (frame=0xffffff805b9608d0, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:847
#7  0xffffffff80813a1d in trap (frame=0xffffff805b9608d0) at /usr/src/sys/amd64/amd64/trap.c:639
#8  0xffffffff807f9793 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
#9  0xffffffff807d941f in vm_page_remove (m=0xffffff00bebe7f90) at /usr/src/sys/vm/vm_page.c:730
#10 0xffffffff807d957d in vm_page_free_toq (m=0xffffff00bebe7f90) at /usr/src/sys/vm/vm_page.c:1394
#11 0xffffffff807d7c6b in vm_object_terminate (object=0xffffff0066392948) at /usr/src/sys/vm/vm_object.c:694
#12 0xffffffff807d821c in vm_object_deallocate (object=0xffffff0066392948) at /usr/src/sys/vm/vm_object.c:592
#13 0xffffffff807cfad0 in _vm_map_unlock (map=0xffffff0004811310, file=Variable "file" is not available.
) at /usr/src/sys/vm/vm_map.c:480
#14 0xffffffff807cff8f in vm_map_remove (map=0xffffff0004811310, start=Variable "start" is not available.
) at /usr/src/sys/vm/vm_map.c:2765
#15 0xffffffff807d2e44 in vmspace_exit (td=0xffffff004eb78ab0) at /usr/src/sys/vm/vm_map.c:329
#16 0xffffffff8055a33e in exit1 (td=0xffffff004eb78ab0, rv=0) at /usr/src/sys/kern/kern_exit.c:299
#17 0xffffffff8055b43e in sys_exit (td=Variable "td" is not available.
) at /usr/src/sys/kern/kern_exit.c:110
#18 0xffffffff80813546 in syscall (frame=0xffffff805b960c90) at /usr/src/sys/amd64/amd64/trap.c:984
#19 0xffffffff807f9a20 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
#20 0x000000000047f63c in ?? ()
Previous frame inner to this frame (corrupt stack?)
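
As a cross-check on the fault type: frame #5 shows kdb_trap() being
called with a trap type of 9. The small sketch below maps that number to
a name. The two constants are what I believe the amd64 machine/trap.h
defines (worth verifying against the header in your tree), and
trap_type_name() is a hypothetical helper written only for this mail.

```c
/*
 * Trap numbers as I believe they appear in FreeBSD's amd64
 * <machine/trap.h>; only the two relevant here are listed, and the
 * values should be verified against the actual header.
 */
#define T_PROTFLT	9	/* protection fault */
#define T_PAGEFLT	12	/* page fault */

/* Hypothetical helper: turn a trap type into a human-readable name. */
static const char *
trap_type_name(int type)
{
	switch (type) {
	case T_PROTFLT:
		return "general protection fault";
	case T_PAGEFLT:
		return "page fault";
	default:
		return "other";
	}
}
```

With those values, trap_type_name(9) yields "general protection fault",
which matches the description at the top of this mail.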


-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org


--s2ZSL+KKDSLx8OML
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.12 (FreeBSD)

iEYEARECAAYFAkpnSbMACgkQDNor2+l1i31stwCcDVn4u/Do7JwnSwG9AUO+k3AQ
xXIAnimLX6qk7uDVtQrl/dlzX83y20nN
=dU7K
-----END PGP SIGNATURE-----

--s2ZSL+KKDSLx8OML--