From: Gerrit Nagelhout <gnagelhout@sandvine.com>
To: freebsd-current@freebsd.org
Date: Sat, 19 Jun 2004 18:26:19 -0400
Subject: filesystem deadlocks

I am currently running a stress test with about 30 postgres processes on a dual Xeon with an Adaptec RAID controller. I am trying to reproduce some kernel lockups, but in the process I keep getting into a state where no more I/O activity occurs and all the postgres processes appear to be stuck sleeping on a mutex (making no progress). Some of the time, fsck_ufs is running because of an improper shutdown. The code is based on CURRENT from a couple of weeks ago. After enabling WITNESS, the following messages appear:

Jun 19 18:00:51 TPC-D7-23 lock order reversal
Jun 19 18:00:51 TPC-D7-23  1st 0xcab85294 vm object (vm object) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/swap_pager.c:1313
Jun 19 18:00:51 TPC-D7-23  2nd 0xc0780ba0 swap_pager swhash (swap_pager swhash) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/swap_pager.c:1799
Jun 19 18:00:51 TPC-D7-23  3rd 0xca966108 vm object (vm object) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/uma_core.c:886
Jun 19 18:00:51 TPC-D7-23 Stack backtrace:
Jun 19 18:00:51 TPC-D7-23 backtrace(c06de7a0,ca966108,c06ef9dd,c06ef9dd,c06f05b8) at backtrace+0x17
Jun 19 18:00:51 TPC-D7-23 witness_checkorder(ca966108,9,c06f05b8,376,ca924e00) at witness_checkorder+0x5f3
Jun 19 18:00:51 TPC-D7-23 _mtx_lock_flags(ca966108,0,c06f05b8,376,ca924e14) at _mtx_lock_flags+0x32
Jun 19 18:00:51 TPC-D7-23 obj_alloc(ca924e00,1000,e6897a1b,101,e6897a30) at obj_alloc+0x3f
Jun 19 18:00:51 TPC-D7-23 slab_zalloc(ca924e00,1,ca924e14,8,c06f05b8) at slab_zalloc+0xb3
Jun 19 18:00:51 TPC-D7-23 uma_zone_slab(ca924e00,1,c06f05b8,68f,ca924eb0) at uma_zone_slab+0xda
Jun 19 18:00:51 TPC-D7-23 uma_zalloc_internal(ca924e00,0,1,5c4,1) at uma_zalloc_internal+0x3e
Jun 19 18:00:51 TPC-D7-23 uma_zalloc_arg(ca924e00,0,1,707,2) at uma_zalloc_arg+0x283
Jun 19 18:00:51 TPC-D7-23 swp_pager_meta_build(cab85294,5,0,2,0) at swp_pager_meta_build+0x12e
Jun 19 18:00:51 TPC-D7-23 swap_pager_putpages(cab85294,e6897be0,1,0,e6897b50) at swap_pager_putpages+0x306
Jun 19 18:00:51 TPC-D7-23 default_pager_putpages(cab85294,e6897be0,1,0,e6897b50) at default_pager_putpages+0x2e
Jun 19 18:00:51 TPC-D7-23 vm_pageout_flush(e6897be0,1,0,116,c073bda0) at vm_pageout_flush+0xdb
Jun 19 18:00:51 TPC-D7-23 vm_pageout_clean(c436cb30,0,c06f03a0,33b,0) at vm_pageout_clean+0x2a3
Jun 19 18:00:51 TPC-D7-23 vm_pageout_scan(0,0,c06f03a0,5b7,30d4) at vm_pageout_scan+0x5d5
Jun 19 18:00:51 TPC-D7-23 vm_pageout(0,e6897d48,c06d9172,328,0) at vm_pageout+0x31d
Jun 19 18:00:51 TPC-D7-23 fork_exit(c064ad69,0,e6897d48) at fork_exit+0x77
Jun 19 18:00:51 TPC-D7-23 fork_trampoline() at fork_trampoline+0x8
Jun 19 18:00:51 TPC-D7-23 --- trap 0x1, eip = 0, esp = 0xe6897d7c, ebp = 0 ---

What else can I do to debug this problem further?

A second problem I have noticed (with similar symptoms, i.e. no more I/O, everything blocked) is that all of my postgres processes are sitting in the "wdrain" state. The code that is supposed to wake them up (runningbufwakeup) still gets called on occasion, but runningbufspace never drops back below lorunningspace, so the wakeup is never issued. I don't know whether this is due to a slow leak of runningbufspace or to some deadlock condition. A toy model of how I understand this accounting is included below my signature. Any ideas?

Thanks,
Gerrit Nagelhout
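
P.S. For reference, here is a toy model of how I understand the runningbufspace accounting to work. The names hirunningspace and runningbufreq, the exact comparisons, and the (absent) locking are from memory and almost certainly differ from the real code in sys/kern/vfs_bio.c; it is only meant to show where I would expect the wakeup to come from, and how a leak of runningbufspace would leave everything asleep in "wdrain".

/*
 * Toy model only: not the actual sys/kern/vfs_bio.c code.  The
 * identifiers hirunningspace and runningbufreq and the comparison
 * against lorunningspace are my recollection of how the buffer
 * cache throttles write I/O; treat them as assumptions.
 */
#include <stdio.h>

static long runningbufspace;                    /* write I/O currently in flight */
static long lorunningspace = 512 * 1024;        /* wake sleepers once we fall below this */
static long hirunningspace = 1024 * 1024;       /* start blocking writers above this */
static int  runningbufreq;                      /* nonzero while someone sleeps in "wdrain" */

/* Roughly what happens when a write is issued. */
static void
start_write(long size)
{
	runningbufspace += size;
	if (runningbufspace > hirunningspace) {
		runningbufreq = 1;
		/* the kernel would msleep() here with wmesg "wdrain" */
		printf("would sleep in wdrain, runningbufspace=%ld\n",
		    runningbufspace);
	}
}

/* Roughly what runningbufwakeup() is supposed to do on I/O completion. */
static void
finish_write(long size)
{
	runningbufspace -= size;        /* a "leak" = this never gets subtracted back off */
	if (runningbufreq && runningbufspace < lorunningspace) {
		runningbufreq = 0;
		printf("wakeup, runningbufspace=%ld\n", runningbufspace);
	}
	/*
	 * If completions stop subtracting everything that was added,
	 * runningbufspace never drops back below lorunningspace and
	 * anything asleep in "wdrain" never gets woken up -- which is
	 * the symptom I am seeing.
	 */
}

int
main(void)
{
	start_write(2 * 1024 * 1024);
	finish_write(2 * 1024 * 1024);
	return (0);
}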