From: Gerrit Nagelhout <gnagelhout@sandvine.com>
To: freebsd-current@freebsd.org
Date: Sat, 19 Jun 2004 18:26:19 -0400
Subject: filesystem deadlocks

I am currently running a stress test with about 30 postgres processes on a dual Xeon with an Adaptec RAID controller. I am trying to reproduce some kernel lockups, but in the process I keep getting into a state where no more I/O activity occurs and all the postgres processes appear to be stuck sleeping on a mutex (making no progress). Some of the time, fsck_ufs is running because of an improper shutdown. The code is based on CURRENT from a couple of weeks ago. After enabling WITNESS, the following messages appear:

Jun 19 18:00:51 TPC-D7-23 lock order reversal
Jun 19 18:00:51 TPC-D7-23  1st 0xcab85294 vm object (vm object) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/swap_pager.c:1313
Jun 19 18:00:51 TPC-D7-23  2nd 0xc0780ba0 swap_pager swhash (swap_pager swhash) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/swap_pager.c:1799
Jun 19 18:00:51 TPC-D7-23  3rd 0xca966108 vm object (vm object) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/uma_core.c:886
Jun 19 18:00:51 TPC-D7-23 Stack backtrace:
Jun 19 18:00:51 TPC-D7-23 backtrace(c06de7a0,ca966108,c06ef9dd,c06ef9dd,c06f05b8) at backtrace+0x17
Jun 19 18:00:51 TPC-D7-23 witness_checkorder(ca966108,9,c06f05b8,376,ca924e00) at witness_checkorder+0x5f3
Jun 19 18:00:51 TPC-D7-23 _mtx_lock_flags(ca966108,0,c06f05b8,376,ca924e14) at _mtx_lock_flags+0x32
Jun 19 18:00:51 TPC-D7-23 obj_alloc(ca924e00,1000,e6897a1b,101,e6897a30) at obj_alloc+0x3f
Jun 19 18:00:51 TPC-D7-23 slab_zalloc(ca924e00,1,ca924e14,8,c06f05b8) at slab_zalloc+0xb3
Jun 19 18:00:51 TPC-D7-23 uma_zone_slab(ca924e00,1,c06f05b8,68f,ca924eb0) at uma_zone_slab+0xda
Jun 19 18:00:51 TPC-D7-23 uma_zalloc_internal(ca924e00,0,1,5c4,1) at uma_zalloc_internal+0x3e
Jun 19 18:00:51 TPC-D7-23 uma_zalloc_arg(ca924e00,0,1,707,2) at uma_zalloc_arg+0x283
Jun 19 18:00:51 TPC-D7-23 swp_pager_meta_build(cab85294,5,0,2,0) at swp_pager_meta_build+0x12e
Jun 19 18:00:51 TPC-D7-23 swap_pager_putpages(cab85294,e6897be0,1,0,e6897b50) at swap_pager_putpages+0x306
Jun 19 18:00:51 TPC-D7-23 default_pager_putpages(cab85294,e6897be0,1,0,e6897b50) at default_pager_putpages+0x2e
Jun 19 18:00:51 TPC-D7-23 vm_pageout_flush(e6897be0,1,0,116,c073bda0) at vm_pageout_flush+0xdb
Jun 19 18:00:51 TPC-D7-23 vm_pageout_clean(c436cb30,0,c06f03a0,33b,0) at vm_pageout_clean+0x2a3
Jun 19 18:00:51 TPC-D7-23 vm_pageout_scan(0,0,c06f03a0,5b7,30d4) at vm_pageout_scan+0x5d5
Jun 19 18:00:51 TPC-D7-23 vm_pageout(0,e6897d48,c06d9172,328,0) at vm_pageout+0x31d
Jun 19 18:00:51 TPC-D7-23 fork_exit(c064ad69,0,e6897d48) at fork_exit+0x77
Jun 19 18:00:51 TPC-D7-23 fork_trampoline() at fork_trampoline+0x8
Jun 19 18:00:51 TPC-D7-23 --- trap 0x1, eip = 0, esp = 0xe6897d7c, ebp = 0 ---

What else can I do to debug this problem further?

A second problem I have noticed (with similar symptoms, i.e. no more I/O, everything blocked) is that all of my postgres processes are sitting in the "wdrain" state. The code that is supposed to wake them up (runningbufwakeup) still gets called on occasion, but runningbufspace never drops back below lorunningspace, so the wakeup is never issued. I don't know whether this is due to a slow leak of runningbufspace or to some deadlock condition. A toy model of how I understand this accounting is included below my signature. Any ideas?

Thanks,
Gerrit Nagelhout
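
P.S. For reference, here is a toy model of how I understand the runningbufspace accounting to work. The names hirunningspace and runningbufreq, the exact comparisons, and the (absent) locking are from memory and almost certainly differ from the real code in sys/kern/vfs_bio.c; it is only meant to show where I would expect the wakeup to come from, and how a leak of runningbufspace would leave everything asleep in "wdrain".

/*
 * Toy model only: not the actual sys/kern/vfs_bio.c code.  The
 * identifiers hirunningspace and runningbufreq and the comparison
 * against lorunningspace are my recollection of how the buffer
 * cache throttles write I/O; treat them as assumptions.
 */
#include <stdio.h>

static long runningbufspace;                    /* write I/O currently in flight */
static long lorunningspace = 512 * 1024;        /* wake sleepers once we fall below this */
static long hirunningspace = 1024 * 1024;       /* start blocking writers above this */
static int  runningbufreq;                      /* nonzero while someone sleeps in "wdrain" */

/* Roughly what happens when a write is issued. */
static void
start_write(long size)
{
	runningbufspace += size;
	if (runningbufspace > hirunningspace) {
		runningbufreq = 1;
		/* the kernel would msleep() here with wmesg "wdrain" */
		printf("would sleep in wdrain, runningbufspace=%ld\n",
		    runningbufspace);
	}
}

/* Roughly what runningbufwakeup() is supposed to do on I/O completion. */
static void
finish_write(long size)
{
	runningbufspace -= size;        /* a "leak" = this never gets subtracted back off */
	if (runningbufreq && runningbufspace < lorunningspace) {
		runningbufreq = 0;
		printf("wakeup, runningbufspace=%ld\n", runningbufspace);
	}
	/*
	 * If completions stop subtracting everything that was added,
	 * runningbufspace never drops back below lorunningspace and
	 * anything asleep in "wdrain" never gets woken up -- which is
	 * the symptom I am seeing.
	 */
}

int
main(void)
{
	start_write(2 * 1024 * 1024);
	finish_write(2 * 1024 * 1024);
	return (0);
}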