From owner-freebsd-stable@FreeBSD.ORG Thu Feb 26 21:22:18 2009 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BA14E106564A; Thu, 26 Feb 2009 21:22:18 +0000 (UTC) (envelope-from ghelmer@palisadesys.com) Received: from cetus.palisadesys.com (cetus.palisadesys.com [205.237.115.21]) by mx1.freebsd.org (Postfix) with ESMTP id 79BE18FC21; Thu, 26 Feb 2009 21:22:18 +0000 (UTC) (envelope-from ghelmer@palisadesys.com) Received: from cancer.palisadesys.com (serverwatch [172.16.1.98]) by cetus.palisadesys.com (8.14.3/8.14.3) with ESMTP id n1QLMHgK092047; Thu, 26 Feb 2009 15:22:17 -0600 (CST) (envelope-from ghelmer@palisadesys.com) Received: from [172.16.2.242] (cetus.palisadesys.com [205.237.115.21]) (authenticated bits=0) by cancer.palisadesys.com (8.14.2/8.14.2) with ESMTP id n1QLMIsU032346; Thu, 26 Feb 2009 15:22:18 -0600 (CST) (envelope-from ghelmer@palisadesys.com) Message-ID: <49A70807.9020600@palisadesys.com> Date: Thu, 26 Feb 2009 15:22:15 -0600 From: Guy Helmer User-Agent: Thunderbird 2.0.0.19 (Windows/20081209) MIME-Version: 1.0 To: John Baldwin References: <49A46AB4.3080003@palisadesys.com> <200902250915.57562.jhb@freebsd.org> <49A55D78.1070604@palisadesys.com> <200902261012.24325.jhb@freebsd.org> In-Reply-To: <200902261012.24325.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-3.0 (cancer.palisadesys.com [205.237.115.20]); Thu, 26 Feb 2009 15:22:18 -0600 (CST) X-Palisade-MailScanner-Information: Please contact the ISP for more information X-Palisade-MailScanner: Found to be clean X-Palisade-MailScanner-SpamCheck: not spam (whitelisted), SpamAssassin (not cached, score=-3.799, required 6, ALL_TRUSTED -1.80, BAYES_00 -2.60, J_CHICKENPOX_48 0.60) X-Palisade-MailScanner-From: ghelmer@palisadesys.com Cc: freebsd-stable@freebsd.org Subject: Re: 7.1 hangs in cache_lookup mutex? X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 26 Feb 2009 21:22:19 -0000 John Baldwin wrote: > On Wednesday 25 February 2009 10:02:16 am Guy Helmer wrote: > >> John Baldwin wrote: >> >>> On Tuesday 24 February 2009 4:46:28 pm Guy Helmer wrote: >>> >>> >>>> I think I may have found a clue regarding some of the hangs I'm seeing >>>> on FreeBSD 7.1. >>>> I have a program (kvoop), compiled under FreeBSD 6 and using >>>> compatibility libraries under FreeBSD 7, that seems to be consistently >>>> involved during these hangs. This time, I noticed that many processes >>>> are stopped, waiting on the ufs lock. I can't tell for certain, but I >>>> assume this kvoop process 33203 is blocking the other processes. The >>>> kvoop process looks to me like it is waiting for a mutex, but nothing >>>> shows up being deadlocked. Am I on the right track? Is there something >>>> else I should be looking for? >>>> >>>> (kgdb) proc 33203 >>>> [Switching to thread 129 (Thread 100241)]#0 sched_switch ( >>>> td=0xffffff0019109a50, newtd=0x0, flags=1) >>>> at ../../../kern/sched_ule.c:1944 >>>> 1944 cpuid = PCPU_GET(cpuid); >>>> (kgdb) where >>>> #0 sched_switch (td=0xffffff0019109a50, newtd=0x0, flags=1) >>>> at ../../../kern/sched_ule.c:1944 >>>> #1 0xffffffff803a573b in mi_switch (flags=1, newtd=0x0) >>>> at ../../../kern/kern_synch.c:440 >>>> #2 0xffffffff803d1214 in turnstile_wait (ts=Variable "ts" is not available. >>>> ) >>>> at ../../../kern/subr_turnstile.c:748 >>>> #3 0xffffffff80392db0 in _mtx_lock_sleep (m=0xffffffff8083c220, >>>> tid=18446742974618442320, opts=Variable "opts" is not available. >>>> ) at ../../../kern/kern_mutex.c:420 >>>> #4 0xffffffff80392ea6 in _mtx_lock_flags (m=0xffffffff8083c220, opts=0, >>>> file=0xffffffff805bd438 "../../../kern/vfs_cache.c", line=327) >>>> at ../../../kern/kern_mutex.c:186 >>>> #5 0xffffffff80403bc6 in cache_lookup (dvp=0xffffff00013e0dc8, >>>> vpp=0xffffffffa2d93a18, cnp=0xffffffffa2d93a40) >>>> at ../../../kern/vfs_cache.c:327 >>>> #6 0xffffffff80404093 in vfs_cache_lookup (ap=Variable "ap" is not >>>> available. >>>> ) >>>> at ../../../kern/vfs_cache.c:674 >>>> #7 0xffffffff805628a0 in VOP_LOOKUP_APV (vop=0xffffffff8076e440, >>>> a=0xffffffffa2d936b0) at vnode_if.c:99 >>>> #8 0xffffffff80409afd in lookup (ndp=0xffffffffa2d939f0) at vnode_if.h:57 >>>> #9 0xffffffff8040a824 in namei (ndp=0xffffffffa2d939f0) >>>> at ../../../kern/vfs_lookup.c:219 >>>> #10 0xffffffff8041dc4f in vn_open_cred (ndp=0xffffffffa2d939f0, >>>> flagp=0xffffffffa2d9393c, cmode=0, cred=0xffffff0001588600, >>>> fp=0xffffff0001964200) at ../../../kern/vfs_vnops.c:188 >>>> #11 0xffffffff8041ccc7 in kern_open (td=0xffffff0019109a50, >>>> path=0xffffac30
, pathseg=Variable >>>> "pathseg" is not available. >>>> ) >>>> at ../../../kern/vfs_syscalls.c:1032 >>>> #12 0xffffffff80544660 in ia32_syscall (frame=0xffffffffa2d93c80) >>>> at ../../../amd64/ia32/ia32_syscall.c:182 >>>> #13 0xffffffff805014d0 in Xint0x80_syscall () at ia32_exception.S:65 >>>> #14 0x0000000028284037 in ?? () >>>> >>>> (kgdb) frame 4 >>>> #4 0xffffffff80392ea6 in _mtx_lock_flags (m=0xffffffff8083c220, opts=0, >>>> file=0xffffffff805bd438 "../../../kern/vfs_cache.c", line=327) >>>> at ../../../kern/kern_mutex.c:186 >>>> 186 _get_sleep_lock(m, curthread, opts, file, line); >>>> (kgdb) list >>>> 181 ("mtx_lock() of spin mutex %s @ %s:%d", >>>> m->lock_object.lo_name, >>>> 182 file, line)); >>>> 183 WITNESS_CHECKORDER(&m->lock_object, opts | LOP_NEWORDER >>>> | LOP_EXCLUSIVE, >>>> 184 file, line); >>>> 185 >>>> 186 _get_sleep_lock(m, curthread, opts, file, line); >>>> 187 LOCK_LOG_LOCK("LOCK", &m->lock_object, opts, >>>> m->mtx_recurse, file, >>>> 188 line); >>>> 189 WITNESS_LOCK(&m->lock_object, opts | LOP_EXCLUSIVE, >>>> file, line); >>>> 190 curthread->td_locks++; >>>> >>>> (kgdb) print *m >>>> $2 = {lock_object = {lo_name = 0xffffffff805bd5d2 "Name Cache", >>>> lo_type = 0xffffffff805bd5d2 "Name Cache", lo_flags = 16973824, >>>> lo_witness_data = {lod_list = {stqe_next = 0xffffffff807cd600}, >>>> lod_witness = 0xffffffff807cd600}}, mtx_lock = 4, mtx_recurse = 0} >>>> >>>> >>> Is this from a coredump or while the system is live? mtx_lock = 4 indicates >>> the mutex is unlocked, so there shouldn't be any threads waiting on it. >>> >>> >>> >> That was from running kgdb on a vmcore file. Would the state of the >> mutex be different than if I was running kgdb on a live kernel? >> > > Well, if it was live perhaps the thread wasn't hung but was changing states > while you were watching it. However, that is not true in a coredump. This is > a very disturbing bug as it means something is broken in the mutex implementation > itself. Have you reproduced this on multiple machines? > > The trace above is from a Supermicro X6DHR-8G motherboard with dual 3.6GHz Nocona Xeons running the amd64 architecture w/ SCHED_ULE. Here is a similar situation on a completely different machine running the same process load -- Supermicro P4DPR-8G2 with dual 2.4GHz Xeons -- running the i386 architecture w/ SCHED_ULE. Process 23092 has exclusive sx "user map" and is waiting on mutex "vm page queue mutex"; a bunch of other processes are stuck on allproc because a vmstat process has the allproc lock and is waiting on the "user map" exclusive lock held by proc 23092. First, from ddb: db> show alllocks Process 23110 (vmstat) thread 0xc5e3b230 (100181) shared sx allproc r = 0 (0xc0907edc) locked @ vm/vm_meter.c:130 exclusive sx sysctl lock r = 0 (0xc09083d4) locked @ kern/kern_sysctl.c:1397 Process 23092 (kvoop) thread 0xc60d2d20 (100208) exclusive sleep mutex vm object (standard object) r = 0 (0xc6043000) locked @ vm/vm_map.c:1476 exclusive sx user map r = 0 (0xc49eec0c) locked @ vm/vm_map.c:1211 Process 23090 (filter) thread 0xc5e2f000 (100199) shared sx filedesc structure r = 0 (0xc5e4cb2c) locked @ kern/kern_descrip.c:2062 Process 22521 (postgres) thread 0xc60d3000 (100207) exclusive sx so_rcv_sx r = 0 (0xc6174bec) locked @ kern/uipc_sockbuf.c:148 Process 7257 (tcsh) thread 0xc47db8c0 (100077) shared sx proctree r = 0 (0xc0907ef4) locked @ kern/kern_fork.c:300 Process 7253 (sshd) thread 0xc60d2690 (100211) exclusive sx so_rcv_sx r = 0 (0xc4c193cc) locked @ kern/uipc_sockbuf.c:148 Process 4570 (postgres) thread 0xc4ca68c0 (100139) exclusive sx so_rcv_sx r = 0 (0xc5f77d8c) locked @ kern/uipc_sockbuf.c:148 Process 2225 (postgres) thread 0xc47e1460 (100078) exclusive sx so_rcv_sx r = 0 (0xc4cca3cc) locked @ kern/uipc_sockbuf.c:148 Process 2128 (postgres) thread 0xc49e6000 (100128) exclusive sx so_rcv_sx r = 0 (0xc4cca3cc) locked @ kern/uipc_sockbuf.c:148 Process 2128 (postgres) thread 0xc49e6000 (100128) exclusive sx so_rcv_sx r = 0 (0xc491bd8c) locked @ kern/uipc_sockbuf.c:148 Process 2112 (perform_ca) thread 0xc47dcd20 (100051) shared sx proctree r = 0 (0xc0907ef4) locked @ kern/kern_fork.c:300 Process 1914 (postgres) thread 0xc49ead20 (100115) exclusive sx so_rcv_sx r = 0 (0xc491c70c) locked @ kern/uipc_sockbuf.c:148 Process 1499 (ph_alert_mgr) thread 0xc47db460 (100080) shared sx proctree r = 0 (0xc0907ef4) locked @ kern/kern_fork.c:300 Process 1482 (postgres) thread 0xc49c4d20 (100105) shared sx proctree r = 0 (0xc0907ef4) locked @ kern/kern_fork.c:300 db> show sleepchain 23110 thread 100181 (pid 23110, vmstat) blocked on sx "user map" XLOCK thread 100208 (pid 23092, kvoop) is on a run queue db> show sleepchain 23092 thread 100208 (pid 23092, kvoop) is on a run queue db> show sleepchain 23090 thread 100199 (pid 23090, filter) running on CPU 3 db> show sleepchain 22521 thread 100207 (pid 22521, postgres) sleeping on 0xc6174c1c "sbwait" db> show sleepchain 7257 thread 100077 (pid 7257, tcsh) blocked on sx "allproc" SLOCK (count 1) db> show sleepchain 2112 thread 100051 (pid 2112, perform_ca) blocked on sx "allproc" SLOCK (count 1) db> show sleepchain 1499 thread 100080 (pid 1499, initial thread) blocked on sx "allproc" SLOCK (count 1) db> show sleepchain 1482 thread 100105 (pid 1482, initial thread) blocked on sx "allproc" SLOCK (count 1) From kgdb on the vmcore file: (kgdb) proc 23092 [Switching to thread 132 (Thread 100208)]#0 sched_switch (td=0xc60d2d20, newtd=Variable "newtd" is not available. ) at ../../../kern/sched_ule.c:1944 1944 cpuid = PCPU_GET(cpuid); (kgdb) where #0 sched_switch (td=0xc60d2d20, newtd=Variable "newtd" is not available. ) at ../../../kern/sched_ule.c:1944 #1 0xc065d073 in mi_switch (flags=Variable "flags" is not available. ) at ../../../kern/kern_synch.c:440 #2 0xc06872a4 in turnstile_wait (ts=0xc60d46e0, owner=0xc49c4460, queue=0) at ../../../kern/subr_turnstile.c:748 #3 0xc064931e in _mtx_lock_sleep (m=0xc09635dc, tid=3322752288, opts=0, file=0xc086a6ef "../../../vm/vm_map.c", line=1543) at ../../../kern/kern_mutex.c:420 #4 0xc06493a7 in _mtx_lock_flags (m=0xc09635dc, opts=0, file=0xc086a6ef "../../../vm/vm_map.c", line=1543) at ../../../kern/kern_mutex.c:186 #5 0xc07a7010 in vm_map_pmap_enter (map=0xc49eebc8, addr=1210679296, prot=5 '\005', object=0xc6043000, pindex=0, size=204800, flags=16) at ../../../vm/vm_map.c:1543 #6 0xc07a7366 in vm_map_insert (map=0xc49eebc8, object=0xc6043000, offset=0, start=1210679296, end=1210884096, prot=Variable "prot" is not available. ) at ../../../vm/vm_map.c:1072 #7 0xc07a8a22 in vm_map_find (map=0xc49eebc8, object=0xc6043000, offset=0, addr=0xe6fccc6c, length=204800, find_space=1, prot=Variable "prot" is not available. ) at ../../../vm/vm_map.c:1219 #8 0xc07ab2d3 in vm_mmap (map=0xc49eebc8, addr=0xe6fccc6c, size=204800, prot=Variable "prot" is not available. ) at ../../../vm/vm_mmap.c:1411 #9 0xc07abae8 in mmap (td=0xc60d2d20, uap=0xe6fcccfc) at ../../../vm/vm_mmap.c:377 #10 0xc07f8343 in syscall (frame=0xe6fccd38) at ../../../i386/i386/trap.c:1090 ---Type to continue, or q to quit--- #11 0xc07e0340 in Xint0x80_syscall () at ../../../i386/i386/exception.s:255 #12 0x00000033 in ?? () Previous frame inner to this frame (corrupt stack?) (kgdb) frame 4 #4 0xc06493a7 in _mtx_lock_flags (m=0xc09635dc, opts=0, file=0xc086a6ef "../../../vm/vm_map.c", line=1543) at ../../../kern/kern_mutex.c:186 186 _get_sleep_lock(m, curthread, opts, file, line); (kgdb) print *m $3 = {lock_object = {lo_name = 0xc085951a "vm page queue mutex", lo_type = 0xc085951a "vm page queue mutex", lo_flags = 25886720, lo_witness_data = {lod_list = {stqe_next = 0xc091d1a0}, lod_witness = 0xc091d1a0}}, mtx_lock = 4, mtx_recurse = 0}