Date:      Thu, 26 Feb 2009 15:22:15 -0600
From:      Guy Helmer <ghelmer@palisadesys.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 7.1 hangs in cache_lookup mutex?
Message-ID:  <49A70807.9020600@palisadesys.com>
In-Reply-To: <200902261012.24325.jhb@freebsd.org>
References:  <49A46AB4.3080003@palisadesys.com> <200902250915.57562.jhb@freebsd.org> <49A55D78.1070604@palisadesys.com> <200902261012.24325.jhb@freebsd.org>

John Baldwin wrote:
> On Wednesday 25 February 2009 10:02:16 am Guy Helmer wrote:
>   
>> John Baldwin wrote:
>>     
>>> On Tuesday 24 February 2009 4:46:28 pm Guy Helmer wrote:
>>>   
>>>       
>>>> I think I may have found a clue regarding some of the hangs I'm seeing 
>>>> on FreeBSD 7.1.
>>>> I have a program (kvoop), compiled under FreeBSD 6 and using 
>>>> compatibility libraries under FreeBSD 7, that seems to be consistently 
>>>> involved during these hangs.  This time, I noticed that many processes 
>>>> are stopped, waiting on the ufs lock.  I can't tell for certain, but I 
>>>> assume this kvoop process 33203 is blocking the other processes.  The 
>>>> kvoop process looks to me like it is waiting for a mutex, but nothing 
>>>> shows up being deadlocked.  Am I on the right track?  Is there something 
>>>> else I should be looking for?
>>>>
>>>> (kgdb) proc 33203
>>>> [Switching to thread 129 (Thread 100241)]#0  sched_switch (
>>>>     td=0xffffff0019109a50, newtd=0x0, flags=1)
>>>>     at ../../../kern/sched_ule.c:1944
>>>> 1944                    cpuid = PCPU_GET(cpuid);
>>>> (kgdb) where
>>>> #0  sched_switch (td=0xffffff0019109a50, newtd=0x0, flags=1)
>>>>     at ../../../kern/sched_ule.c:1944
>>>> #1  0xffffffff803a573b in mi_switch (flags=1, newtd=0x0)
>>>>     at ../../../kern/kern_synch.c:440
>>>> #2  0xffffffff803d1214 in turnstile_wait (ts=Variable "ts" is not available.
>>>> )
>>>>     at ../../../kern/subr_turnstile.c:748
>>>> #3  0xffffffff80392db0 in _mtx_lock_sleep (m=0xffffffff8083c220,
>>>>     tid=18446742974618442320, opts=Variable "opts" is not available.
>>>> ) at ../../../kern/kern_mutex.c:420
>>>> #4  0xffffffff80392ea6 in _mtx_lock_flags (m=0xffffffff8083c220, opts=0,
>>>>     file=0xffffffff805bd438 "../../../kern/vfs_cache.c", line=327)
>>>>     at ../../../kern/kern_mutex.c:186
>>>> #5  0xffffffff80403bc6 in cache_lookup (dvp=0xffffff00013e0dc8,
>>>>     vpp=0xffffffffa2d93a18, cnp=0xffffffffa2d93a40)
>>>>     at ../../../kern/vfs_cache.c:327
>>>> #6  0xffffffff80404093 in vfs_cache_lookup (ap=Variable "ap" is not available.
>>>> )
>>>>     at ../../../kern/vfs_cache.c:674
>>>> #7  0xffffffff805628a0 in VOP_LOOKUP_APV (vop=0xffffffff8076e440,
>>>>     a=0xffffffffa2d936b0) at vnode_if.c:99
>>>> #8  0xffffffff80409afd in lookup (ndp=0xffffffffa2d939f0) at vnode_if.h:57
>>>> #9  0xffffffff8040a824 in namei (ndp=0xffffffffa2d939f0)
>>>>     at ../../../kern/vfs_lookup.c:219
>>>> #10 0xffffffff8041dc4f in vn_open_cred (ndp=0xffffffffa2d939f0,
>>>>     flagp=0xffffffffa2d9393c, cmode=0, cred=0xffffff0001588600,
>>>>     fp=0xffffff0001964200) at ../../../kern/vfs_vnops.c:188
>>>> #11 0xffffffff8041ccc7 in kern_open (td=0xffffff0019109a50,
>>>>     path=0xffffac30 <Address 0xffffac30 out of bounds>, pathseg=Variable "pathseg" is not available.
>>>> )
>>>>     at ../../../kern/vfs_syscalls.c:1032
>>>> #12 0xffffffff80544660 in ia32_syscall (frame=0xffffffffa2d93c80)
>>>>     at ../../../amd64/ia32/ia32_syscall.c:182
>>>> #13 0xffffffff805014d0 in Xint0x80_syscall () at ia32_exception.S:65
>>>> #14 0x0000000028284037 in ?? ()
>>>>
>>>> (kgdb) frame 4
>>>> #4  0xffffffff80392ea6 in _mtx_lock_flags (m=0xffffffff8083c220, opts=0,
>>>>     file=0xffffffff805bd438 "../../../kern/vfs_cache.c", line=327)
>>>>     at ../../../kern/kern_mutex.c:186
>>>> 186             _get_sleep_lock(m, curthread, opts, file, line);
>>>> (kgdb) list
>>>> 181                 ("mtx_lock() of spin mutex %s @ %s:%d", m->lock_object.lo_name,
>>>> 182                 file, line));
>>>> 183             WITNESS_CHECKORDER(&m->lock_object, opts | LOP_NEWORDER | LOP_EXCLUSIVE,
>>>> 184                 file, line);
>>>> 185
>>>> 186             _get_sleep_lock(m, curthread, opts, file, line);
>>>> 187             LOCK_LOG_LOCK("LOCK", &m->lock_object, opts, m->mtx_recurse, file,
>>>> 188                 line);
>>>> 189             WITNESS_LOCK(&m->lock_object, opts | LOP_EXCLUSIVE, file, line);
>>>> 190             curthread->td_locks++;
>>>>
>>>> (kgdb) print *m
>>>> $2 = {lock_object = {lo_name = 0xffffffff805bd5d2 "Name Cache",
>>>>     lo_type = 0xffffffff805bd5d2 "Name Cache", lo_flags = 16973824,
>>>>     lo_witness_data = {lod_list = {stqe_next = 0xffffffff807cd600},
>>>>       lod_witness = 0xffffffff807cd600}}, mtx_lock = 4, mtx_recurse = 0}
>>>>     
>>>>         
>>> Is this from a coredump or while the system is live?  mtx_lock = 4 indicates 
>>> the mutex is unlocked, so there shouldn't be any threads waiting on it.
>>>
>>>   
>>>       
>> That was from running kgdb on a vmcore file.  Would the state of the 
>> mutex be different than if I was running kgdb on a live kernel?
>>     
>
> Well, if it was live perhaps the thread wasn't hung but was changing states
> while you were watching it.  However, that is not true in a coredump.  This is
> a very disturbing bug as it means something is broken in the mutex implementation
> itself.  Have you reproduced this on multiple machines?
>
>   
The trace above is from a Supermicro X6DHR-8G motherboard with dual 
3.6GHz Nocona Xeons running the amd64 architecture w/ SCHED_ULE.

Here is a similar situation on a completely different machine running 
the same process load -- a Supermicro P4DPR-8G2 with dual 2.4GHz Xeons -- 
running the i386 architecture w/ SCHED_ULE.  Process 23092 holds the 
"user map" sx exclusively and is waiting on the "vm page queue mutex" 
mutex; a number of other processes are stuck on allproc because a vmstat 
process holds the allproc lock and is itself waiting on the "user map" 
exclusive lock held by proc 23092.  (A rough sketch of the walk that 
"show sleepchain" performs follows the ddb output below.)

First, from ddb:
db> show alllocks
Process 23110 (vmstat) thread 0xc5e3b230 (100181)
shared sx allproc r = 0 (0xc0907edc) locked @ vm/vm_meter.c:130
exclusive sx sysctl lock r = 0 (0xc09083d4) locked @ kern/kern_sysctl.c:1397
Process 23092 (kvoop) thread 0xc60d2d20 (100208)
exclusive sleep mutex vm object (standard object) r = 0 (0xc6043000) locked @ vm/vm_map.c:1476
exclusive sx user map r = 0 (0xc49eec0c) locked @ vm/vm_map.c:1211
Process 23090 (filter) thread 0xc5e2f000 (100199)
shared sx filedesc structure r = 0 (0xc5e4cb2c) locked @ kern/kern_descrip.c:2062
Process 22521 (postgres) thread 0xc60d3000 (100207)
exclusive sx so_rcv_sx r = 0 (0xc6174bec) locked @ kern/uipc_sockbuf.c:148
Process 7257 (tcsh) thread 0xc47db8c0 (100077)
shared sx proctree r = 0 (0xc0907ef4) locked @ kern/kern_fork.c:300
Process 7253 (sshd) thread 0xc60d2690 (100211)
exclusive sx so_rcv_sx r = 0 (0xc4c193cc) locked @ kern/uipc_sockbuf.c:148
Process 4570 (postgres) thread 0xc4ca68c0 (100139)
exclusive sx so_rcv_sx r = 0 (0xc5f77d8c) locked @ kern/uipc_sockbuf.c:148
Process 2225 (postgres) thread 0xc47e1460 (100078)
exclusive sx so_rcv_sx r = 0 (0xc4cca3cc) locked @ kern/uipc_sockbuf.c:148
Process 2128 (postgres) thread 0xc49e6000 (100128)
exclusive sx so_rcv_sx r = 0 (0xc4cca3cc) locked @ kern/uipc_sockbuf.c:148
Process 2128 (postgres) thread 0xc49e6000 (100128)
exclusive sx so_rcv_sx r = 0 (0xc491bd8c) locked @ kern/uipc_sockbuf.c:148
Process 2112 (perform_ca) thread 0xc47dcd20 (100051)
shared sx proctree r = 0 (0xc0907ef4) locked @ kern/kern_fork.c:300
Process 1914 (postgres) thread 0xc49ead20 (100115)
exclusive sx so_rcv_sx r = 0 (0xc491c70c) locked @ kern/uipc_sockbuf.c:148
Process 1499 (ph_alert_mgr) thread 0xc47db460 (100080)
shared sx proctree r = 0 (0xc0907ef4) locked @ kern/kern_fork.c:300
Process 1482 (postgres) thread 0xc49c4d20 (100105)
shared sx proctree r = 0 (0xc0907ef4) locked @ kern/kern_fork.c:300
db> show sleepchain 23110
thread 100181 (pid 23110, vmstat) blocked on sx "user map" XLOCK
thread 100208 (pid 23092, kvoop) is on a run queue
db> show sleepchain 23092
thread 100208 (pid 23092, kvoop) is on a run queue
db> show sleepchain 23090
thread 100199 (pid 23090, filter) running on CPU 3
db> show sleepchain 22521
thread 100207 (pid 22521, postgres) sleeping on 0xc6174c1c "sbwait"
db> show sleepchain 7257
thread 100077 (pid 7257, tcsh) blocked on sx "allproc" SLOCK (count 1)
db> show sleepchain 2112
thread 100051 (pid 2112, perform_ca) blocked on sx "allproc" SLOCK (count 1)
db> show sleepchain 1499
thread 100080 (pid 1499, initial thread) blocked on sx "allproc" SLOCK (count 1)
db> show sleepchain 1482
thread 100105 (pid 1482, initial thread) blocked on sx "allproc" SLOCK (count 1)
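
The sleepchain output above is essentially this walk: follow a blocked
thread to the lock it is waiting for, then to that lock's exclusive
owner, and repeat until you reach a thread that is runnable or running.
A rough userland sketch of the idea -- the structs here are simplified
stand-ins of my own, not the kernel's struct thread or struct sx:

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-ins; not the kernel's lock or thread structures. */
struct lock;
struct thread {
        const char      *t_name;
        int              t_runnable;    /* on a run queue or on CPU */
        struct lock     *t_blocked_on;  /* lock this thread waits for */
};
struct lock {
        const char      *l_name;
        struct thread   *l_owner;       /* exclusive owner, NULL if none */
};

/* Roughly what ddb's "show sleepchain" prints for the dump above. */
static void
show_sleepchain(struct thread *td)
{
        while (!td->t_runnable && td->t_blocked_on != NULL &&
            td->t_blocked_on->l_owner != NULL) {
                printf("thread %s blocked on \"%s\"\n",
                    td->t_name, td->t_blocked_on->l_name);
                td = td->t_blocked_on->l_owner;
        }
        if (td->t_runnable)
                printf("thread %s is on a run queue\n", td->t_name);
}

int
main(void)
{
        struct lock user_map = { "user map", NULL };
        struct thread kvoop = { "kvoop", 1, NULL };     /* runnable */
        struct thread vmstat = { "vmstat", 0, &user_map };

        user_map.l_owner = &kvoop;
        show_sleepchain(&vmstat);       /* vmstat -> user map -> kvoop */
        return (0);
}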

From kgdb on the vmcore file:

(kgdb) proc 23092
[Switching to thread 132 (Thread 100208)]#0  sched_switch (td=0xc60d2d20,
    newtd=Variable "newtd" is not available.
) at ../../../kern/sched_ule.c:1944
1944                    cpuid = PCPU_GET(cpuid);
(kgdb) where
#0  sched_switch (td=0xc60d2d20, newtd=Variable "newtd" is not available.
) at ../../../kern/sched_ule.c:1944
#1  0xc065d073 in mi_switch (flags=Variable "flags" is not available.
) at ../../../kern/kern_synch.c:440
#2  0xc06872a4 in turnstile_wait (ts=0xc60d46e0, owner=0xc49c4460, queue=0)
    at ../../../kern/subr_turnstile.c:748
#3  0xc064931e in _mtx_lock_sleep (m=0xc09635dc, tid=3322752288, opts=0,
    file=0xc086a6ef "../../../vm/vm_map.c", line=1543)
    at ../../../kern/kern_mutex.c:420
#4  0xc06493a7 in _mtx_lock_flags (m=0xc09635dc, opts=0,
    file=0xc086a6ef "../../../vm/vm_map.c", line=1543)
    at ../../../kern/kern_mutex.c:186
#5  0xc07a7010 in vm_map_pmap_enter (map=0xc49eebc8, addr=1210679296,
    prot=5 '\005', object=0xc6043000, pindex=0, size=204800, flags=16)
    at ../../../vm/vm_map.c:1543
#6  0xc07a7366 in vm_map_insert (map=0xc49eebc8, object=0xc6043000, offset=0,
    start=1210679296, end=1210884096, prot=Variable "prot" is not available.
) at ../../../vm/vm_map.c:1072
#7  0xc07a8a22 in vm_map_find (map=0xc49eebc8, object=0xc6043000, offset=0,
    addr=0xe6fccc6c, length=204800, find_space=1, prot=Variable "prot" is not available.
)
    at ../../../vm/vm_map.c:1219
#8  0xc07ab2d3 in vm_mmap (map=0xc49eebc8, addr=0xe6fccc6c, size=204800, prot=Variable "prot" is not available.
)
    at ../../../vm/vm_mmap.c:1411
#9  0xc07abae8 in mmap (td=0xc60d2d20, uap=0xe6fcccfc)
    at ../../../vm/vm_mmap.c:377
#10 0xc07f8343 in syscall (frame=0xe6fccd38) at ../../../i386/i386/trap.c:1090
#11 0xc07e0340 in Xint0x80_syscall () at ../../../i386/i386/exception.s:255
#12 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) frame 4
#4  0xc06493a7 in _mtx_lock_flags (m=0xc09635dc, opts=0,
    file=0xc086a6ef "../../../vm/vm_map.c", line=1543)
    at ../../../kern/kern_mutex.c:186
186             _get_sleep_lock(m, curthread, opts, file, line);
(kgdb) print *m
$3 = {lock_object = {lo_name = 0xc085951a "vm page queue mutex",
    lo_type = 0xc085951a "vm page queue mutex", lo_flags = 25886720,
    lo_witness_data = {lod_list = {stqe_next = 0xc091d1a0},
      lod_witness = 0xc091d1a0}}, mtx_lock = 4, mtx_recurse = 0}
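
For what it's worth, my reading of sys/sys/mutex.h in 7.1 is that
mtx_lock = 4 is the MTX_UNOWNED cookie, so both dumps show a thread
parked in turnstile_wait on a mutex that nobody owns -- the
contradiction John describes.  A minimal userland sketch of that
decoding; the helper and the MTX_FLAGS mask are mine, only the three
flag values come from the header:

#include <stdio.h>

/* Flag values from sys/sys/mutex.h (FreeBSD 7.1). */
#define MTX_RECURSED    0x00000001      /* lock is recursed */
#define MTX_CONTESTED   0x00000002      /* waiters parked on the turnstile */
#define MTX_UNOWNED     0x00000004      /* cookie stored when lock is free */
#define MTX_FLAGS       (MTX_RECURSED | MTX_CONTESTED | MTX_UNOWNED)

/*
 * Decode a raw mtx_lock word as printed by "print *m".  When the lock
 * is owned, the word is the owning thread pointer OR'd with flag bits,
 * so masking the flags off recovers the owner.
 */
static void
decode_mtx_lock(unsigned long w)
{
        if (w == MTX_UNOWNED) {
                printf("unowned: no thread should be asleep on this mutex\n");
                return;
        }
        printf("owner td %#lx%s%s\n", w & ~(unsigned long)MTX_FLAGS,
            (w & MTX_CONTESTED) ? " (contested)" : "",
            (w & MTX_RECURSED) ? " (recursed)" : "");
}

int
main(void)
{
        decode_mtx_lock(4);     /* the value in both dumps */
        return (0);
}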
