From owner-freebsd-fs@FreeBSD.ORG  Wed Aug 27 21:24:54 2014
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id EE6E32E6;
 Wed, 27 Aug 2014 21:24:54 +0000 (UTC)
Received: from caida.org (rommie.caida.org [192.172.226.78])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id D77E83E94;
 Wed, 27 Aug 2014 21:24:54 +0000 (UTC)
Message-ID: <53FE4C9F.7030406@caida.org>
Date: Wed, 27 Aug 2014 14:24:47 -0700
From: Daniel Andersen <dea@caida.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.0
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>, freebsd-fs@freebsd.org
Subject: Re: Process enters unkillable state and somewhat wedges zfs
References: <53F25402.1020907@caida.org> <53F4E3C0.6000406@FreeBSD.org>
 <53FB716D.1060007@caida.org> <201408271639.09352.jhb@freebsd.org>
In-Reply-To: <201408271639.09352.jhb@freebsd.org>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Aug 2014 21:24:55 -0000

On 08/27/2014 01:39 PM, John Baldwin wrote:
> On Monday, August 25, 2014 1:25:01 pm Daniel Andersen wrote:
>> On 08/20/2014 11:06 AM, Bryan Drewery wrote:
>>> On 8/18/2014 2:29 PM, Daniel Andersen wrote:
>>>> We are currently experiencing a strange problem that sort of locks up one 
> of our zfs pools.  This is on a FreeBSD 10
>>>> machine.  Let me give a rough layout of our system to better describe 
> what is happening:
>>>
>>> When it happens get the output of 'procstat -kka|grep zfs' please.
>>>
>>
>> I'd read something about that while attempting to debug the last few times, 
> so I happen to have that output, I believe.
>> Here are the last two examples ( the latter having had more wedged processes 
> ):
>>
>>     0 100687 kernel           zfs_vn_rele_task mi_switch+0xde 
> sleepq_wait+0x3a _sleep+0x26f taskqueue_thread_loop+0xd5
>> fork_exit+0x9a fork_trampoline+0xe
>>     0 101055 kernel           zfs_vn_rele_task mi_switch+0xde 
> sleepq_wait+0x3a _sleep+0x26f taskqueue_thread_loop+0xd5
>> fork_exit+0x9a fork_trampoline+0xe
>>    38 100306 zfskern          arc_reclaim_thre mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> arc_reclaim_thread+0x302 fork_exit+0x9a fork_trampoline+0xe
>>    38 100307 zfskern          l2arc_feed_threa mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> l2arc_feed_thread+0xb29 fork_exit+0x9a fork_trampoline+0xe
>>    38 100672 zfskern          trim tank        mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> trim_thread+0x9b fork_exit+0x9a fork_trampoline+0xe
>>    38 100688 zfskern          txg_thread_enter mi_switch+0xde 
> sleepq_wait+0x3a _cv_wait+0x16d txg_quiesce_thread+0x30b
>> fork_exit+0x9a fork_trampoline+0xe
>>    38 100689 zfskern          txg_thread_enter mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> txg_sync_thread+0x1dd fork_exit+0x9a fork_trampoline+0xe
>>    38 101054 zfskern          trim work        mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> trim_thread+0x9b fork_exit+0x9a fork_trampoline+0xe
>>    38 101056 zfskern          txg_thread_enter mi_switch+0xde 
> sleepq_wait+0x3a _cv_wait+0x16d txg_quiesce_thread+0x30b
>> fork_exit+0x9a fork_trampoline+0xe
>>    38 101057 zfskern          txg_thread_enter mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> txg_sync_thread+0x1dd fork_exit+0x9a fork_trampoline+0xe
>> 16656 101610 ls               -                mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_root+0x8f lookup+0x7f0 
> namei+0x504 kern_statat_vnhook+0xa5
>> sys_lstat+0x30 amd64_syscall+0x357 Xfast_syscall+0xfb
>>
>>
>> and
>>
>>     0 100688 kernel           zfs_vn_rele_task mi_switch+0xde 
> sleepq_wait+0x3a _sleep+0x26f taskqueue_thread_loop+0xd5
>> fork_exit+0x9a fork_trampoline+0xe
>>     0 101056 kernel           zfs_vn_rele_task mi_switch+0xde 
> sleepq_wait+0x3a _sleep+0x26f taskqueue_thread_loop+0xd5
>> fork_exit+0x9a fork_trampoline+0xe
>>    38 100306 zfskern          arc_reclaim_thre mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> arc_reclaim_thread+0x302 fork_exit+0x9a fork_trampoline+0xe
>>    38 100307 zfskern          l2arc_feed_threa mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> l2arc_feed_thread+0xb29 fork_exit+0x9a fork_trampoline+0xe
>>    38 100673 zfskern          trim tank        mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> trim_thread+0x9b fork_exit+0x9a fork_trampoline+0xe
>>    38 100689 zfskern          txg_thread_enter mi_switch+0xde 
> sleepq_wait+0x3a _cv_wait+0x16d txg_quiesce_thread+0x30b
>> fork_exit+0x9a fork_trampoline+0xe
>>    38 100690 zfskern          txg_thread_enter mi_switch+0xde 
> sleepq_wait+0x3a _cv_wait+0x16d zio_wait+0x5b
>> vdev_uberblock_sync_list+0xad vdev_config_sync+0x118 spa_sync+0x827 
> txg_sync_thread+0x375 fork_exit+0x9a
>> fork_trampoline+0xe
>>    38 101055 zfskern          trim work        mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> trim_thread+0x9b fork_exit+0x9a fork_trampoline+0xe
>>    38 101057 zfskern          txg_thread_enter mi_switch+0xde 
> sleepq_wait+0x3a _cv_wait+0x16d txg_quiesce_thread+0x30b
>> fork_exit+0x9a fork_trampoline+0xe
>>    38 101058 zfskern          txg_thread_enter mi_switch+0xde 
> sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18d
>> txg_sync_thread+0x1dd fork_exit+0x9a fork_trampoline+0xe
>>  2638 101160 nfsd             nfsd: service    mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_vget+0xfe 
> nullfs_vget+0x54 nfsrvd_readdirplus+0x905
>> nfsrvd_dorpc+0x773 nfssvc_program+0x4f6 svc_run_internal+0x1f9 
> svc_thread_start+0xb fork_exit+0x9a fork_trampoline+0xe
>>  3891 101641 collectd         -                mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_root+0x8f lookup+0x7f0 
> namei+0x504 kern_statfs+0x6f
>> sys_statfs+0x2c amd64_syscall+0x357 Xfast_syscall+0xfb
>> 11808 101802 sendsize         -                mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_root+0x8f lookup+0x7f0 
> namei+0x504 kern_statat_vnhook+0xa5
>> sys_stat+0x2d amd64_syscall+0x357 Xfast_syscall+0xfb
>> 13900 101899 sendsize         -                mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_root+0x8f lookup+0x7f0 
> namei+0x504 kern_statat_vnhook+0xa5
>> sys_stat+0x2d amd64_syscall+0x357 Xfast_syscall+0xfb
>> 17961 102536 df               -                mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_root+0x8f lookup+0x7f0 
> namei+0x504 kern_statfs+0x6f
>> sys_statfs+0x2c amd64_syscall+0x357 Xfast_syscall+0xfb
>> 38964 102279 ls               -                mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_root+0x8f lookup+0x7f0 
> namei+0x504 kern_statat_vnhook+0xa5
>> sys_lstat+0x30 amd64_syscall+0x357 Xfast_syscall+0xfb
>> 42315 102539 df               -                mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_root+0x8f lookup+0x7f0 
> namei+0x504 kern_statfs+0x6f
>> sys_statfs+0x2c amd64_syscall+0x357 Xfast_syscall+0xfb
>> 42335 101782 sendsize         -                mi_switch+0xde 
> sleepq_wait+0x3a sleeplk+0x11c __lockmgr_args+0x950
>> vop_stdlock+0x3c VOP_LOCK1_APV+0x9d _vn_lock+0x43 zfs_root+0x8f lookup+0x7f0 
> namei+0x504 kern_statat_vnhook+0xa5
>> sys_stat+0x2d amd64_syscall+0x357 Xfast_syscall+0xfb
> 
> These are all blocked in "zfs" then.  (For future reference, the 'mwchan' 
> field that you see as 'STATE' in top or via 'ps O mwchan' is more detailed 
> than the 'D' state.)
> 
> To diagnose this further, you would need to see which thread holds the
> ZFS vnode lock these threads need.  I have some gdb scripts you can use to
> do that at www.freebsd.org/~jhb/gdb/.  You would want to download 'gdb6*'
> files from there and then do this as root:
> 
> # cd /path/to/gdb/files
> # kgdb
> (kgdb) source gdb6
> (kgdb) sleepchain 42335
> 
> Where '42335' is the pid of some process stuck in "zfs".
> 

I will keep this in mind the next time the machine wedges.  Another data point:
the second procstat output I sent was the most recent.  All the processes listed
there were after the fact.  The process that started the entire problem ( this
time ) was sudo, and it only has this one entry in procstat:

38003 102797 sudo             -                <running>

Of note, this does not appear to be blocked on zfs in anyway.  'ps' showed it
in 'R' state instead of 'D' ( I will be sure to use mwchan in the future. )
It appeared to be pegging an entire CPU core at 100% usage, as well, and was
only single threaded.

Processes that try to access /tank after that wedge as shown above, while
processes that access the same filesystem through the /data/ nullfs mount
do not.  nfsd appears to be an exception, as we only export nfs mounts
via a /tank/foo -> /export/foo nullfs mount.  It's possible nfsd is doing
something else there, though, I guess.  :/

Many thanks,
Dan