From: John Baldwin
To: Daniel Andersen
Cc: freebsd-fs@freebsd.org
Subject: Re: Process enters unkillable state and somewhat wedges zfs
Date: Fri, 29 Aug 2014 14:24:07 -0400

On Wednesday, August 27, 2014 02:24:47 PM Daniel Andersen wrote:
> On 08/27/2014 01:39 PM, John Baldwin wrote:
> > These are all blocked in "zfs" then.
> > (For future reference, the 'mwchan' field that you see as 'STATE' in
> > top or via 'ps O mwchan' is more detailed than the 'D' state.)
> >
> > To diagnose this further, you would need to see which thread holds the
> > ZFS vnode lock these threads need.  I have some gdb scripts you can use
> > to do that at www.freebsd.org/~jhb/gdb/.  You would want to download
> > 'gdb6*' files from there and then do this as root:
> >
> > # cd /path/to/gdb/files
> > # kgdb
> > (kgdb) source gdb6
> > (kgdb) sleepchain 42335
> >
> > Where '42335' is the pid of some process stuck in "zfs".
>
> I will keep this in mind the next time the machine wedges.  Another data
> point: the second procstat output I sent was the most recent.  All the
> processes listed there were after the fact.  The process that started the
> entire problem ( this time ) was sudo, and it only has this one entry in
> procstat:
>
> 38003 102797 sudo -
>
> Of note, this does not appear to be blocked on zfs in any way.  'ps'
> showed it in 'R' state instead of 'D' ( I will be sure to use mwchan in
> the future. )  It appeared to be pegging an entire CPU core at 100%
> usage, as well, and was only single threaded.

Well, if it is spinning in some sort of loop in the kernel while holding a
ZFS vnode lock, that could be blocking all the other threads.  In that case,
you don't need to do what I asked for above.  Instead, we need to find out
what that thread is doing.  There are two ways of doing this.

One is to force a panic via 'sysctl debug.kdb.panic=1' and then use kgdb on
the crashdump to determine what the running thread is doing.

Another option is to break into the DDB debugger on the console (note that
you will need to build a custom kernel with DDB if you are on stable) and
request a stack trace of the running process via 'tr <pid>'.  Ideally you
can do this over a serial console so you can just cut and paste the output
of the trace into a mail.  Over a video console you can either transcribe
it by hand or take photos.
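
As a rough illustration of the 'mwchan' tip in the quoted text (this is
not from the original thread), the sketch below separates threads that are
blocked on a given wait channel from one that is actually running.  The
function name stuck_on is hypothetical, and the input layout (pid, state,
mwchan, command) assumes a ps that supports the 'mwchan' keyword, as
FreeBSD's does.

```shell
# Hypothetical helper: print the pids whose wait channel equals $1.
# Expects lines of "pid state mwchan command" on stdin, for example
# from: ps -ax -o pid,state,mwchan,comm
stuck_on() {
    # Field 3 is the wait channel; the ps header line never matches
    # because its third field is the column title, not a channel name.
    awk -v chan="$1" '$3 == chan { print $1 }'
}
```

Something like 'ps -ax -o pid,state,mwchan,comm | stuck_on zfs' would then
list the candidates for sleepchain.  A process spinning in the kernel, like
the sudo described above, would presumably not show up here, since it is
running rather than sleeping on a wait channel.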

-- 
John Baldwin