Date: Wed, 20 May 2015 17:43:12 +0200
From: Willem Jan Withagen
To: fs@freebsd.org, Edward Tomasz Napierala
Subject: ZFS / NFS deadlock??? (Was: Re: Unexpected reboot after ctld run into trouble.)
On 16/05/2015 14:25, Willem Jan Withagen wrote:
> Hi,
>
> Found the following in my logs:
> Lots of
> ----
> (0:4:0/0): Task Action: LUN Reset
> (0:4:0/0): CTL Status: Command Completed Successfully
> sonewconn: pcb 0xfffff8004e69e930: Listen queue overflow: 8 already in queue awaiting acceptance (740688 occurrences)
> (0:4:0/0): Task Action: LUN Reset
> (0:4:0/0): CTL Status: Command Completed Successfully
> sonewconn: pcb 0xfffff8004e69e930: Listen queue overflow: 8 already in queue awaiting acceptance (713721 occurrences)
> (0:4:0/0): Task Action: LUN Reset
> (0:4:0/0): CTL Status: Command Completed Successfully
> sonewconn: pcb 0xfffff8004e69e930: Listen queue overflow: 8 already in queue awaiting acceptance (691776 occurrences)
> ----
>
> Which then ends in:
> ----
> panic: deadlkres: possible deadlock detected for 0xfffff8001ee94920, blocked for 1801009 ticks
>
>
> cpuid = 1
> Uptime: 14d13h13m47s
> Dumping 7557 out of 8175 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%Table 'FACP' at 0xcfedbcf8
> ----
> The system is running ZFS with ZFS-on-root:
> FreeBSD zfs.digiware.nl 10.1-STABLE FreeBSD 10.1-STABLE #221 r282282: Fri May 1 06:51:41 CEST 2015
>
> This could stem from the fact that I woke up my Win8 PC, which has an
> iSCSI volume mounted. It is used to store security cam captures and
> sees somewhat heavier traffic.
>
> Suggestions or questions to look at are welcome.
> I do have a core in /var/crash, but will need some guidance to
> retrieve stuff from it.
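As a rough sanity check on the panic message quoted above: deadlkres is FreeBSD's deadlock-resolver kernel thread, which panics when a thread has been stuck for longer than a configured threshold. The sketch below converts the reported tick count to wall-clock time. It assumes kern.hz = 1000 (the usual FreeBSD/amd64 default) and a sleep threshold of 1800 seconds (the assumed default of the debug.deadlkres.slptime_threshold sysctl); neither value appears in the report, so verify both with sysctl on the affected machine. The comparison is only a simplified version of the deadlkres check, which applies further conditions before panicking.

```python
# Sanity-check the numbers in the deadlkres panic message.
# ASSUMPTIONS (not stated in the original report):
#   - kern.hz = 1000 (typical FreeBSD/amd64 default)
#   - debug.deadlkres.slptime_threshold = 1800 seconds (assumed default)

ASSUMED_HZ = 1000                 # kernel clock ticks per second (assumed)
ASSUMED_SLPTIME_THRESHOLD = 1800  # seconds a thread may sleep (assumed)


def ticks_to_seconds(ticks: int, hz: int = ASSUMED_HZ) -> float:
    """Convert kernel clock ticks to seconds."""
    return ticks / hz


def exceeds_sleep_threshold(ticks: int, hz: int = ASSUMED_HZ,
                            threshold_s: int = ASSUMED_SLPTIME_THRESHOLD) -> bool:
    """Simplified deadlkres-style test: elapsed ticks vs. threshold * hz."""
    return ticks > threshold_s * hz


if __name__ == "__main__":
    blocked = 1801009  # from "blocked for 1801009 ticks" in the panic
    secs = ticks_to_seconds(blocked)
    print(f"{blocked} ticks = {secs:.3f} s = {secs / 60:.1f} min")
    print("over threshold:", exceeds_sleep_threshold(blocked))
```

Run as-is, it prints `1801009 ticks = 1801.009 s = 30.0 min` and `over threshold: True`: under these assumptions the thread had been asleep for just over 30 minutes, barely past the threshold, when deadlkres panicked the machine.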
Followup to this story, after some discussion with/debugging by Edward (trasz@):

>> Now, the bad news: I don't think I'll be able to help you with this
>> one. It looks like the problem is actually NFS-related. Using the
>> hex address from the deadlock message in dmesg:
>>
>> % kgdb boot/kernel/kernel vmcore.3
>>
>> (kgdb) p ((struct thread *)0xfffff8001ee94920)->td_proc->p_comm
>> $6 = "nfsd", '\0'
>> (kgdb) p ((struct thread *)0xfffff8001ee94920)->td_wmesg
>> $7 = 0xffffffff80edcfc3 "zfs"
>>
>> So it might actually be a ZFS deadlock the nfsd thread tripped on.

> The panic was triggered by deadlkres; it noticed that there was a
> thread that spent way too much time waiting for something, so,
> presumably, it became "hung" due to a deadlock.
> The 0xfffff8001ee94920 in dmesg is the address of the "struct thread"
> of the problematic thread.
> The first print shows the "command name" (p_comm) of the process the
> thread belongs to. The second print shows the "wait channel" on
> which the thread slept.

So now the questions are:
1) Is this indeed a ZFS / NFS deadlock problem?
2) Who can/will help to get this worked out?

Thanx,
--WjW