From owner-freebsd-fs@FreeBSD.ORG Tue May 19 16:03:10 2015
Message-ID: <555B5EBB.20306@egr.msu.edu>
Date: Tue, 19 May 2015 12:03:07 -0400
From: Adam McDougall
To: freebsd-fs@freebsd.org
Subject: Re: hardware fault during ZFS send/receive blocks /dev/zfs indefinitely
References: <86wq048x8h.fsf@emacs.campese.org>
In-Reply-To: <86wq048x8h.fsf@emacs.campese.org>
(trimmed)

On 05/19/2015 10:20, Simon Campese wrote:
> Hello,
>
> I was sending/receiving a ZFS filesystem from a raidz2 pool to
> another pool consisting of a single disk when that disk failed. As a
> result, both the zfs send and the zfs receive process are now in
> uninterruptible sleep, and any new zpool or zfs command I issue
> immediately enters uninterruptible sleep as well. Is this just bad
> luck (i.e. my disk failed at the wrong moment) or might this be a
> bug?
>
> Anyway, my only remaining option is to schedule a reboot soon, as
> the machine is a file server and the operational status of ZFS is
> critical.
>
> I'm not very experienced with ZFS or the FreeBSD kernel, so I will
> just try to supply as much relevant information as possible. Please
> tell me if there is more I can do.
>
> The system runs FreeBSD 10.1-RELEASE-p6; the machine is a small
> Intel file server (eight-core Atom, 64G RAM, Supermicro board, two
> raidz2 pools connected via reflashed IBM M1015 controllers). Here
> are the relevant lines from "ps ax" (with anonymized pool/filesystem
> names):
>
> The errors showing up in /var/log/messages when my hard disk went
> west are (excerpt):
>
> May 19 15:00:48 srv0 kernel: ahcich7: Timeout on slot 0 port 0
> May 19 15:00:48 srv0 kernel: ahcich7: is 00000000 cs c000001f ss
> f800001f rs f800001f tfd 40 serr 00000000 cmd 0004dd17
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0):
> WRITE_FPDMA_QUEUED. ACB: 61 0b 8c f3 6a 40 00 00 00 00 00 00
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status:
> Command timeout
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
>
> Lines of this form continued for some minutes, and after a while my
> geli volume on this disk began complaining as well:
>
> May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
> (error=6).
> label/bkp101.eli[WRITE(offset=3595775488, length=131072)]
>
> Is there any hope for me to resolve this issue without a reboot?
>
> Thanks for your help,
>
> Simon

Can you try using the geli and/or glabel command to force detach
label/bkp101.eli so that ZFS treats it as a failure? Also, I'm not
sure how geli and glabel will react, but you could try setting
kern.cam.ada.retry_count=0 with sysctl to make the kernel give up on
the disk more quickly; the "failure" might then cascade up to ZFS,
which should hopefully give up on the disk as well. I think the
problem here is that ZFS does not know about the incomplete failures
in the lower layers.
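For the force detach, something along these lines might do it (a
sketch only, untested against a hung pool here; bkp101 is the label
name taken from your log excerpt):

  # -f forces the detach even though the provider is still open/busy
  geli detach -f label/bkp101.eli

If that succeeds, the vdev disappears and ZFS should fault it instead
of waiting forever on the hung writes.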
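If the geli detach itself hangs, you could also try stopping the
glabel underneath it (again untested; stop only removes the device
node and does not touch the on-disk label metadata):

  # -f forces removal of the label's device node
  glabel stop -f bkp101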
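The sysctl change is just:

  # retire failed ATA commands immediately instead of retrying them
  sysctl kern.cam.ada.retry_count=0

Check the current value first with "sysctl kern.cam.ada.retry_count"
so you can restore it once the dead disk is out of the picture.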