Date: Tue, 19 May 2015 12:03:07 -0400
From: Adam McDougall <mcdouga9@egr.msu.edu>
To: freebsd-fs@freebsd.org
Subject: Re: hardware fault during ZFS send/receive blocks /dev/zfs indefinitely
Message-ID: <555B5EBB.20306@egr.msu.edu>
In-Reply-To: <86wq048x8h.fsf@emacs.campese.org>
References: <86wq048x8h.fsf@emacs.campese.org>
(trimmed)

On 05/19/2015 10:20, Simon Campese wrote:
> Hello,
>
> I tried to send/receive a ZFS filesystem from a raidz2-pool to another
> pool with just a single disk, when this disk failed. As a result, now
> both, the zfs send and zfs receive processes are in uninterruptible
> sleep state and all new zpool and zfs commands which I issue immediately
> enter uninterruptible sleep. Is this just bad luck (i.e. my disk failed
> in the wrong moment) or might this be a bug?
>
> Anyway, my only solution is to schedule a reboot soon as the machine is
> a file server and the operational status of zfs is critical.
>
> I'm not very experienced with zfs or the FreeBSD kernel, so I just try
> to supply as much relevant information as possible. Please tell me if
> there is more I can do.
>
> The system I run is FreeBSD 10.1-RELEASE-p6, the machine is a small intel
> file server (eight core Atom, 64G Ram, Supermicro board, two raidz2
> pools connected via reflashed IBM M1015 controllers). Here are the
> relevant lines from "ps ax" (with anonymized pool/filesystem names):
>
> The errors showing up in /var/log/messages when my harddisk went west
> are (excerpt):
>
> May 19 15:00:48 srv0 kernel: ahcich7: Timeout on slot 0 port 0
> May 19 15:00:48 srv0 kernel: ahcich7: is 00000000 cs c000001f ss
> f800001f rs f800001f tfd 40 serr 00000000 cmd 0004dd17
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0):
> WRITE_FPDMA_QUEUED. ACB: 61 0b 8c f3 6a 40 00 00 00 00 00 00
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
> timeout
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
>
> Lines of this form continued for some minutes and after a while, my geli
> volume on this hdd began complaining as well:
>
> May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
> (error=6). label/bkp101.eli[WRITE(offset=3595775488, length=131072)]
>
> Is there any hope for me to resolve this issue without a reboot?
>
> Thanks for your help,
>
> Simon

Can you try using the geli and/or glabel command to force detach
label/bkp101.eli so zfs treats it as a failure?

Also I'm not sure how geli and glabel will treat it but you could try
sysctl kern.cam.ada.retry_count=0 to make the kernel give up on the disk
quicker and the "failure" might cascade up to zfs where it should
hopefully give up on the disk. I think the problem here is ZFS does not
know about the incomplete failures on the lower layers.
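
For concreteness, the two suggestions above might look roughly like the
sketch below. It assumes the provider name from the quoted log
(label/bkp101.eli) and a root shell on the affected host. The run and
recover_plan helpers and the APPLY variable are made up for this
illustration. By default the script only prints the commands; whether
they would return at all on a system in this state is untested.

```shell
#!/bin/sh
# Sketch only: prints the commands unless APPLY=1 is set in the environment.
# PROVIDER is taken from the GEOM_ELI line in the quoted log excerpt.
PROVIDER="label/bkp101.eli"

# Hypothetical helper: execute the command when APPLY=1, otherwise echo it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Hypothetical wrapper around the two steps suggested in the reply.
recover_plan() {
    # Force-detach the geli provider even though it is still open,
    # in the hope that ZFS then sees the vdev as failed.
    run geli detach -f "$PROVIDER"

    # Stop the ada(4) driver from retrying timed-out commands, so errors
    # on the failing disk are reported upward sooner.
    run sysctl kern.cam.ada.retry_count=0
}

recover_plan
```

Running it with APPLY=1 would execute both commands. Reading the current
retry setting first with "sysctl kern.cam.ada.retry_count" makes it easy
to restore once the disk has been replaced.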