Date: Mon, 18 Feb 2013 12:01:21 -0800 From: John-Mark Gurney <jmg@funkthat.com> To: freebsd-fs@FreeBSD.org Subject: Re: ZFS on 9.1 doesn't see errors on geli volumes... Message-ID: <20130218200121.GJ55866@funkthat.com> In-Reply-To: <20130218191242.GI55866@funkthat.com> References: <20130218191242.GI55866@funkthat.com>
next in thread | previous in thread | raw e-mail | index | archive | help
John-Mark Gurney wrote this message on Mon, Feb 18, 2013 at 11:12 -0800: > I'm running 9.1: > FreeBSD gold.funkthat.com 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #26 r241041M: Wed Dec 12 23:02:31 PST 2012 jmg@gold.funkthat.com:/usr/src.9stable/sys/amd64/compile/gold amd64 > > The modifications are limited to improving AES-NI performance. > > On a box, and decided to go full ZFS w/ geli encrypted volumes (including > root fs)... One of the hard drives started going bad, so I started > seeing: > hpt27xx: Device error information 0x1000000 > hpt27xx: Task file error, StatusReg=0x51, ErrReg=0x40, LBA[0-3]=0xf495e928,LBA[4-7]=0x0. > (da3:hpt27xx0:0:3:0): READ(10). CDB: 28 0 f4 95 e8 f8 0 0 80 0 > (da3:hpt27xx0:0:3:0): CAM status: Auto-Sense Retrieval Failed > (da3:hpt27xx0:0:3:0): Error 5, Unretryable error > GEOM_ELI: g_eli_read_done() failed label/toby.eli[READ(offset=2100974186496, length=90112)] > > and: > (da3:hpt27xx0:0:3:0): WRITE(10). CDB: 2a 0 ef cc 10 90 0 0 8 0 > (da3:hpt27xx0:0:3:0): CAM status: Auto-Sense Retrieval Failed > (da3:hpt27xx0:0:3:0): Error 5, Unretryable error > GEOM_ELI: Crypto WRITE request failed (error=5). label/toby.eli[WRITE(offset=2059841654784, length=4096)] > > So we can see that geli is failing, but zpool status command doesn't show > any errors at all... The READ and WRITE columns both show 0 for the device.. > > Now I do know that the WRITEs are not retried, because if I do a scrub > afterward, it detects cksum errors, and does properly increases the > count in the CKSUM column... > > Now if I pull a device, it will see that the device is lost, but no > matter how many read or write errors get returned by geli, zfs doesn't > seem to count them... > > Has anyone else seen this w/ ZFS? Is it possible that it's a problem w/ > geli, and not ZFS? > > I haven't tried to run a test w/ gnop to fail some read/writes on -current.. > > P.S. Please keep me cc'd, as I'm not on the list. Well, after some digging w/ help from smh@, it looks like the write case in geli is broken... in g_eli_write_done, we have the code: if (pbp->bio_error != 0) { G_ELI_LOGREQ(0, pbp, "Crypto WRITE request failed (error=%d).", pbp->bio_error); pbp->bio_completed = 0; } /* * Write is finished, send it up. */ pbp->bio_completed = pbp->bio_length; sc = pbp->bio_to->geom->softc; g_io_deliver(pbp, pbp->bio_error); atomic_subtract_int(&sc->sc_inflight, 1); so, we just end up overwriting the bio_completed error... pjd, should we just put the bio_completed = line under an else? something like: if (pbp->bio_error != 0) { G_ELI_LOGREQ(0, pbp, "Crypto WRITE request failed (error=%d).", pbp->bio_error); pbp->bio_completed = 0; } else pbp->bio_completed = pbp->bio_length; /* Write is finished, send it up. */ g_io_deliver(pbp, pbp->bio_error); sc = pbp->bio_to->geom->softc; atomic_subtract_int(&sc->sc_inflight, 1); But doesn't explain why read's aren't being counted though... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not."
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130218200121.GJ55866>