Date: Sat, 20 Apr 2019 16:50:38 +0100
From: Steven Hartland <killing@multiplay.co.uk>
To: Karl Denninger <karl@denninger.net>, freebsd-stable@freebsd.org
Subject: Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Message-ID: <758d5611-c3cf-82dd-220f-a775a57bdd0b@multiplay.co.uk>
In-Reply-To: <6dc1bad1-05b8-2c65-99d3-61c547007dfe@denninger.net>
References: <f87f32f2-b8c5-75d3-4105-856d9f4752ef@denninger.net>
 <c96e31ad-6731-332e-5d2d-7be4889716e1@FreeBSD.org>
 <9a96b1b5-9337-fcae-1a2a-69d7bb24a5b3@denninger.net>
 <CACpH0MdLNQ_dqH%2Bto=amJbUuWprx3LYrOLO0rQi7eKw-ZcqWJw@mail.gmail.com>
 <1866e238-e2a1-ef4e-bee5-5a2f14e35b22@denninger.net>
 <3d2ad225-b223-e9db-cce8-8250571b92c9@FreeBSD.org>
 <2bc8a172-6168-5ba9-056c-80455eabc82b@denninger.net>
 <CACpH0MfmPzEO5BO2kFk8-F1hP9TsXEiXbfa1qxcvB8YkvAjWWw@mail.gmail.com>
 <2c23c0de-1802-37be-323e-d390037c6a84@denninger.net>
 <864062ab-f68b-7e63-c3da-539d1e9714f9@denninger.net>
 <6dc1bad1-05b8-2c65-99d3-61c547007dfe@denninger.net>

Have you eliminated geli as a possible source?

I've just set up an old server which has an LSI 2008 running an old FW
(11.0), so I was going to have a go at reproducing this.

Apart from the disconnect steps below, is there anything else needed,
e.g. a read / write workload during the disconnect?

mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

    Regards
    Steve

On 20/04/2019 15:39, Karl Denninger wrote:
> I can confirm that 20.00.07.00 does *not* stop this.
> The previous write/scrub on this device was on 20.00.07.00.  It was
> swapped back in from the vault yesterday, resilvered without incident,
> but a scrub says....
>
> root@NewFS:/home/karl # zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             gpt/backup62-1.eli    ONLINE       0     0    47
>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> So this is firmware-invariant (at least between 19.00.00.00 and
> 20.00.07.00); the issue persists.
>
> Again, in my instance these devices are never removed "unsolicited" so
> there can't be (or at least shouldn't be able to) unflushed data in the
> device or kernel cache.  The procedure is and remains:
>
> zpool offline .....
> geli detach .....
> camcontrol standby ...
>
> Wait a few seconds for the spindle to spin down.
>
> Remove disk.
>
> Then of course on the other side after insertion and the kernel has
> reported "finding" the device:
>
> geli attach ...
> zpool online ....
>
> Wait...
>
> If this is a boogered TXG that's held in the metadata for the
> "offline"'d device (maybe "off by one"?) that's potentially bad in that
> if there is an unknown failure in the other mirror component the
> resilver will complete but data has been irrevocably destroyed.
>
> Granted, this is a very low probability scenario (the area where the bad
> checksums are has to be where the corruption hits, and it has to happen
> between the resilver and access to that data.)  Those are long odds but
> nonetheless a window of "you're hosed" does appear to exist.
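
For anyone else trying to reproduce this, the quoted procedure could be
wrapped in a small script along the following lines. This is only a rough
sketch: the thread elides every command argument, so the pool, vdev, geli
provider, and CAM device are caller-supplied placeholders, the function
names and the DRY_RUN switch are made up for illustration, and the 5 second
pause is just a guess at "a few seconds". Any geli attach options (key
files and so on) are passed through unchanged.

```shell
#!/bin/sh
# Sketch of the offline/remove and insert/online sequence from the thread.
# Every argument is a placeholder supplied by the caller; none come from
# the original message, which elides them.

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# detach_disk POOL VDEV GELI_PROVIDER CAM_DEVICE
# Offline the mirror member, detach geli, spin the disk down, then pause
# before the disk is physically removed.
detach_disk() {
    run zpool offline "$1" "$2" || return 1
    run geli detach "$3" || return 1
    run camcontrol standby "$4" || return 1
    # Assumed pause; the thread only says "a few seconds".
    run sleep 5
}

# reattach_disk POOL VDEV GELI_ATTACH_ARGS...
# Run only after the kernel has reported the newly inserted device.
reattach_disk() {
    pool=$1
    vdev=$2
    shift 2
    run geli attach "$@" || return 1
    run zpool online "$pool" "$vdev"
}
```

Sourcing the file and calling the functions with DRY_RUN=1 prints the
commands without touching the pool, which makes it easy to check the
sequence before running it under a read / write workload.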