Date:      Sat, 20 Apr 2019 21:56:45 +0100
From:      Steven Hartland <killing@multiplay.co.uk>
To:        Karl Denninger <karl@denninger.net>, freebsd-stable@freebsd.org
Subject:   Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Message-ID:  <8108da18-2cdd-fa29-983c-3ae7be6be412@multiplay.co.uk>
In-Reply-To: <3f53389a-0cb5-d106-1f64-bbc2123e975c@denninger.net>
References:  <f87f32f2-b8c5-75d3-4105-856d9f4752ef@denninger.net>
             <c96e31ad-6731-332e-5d2d-7be4889716e1@FreeBSD.org>
             <9a96b1b5-9337-fcae-1a2a-69d7bb24a5b3@denninger.net>
             <CACpH0MdLNQ_dqH%2Bto=amJbUuWprx3LYrOLO0rQi7eKw-ZcqWJw@mail.gmail.com>
             <1866e238-e2a1-ef4e-bee5-5a2f14e35b22@denninger.net>
             <3d2ad225-b223-e9db-cce8-8250571b92c9@FreeBSD.org>
             <2bc8a172-6168-5ba9-056c-80455eabc82b@denninger.net>
             <CACpH0MfmPzEO5BO2kFk8-F1hP9TsXEiXbfa1qxcvB8YkvAjWWw@mail.gmail.com>
             <2c23c0de-1802-37be-323e-d390037c6a84@denninger.net>
             <864062ab-f68b-7e63-c3da-539d1e9714f9@denninger.net>
             <6dc1bad1-05b8-2c65-99d3-61c547007dfe@denninger.net>
             <758d5611-c3cf-82dd-220f-a775a57bdd0b@multiplay.co.uk>
             <3f53389a-0cb5-d106-1f64-bbc2123e975c@denninger.net>
Thanks for the extra info.  The next question would be: have you
eliminated the possibility that the corruption already exists before
the disk is removed?

It would be interesting to add a zpool scrub before the disk removal is
attempted, to confirm this isn't the case.

Regards
Steve

On 20/04/2019 18:35, Karl Denninger wrote:
>
> On 4/20/2019 10:50, Steven Hartland wrote:
>> Have you eliminated geli as a possible source?
> No; I could conceivably do so by re-creating another backup volume set
> without geli-encrypting the drives, but I do not have an extra set of
> drives of the capacity required laying around to do that.  I would
> have to do it with lower-capacity disks, which I can attempt if you
> think it would help.  I *do* have open slots in the drive backplane to
> set up a second "test" unit of this sort.  For reasons below it will
> take at least a couple of weeks to get good data on whether the
> problem exists without geli, however.
>>
>> I've just set up an old server which has an LSI 2008 running old FW
>> (11.0) so was going to have a go at reproducing this.
>>
>> Apart from the disconnect steps below is there anything else needed,
>> e.g. a read/write workload during the disconnect?
>
> Yes.  An attempt to recreate this on my sandbox machine using smaller
> disks (WD RE-320s) and a decent amount of read/write activity (tens to
> ~100 gigabytes) on a root mirror of three disks with one taken offline
> did not succeed.  It *reliably* appears, however, on my backup volumes
> with every drive swap.  The sandbox machine is physically identical
> other than the physical disks; both are Xeons with ECC RAM in them.
>
> The only operational difference is that the backup volume sets have a
> *lot* of data written to them via zfs send | zfs recv over the
> intervening period, whereas with "ordinary" I/O activity (which was
> the case on my sandbox) the I/O pattern is materially different.  The
> root pool on the sandbox where I tried to reproduce it synthetically
> *is* using geli (in fact it boots native-encrypted.)
>
> The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is
> a ~6-8 hour process.
>
> The usual process for the backup pool looks like this:
>
> Have 2 of the 3 physical disks mounted; the third is in the bank
> vault.
>
> Over the space of a week, the backup script is run daily.  It first
> imports the pool and then, for each zfs filesystem it is backing up
> (which is not all of them; I have a few volatile ones that I don't
> care if I lose, such as object directories for builds and such, plus
> some that are R/O data sets that are backed up separately), it does:
>
> If there is no "...@zfs-base":  zfs snapshot -r ...@zfs-base; zfs send
> -R ...@zfs-base | zfs receive -Fuvd $BACKUP
>
> else
>
> zfs rename -r ...@zfs-base ...@zfs-old
> zfs snapshot -r ...@zfs-base
>
> zfs send -RI ...@zfs-old ...@zfs-base | zfs recv -Fudv $BACKUP
>
> .... if ok then zfs destroy -vr ...@zfs-old, otherwise print a
> complaint and stop.
>
> When all are complete it then does a "zpool export backup" to detach
> the pool, in order to reduce the risk of "stupid root user" (me)
> accidents.
>
> In short, I send an incremental of the changes since the last backup,
> which in many cases includes a bunch of automatic snapshots that are
> taken on a frequent basis out of cron.  Typically there are a week's
> worth of these that accumulate between swaps of the disk to the vault,
> and the offline'd disk remains that way for a week.
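
To make sure I'm reproducing the same cycle here, a minimal sketch of
that per-filesystem logic as I read it (dataset and pool names below
are placeholders, not your actual script) -- correct me if the real
script differs:

#!/bin/sh
# Sketch only: the per-filesystem snapshot/send/receive cycle described
# above.  fs and BACKUP are placeholders, not the real script's names.
fs="pool/some/dataset"      # hypothetical source filesystem
BACKUP="backup"             # destination pool

if ! zfs list -t snapshot "${fs}@zfs-base" >/dev/null 2>&1; then
        # First run: create the base snapshot and send it in full.
        zfs snapshot -r "${fs}@zfs-base"
        zfs send -R "${fs}@zfs-base" | zfs receive -Fuvd "$BACKUP"
else
        # Later runs: roll the base forward and send the increment.
        zfs rename -r "${fs}@zfs-base" "${fs}@zfs-old"
        zfs snapshot -r "${fs}@zfs-base"
        if zfs send -RI "${fs}@zfs-old" "${fs}@zfs-base" | \
                zfs receive -Fudv "$BACKUP"; then
                zfs destroy -vr "${fs}@zfs-old"
        else
                echo "incremental send of ${fs} failed" >&2
                exit 1
        fi
fi
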
> I also wait for the zpool destroy on each of the targets to drain
> before continuing, as not doing so back in the 9 and 10.x days was a
> good way to stimulate an instant panic on re-import the next day due
> to kernel stack page exhaustion if the previous operation destroyed
> hundreds of gigabytes of snapshots (which does routinely happen, as
> part of the backed-up data is Macrium images from PCs, so when a new
> month comes around the PCs' backup routine removes a huge amount of
> old data from the filesystem.)
>
> Trying to simulate the checksum errors in a few hours' time has thus
> far failed.  But every time I swap the disks on a weekly basis I get a
> handful of checksum errors on the scrub.  If I export and re-import
> the backup mirror after that, the counters are zeroed -- the checksum
> error count does *not* remain across an export/import cycle, although
> the "scrub repaired" line remains.
>
> For example, after the scrub completed this morning I exported the
> pool (the script expects the pool exported before it begins) and ran
> the backup.  When it was complete:
>
> root@NewFS:~/backup-zfs # zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             gpt/backup62-1.eli    ONLINE       0     0     0
>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> It knows it fixed the checksums but the error count is zero -- I did
> NOT "zpool clear".
>
> This may have been present in 11.2; I didn't run that long enough in
> this environment to know.  It definitely was *not* present in 11.1 and
> before; the same data structure and script for backups has been in use
> for a very long time without any changes, and this first appeared when
> I upgraded from 11.1 to 12.0 on this specific machine, with the exact
> same physical disks being used for over a year (they're currently 6Tb
> units; the last change-out for those was ~1.5 years ago when I went
> from 4Tb to 6Tb volumes.)  I have both HGST-NAS and He-Enterprise
> disks in the rotation and both show identical behavior, so it doesn't
> appear to be related to a firmware problem in one disk .vs. the other
> (e.g. firmware that fails to flush the on-drive cache before going to
> standby even though it was told to.)
>
>>
>> mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem
>> 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
>> mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
>>
>> Regards
>> Steve
>>
>> On 20/04/2019 15:39, Karl Denninger wrote:
>>> I can confirm that 20.00.07.00 does *not* stop this.
>>> The previous write/scrub on this device was on 20.00.07.00.  It was
>>> swapped back in from the vault yesterday, resilvered without
>>> incident, but a scrub says....
>>>
>>> root@NewFS:/home/karl # zpool status backup
>>>   pool: backup
>>>  state: DEGRADED
>>> status: One or more devices has experienced an unrecoverable error.  An
>>>         attempt was made to correct the error.  Applications are
>>>         unaffected.
>>> action: Determine if the device needs to be replaced, and clear the errors
>>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>>    see: http://illumos.org/msg/ZFS-8000-9P
>>>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
>>> config:
>>>
>>>         NAME                      STATE     READ WRITE CKSUM
>>>         backup                    DEGRADED     0     0     0
>>>           mirror-0                DEGRADED     0     0     0
>>>             gpt/backup61.eli      ONLINE       0     0     0
>>>             gpt/backup62-1.eli    ONLINE       0     0    47
>>>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>>>
>>> errors: No known data errors
>>>
>>> So this is firmware-invariant (at least between 19.00.00.00 and
>>> 20.00.07.00); the issue persists.
>>>
>>> Again, in my instance these devices are never removed "unsolicited",
>>> so there can't be (or at least shouldn't be able to be) unflushed
>>> data in the device or kernel cache.  The procedure is and remains:
>>>
>>> zpool offline .....
>>> geli detach .....
>>> camcontrol standby ...
>>>
>>> Wait a few seconds for the spindle to spin down.
>>>
>>> Remove disk.
>>>
>>> Then of course on the other side, after insertion and the kernel has
>>> reported "finding" the device:
>>>
>>> geli attach ...
>>> zpool online ....
>>>
>>> Wait...
>>>
>>> If this is a boogered TXG that's held in the metadata for the
>>> "offline"'d device (maybe "off by one"?) that's potentially bad, in
>>> that if there is an unknown failure in the other mirror component the
>>> resilver will complete but data has been irrevocably destroyed.
>>>
>>> Granted, this is a very low-probability scenario (the area where the
>>> bad checksums are has to be where the corruption hits, and it has to
>>> happen between the resilver and access to that data.)  Those are long
>>> odds, but nonetheless a window of "you're hosed" does appear to
>>> exist.
>>>
>>
> --
> Karl Denninger
> karl@denninger.net <mailto:karl@denninger.net>
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
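
For completeness, a minimal sketch of the swap procedure quoted above
with the suggested scrub added before the offline step.  The pool, GPT
label and CAM device names here are illustrative only, and geli attach
would still need the usual key/passphrase handling:

#!/bin/sh
# Sketch only: the removal procedure quoted above, with a scrub run and
# checked first.  Names below are illustrative, not the actual devices.
POOL="backup"
LABEL="gpt/backup62-2"      # hypothetical GELI-backed GPT label
DEV="da4"                   # hypothetical CAM device for that disk

# Scrub and wait for it to finish, so any pre-existing corruption is
# visible before the disk is pulled.
zpool scrub "$POOL"
while zpool status "$POOL" | grep -q "scrub in progress"; do
        sleep 60
done
zpool status "$POOL"        # expect zero CKSUM errors at this point

# The existing removal steps.
zpool offline "$POOL" "${LABEL}.eli"
geli detach "${LABEL}.eli"
camcontrol standby "$DEV"
# ...wait for spin-down, physically swap the disk...

# After the replacement disk is detected by the kernel:
#   geli attach gpt/backup62-1
#   zpool online "$POOL" gpt/backup62-1.eli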