Date: Sat, 20 Apr 2019 16:26:01 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-stable@freebsd.org
Subject: Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Message-ID: <e90494e9-9d6d-ce19-05db-3ebb06d00766@denninger.net>
In-Reply-To: <8108da18-2cdd-fa29-983c-3ae7be6be412@multiplay.co.uk>
References: <f87f32f2-b8c5-75d3-4105-856d9f4752ef@denninger.net>
 <c96e31ad-6731-332e-5d2d-7be4889716e1@FreeBSD.org>
 <9a96b1b5-9337-fcae-1a2a-69d7bb24a5b3@denninger.net>
 <CACpH0MdLNQ_dqH%2Bto=amJbUuWprx3LYrOLO0rQi7eKw-ZcqWJw@mail.gmail.com>
 <1866e238-e2a1-ef4e-bee5-5a2f14e35b22@denninger.net>
 <3d2ad225-b223-e9db-cce8-8250571b92c9@FreeBSD.org>
 <2bc8a172-6168-5ba9-056c-80455eabc82b@denninger.net>
 <CACpH0MfmPzEO5BO2kFk8-F1hP9TsXEiXbfa1qxcvB8YkvAjWWw@mail.gmail.com>
 <2c23c0de-1802-37be-323e-d390037c6a84@denninger.net>
 <864062ab-f68b-7e63-c3da-539d1e9714f9@denninger.net>
 <6dc1bad1-05b8-2c65-99d3-61c547007dfe@denninger.net>
 <758d5611-c3cf-82dd-220f-a775a57bdd0b@multiplay.co.uk>
 <3f53389a-0cb5-d106-1f64-bbc2123e975c@denninger.net>
 <8108da18-2cdd-fa29-983c-3ae7be6be412@multiplay.co.uk>
[-- Attachment #1 --]
No; I can, but of course that's another ~8 hour (overnight) delay
between swaps.  That's not a bad idea however....

On 4/20/2019 15:56, Steven Hartland wrote:
> Thanks for extra info, the next question would be: have you eliminated
> the possibility that corruption exists before the disk is removed?
>
> Would be interesting to add a zpool scrub to confirm this isn't the
> case before the disk removal is attempted.
>
> Regards
> Steve
>
> On 20/04/2019 18:35, Karl Denninger wrote:
>>
>> On 4/20/2019 10:50, Steven Hartland wrote:
>>> Have you eliminated geli as a possible source?
>> No; I could conceivably do so by re-creating another backup volume
>> set without geli-encrypting the drives, but I do not have an extra
>> set of drives of the capacity required lying around to do that.  I
>> would have to do it with lower-capacity disks, which I can attempt if
>> you think it would help.  I *do* have open slots in the drive
>> backplane to set up a second "test" unit of this sort.  For reasons
>> below it will take at least a couple of weeks to get good data on
>> whether the problem exists without geli, however.
>>>
>>> I've just set up an old server which has a LSI 2008 running an old
>>> FW (11.0) so was going to have a go at reproducing this.
>>>
>>> Apart from the disconnect steps below is there anything else needed
>>> e.g. read / write workload during disconnect?
>>
>> Yes.  An attempt to recreate this on my sandbox machine using smaller
>> disks (WD RE-320s) and a decent amount of read/write activity (tens
>> to ~100 gigabytes) on a root mirror of three disks with one taken
>> offline did not succeed.  It *reliably* appears, however, on my
>> backup volumes with every drive swap.  The sandbox machine is
>> physically identical other than the physical disks; both are Xeons
>> with ECC RAM in them.
>>
>> The only operational difference is that the backup volume sets have a
>> *lot* of data written to them via zfs send|zfs recv over the
>> intervening period where with "ordinary" activity from I/O (which was
>> the case on my sandbox) the I/O pattern is materially different.  The
>> root pool on the sandbox where I tried to reproduce it synthetically
>> *is* using geli (in fact it boots native-encrypted.)
>>
>> The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is
>> a ~6-8 hour process.
>>
>> The usual process for the backup pool looks like this:
>>
>> Have 2 of the 3 physical disks mounted; the third is in the bank vault.
>>
>> Over the space of a week, the backup script is run daily.  It first
>> imports the pool and then for each zfs filesystem it is backing up
>> (which is not all of them; I have a few volatile ones that I don't
>> care if I lose, such as object directories for builds and such, plus
>> some that are R/O data sets that are backed up separately) it does:
>>
>> If there is no "...@zfs-base": zfs snapshot -r ...@zfs-base; zfs send
>> -R ...@zfs-base | zfs receive -Fuvd $BACKUP
>>
>> else
>>
>> zfs rename -r ...@zfs-base ...@zfs-old
>> zfs snapshot -r ...@zfs-base
>>
>> zfs send -RI ...@zfs-old ...@zfs-base |zfs recv -Fudv $BACKUP
>>
>> .... if ok then zfs destroy -vr ...@zfs-old otherwise print a
>> complaint and stop.
>>
>> When all are complete it then does a "zpool export backup" to detach
>> the pool in order to reduce the risk of "stupid root user" (me)
>> accidents.
>>
>> In short I send an incremental of the changes since the last backup,
>> which in many cases includes a bunch of automatic snapshots that are
>> taken on a frequent basis out of the cron.  Typically there are a week's
>> worth of these that accumulate between swaps of the disk to the
>> vault, and the offline'd disk remains that way for a week.  I also
>> wait for the zfs destroy on each of the targets to drain before
>> continuing, as not doing so back in the 9 and 10.x days was a good
>> way to stimulate an instant panic on re-import the next day due to
>> kernel stack page exhaustion if the previous operation destroyed
>> hundreds of gigabytes of snapshots (which does routinely happen as
>> part of the backed up data is Macrium images from PCs, so when a new
>> month comes around the PC's backup routine removes a huge amount of
>> old data from the filesystem.)
>>
>> Trying to simulate the checksum errors in a few hours' time thus far
>> has failed.  But every time I swap the disks on a weekly basis I get
>> a handful of checksum errors on the scrub.  If I export and re-import
>> the backup mirror after that the counters are zeroed -- the checksum
>> error count does *not* remain across an export/import cycle although
>> the "scrub repaired" line remains.
>>
>> For example after the scrub completed this morning I exported the
>> pool (the script expects the pool exported before it begins) and ran
>> the backup.  When it was complete:
>>
>> root@NewFS:~/backup-zfs # zpool status backup
>>   pool: backup
>>  state: DEGRADED
>> status: One or more devices has been taken offline by the administrator.
>>         Sufficient replicas exist for the pool to continue functioning in a
>>         degraded state.
>> action: Online the device using 'zpool online' or replace the device with
>>         'zpool replace'.
>>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
>> config:
>>
>>         NAME                      STATE     READ WRITE CKSUM
>>         backup                    DEGRADED     0     0     0
>>           mirror-0                DEGRADED     0     0     0
>>             gpt/backup61.eli      ONLINE       0     0     0
>>             gpt/backup62-1.eli    ONLINE       0     0     0
>>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>>
>> errors: No known data errors
>>
>> It knows it fixed the checksums but the error count is zero -- I did
>> NOT "zpool clear".
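The per-filesystem backup step described above can be sketched as a small
shell function.  This is only an illustration under stated assumptions, not
the actual backup script: run_cmd and backup_fs are hypothetical helper
names, the "does @zfs-base exist" answer is passed in as an argument rather
than probed, and the sketch defaults to a dry run that prints the zfs
commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are assumptions,
# not part of the original script.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

# run_cmd: print the command line in dry-run mode, otherwise evaluate it.
# The command is passed as one string so that pipelines survive intact;
# dataset names containing whitespace are not handled.
run_cmd() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        eval "$*"
    fi
}

# backup_fs FS POOL HAS_BASE
#   FS        source filesystem
#   POOL      backup pool ($BACKUP in the description above)
#   HAS_BASE  "yes" if FS@zfs-base already exists, anything else if not
backup_fs() {
    fs="$1"; pool="$2"; has_base="$3"
    if [ "$has_base" != "yes" ]; then
        # First run: create the base snapshot and send a full stream.
        run_cmd "zfs snapshot -r ${fs}@zfs-base" &&
        run_cmd "zfs send -R ${fs}@zfs-base | zfs receive -Fuvd ${pool}"
    else
        # Later runs: rotate base to old, re-snapshot, send the increment,
        # and drop the old snapshot only if every earlier step succeeded.
        run_cmd "zfs rename -r ${fs}@zfs-base ${fs}@zfs-old" &&
        run_cmd "zfs snapshot -r ${fs}@zfs-base" &&
        run_cmd "zfs send -RI ${fs}@zfs-old ${fs}@zfs-base | zfs recv -Fudv ${pool}" &&
        run_cmd "zfs destroy -vr ${fs}@zfs-old"
    fi || {
        # "otherwise print a complaint and stop"
        echo "backup of ${fs} failed; stopping" >&2
        return 1
    }
}
```

For example, "backup_fs zroot/home backup yes" (zroot/home being a made-up
dataset name) prints the rename, snapshot, send/recv and destroy commands in
that order.  In a real run the HAS_BASE argument could be derived from
whether "zfs list -t snapshot FS@zfs-base" succeeds.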
>>
>> This may have been present in 11.2; I didn't run that long enough in
>> this environment to know.  It definitely was *not* present in 11.1
>> and before; the same data structure and script for backups has been
>> in use for a very long time without any changes and this first
>> appeared when I upgraded from 11.1 to 12.0 on this specific machine,
>> with the exact same physical disks being used for over a year
>> (they're currently 6Tb units; the last change out for those was ~1.5
>> years ago when I went from 4Tb to 6Tb volumes.)  I have both HGST-NAS
>> and He-Enterprise disks in the rotation and both show identical
>> behavior so it doesn't appear to be related to a firmware problem in
>> one disk .vs. the other (e.g. firmware that fails to flush the
>> on-drive cache before going to standby even though it was told to.)
>>
>>>
>>> mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
>>> mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
>>>
>>> Regards
>>> Steve
>>>
>>> On 20/04/2019 15:39, Karl Denninger wrote:
>>>> I can confirm that 20.00.07.00 does *not* stop this.
>>>> The previous write/scrub on this device was on 20.00.07.00.  It was
>>>> swapped back in from the vault yesterday, resilvered without incident,
>>>> but a scrub says....
>>>>
>>>> root@NewFS:/home/karl # zpool status backup
>>>>   pool: backup
>>>>  state: DEGRADED
>>>> status: One or more devices has experienced an unrecoverable error.  An
>>>>         attempt was made to correct the error.  Applications are unaffected.
>>>> action: Determine if the device needs to be replaced, and clear the errors
>>>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>>>    see: http://illumos.org/msg/ZFS-8000-9P
>>>>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
>>>> config:
>>>>
>>>>         NAME                      STATE     READ WRITE CKSUM
>>>>         backup                    DEGRADED     0     0     0
>>>>           mirror-0                DEGRADED     0     0     0
>>>>             gpt/backup61.eli      ONLINE       0     0     0
>>>>             gpt/backup62-1.eli    ONLINE       0     0    47
>>>>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>>>>
>>>> errors: No known data errors
>>>>
>>>> So this is firmware-invariant (at least between 19.00.00.00 and
>>>> 20.00.07.00); the issue persists.
>>>>
>>>> Again, in my instance these devices are never removed "unsolicited" so
>>>> there can't be (or at least shouldn't be able to) unflushed data in the
>>>> device or kernel cache.  The procedure is and remains:
>>>>
>>>> zpool offline .....
>>>> geli detach .....
>>>> camcontrol standby ...
>>>>
>>>> Wait a few seconds for the spindle to spin down.
>>>>
>>>> Remove disk.
>>>>
>>>> Then of course on the other side after insertion and the kernel has
>>>> reported "finding" the device:
>>>>
>>>> geli attach ...
>>>> zpool online ....
>>>>
>>>> Wait...
>>>>
>>>> If this is a boogered TXG that's held in the metadata for the
>>>> "offline"'d device (maybe "off by one"?) that's potentially bad in that
>>>> if there is an unknown failure in the other mirror component the
>>>> resilver will complete but data has been irrevocably destroyed.
>>>>
>>>> Granted, this is a very low probability scenario (the area where the bad
>>>> checksums are has to be where the corruption hits, and it has to happen
>>>> between the resilver and access to that data.)  Those are long odds but
>>>> nonetheless a window of "you're hosed" does appear to exist.
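The swap procedure quoted above, combined with the suggestion made in this
thread to scrub before the disk is removed, might be sketched as follows.
This is an assumption-laden illustration rather than the procedure actually
in use: run_cmd, wait_for_scrub, swap_out and swap_in are hypothetical
names, the pool, provider and disk are passed as arguments because the
original elides them, any geli key or passphrase options are left out, and
the sketch defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

# run_cmd: print the command in dry-run mode, otherwise run it.
run_cmd() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# wait_for_scrub POOL: poll "zpool status" until no scrub is in progress.
wait_for_scrub() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "# wait for scrub on $1 to finish"
        return 0
    fi
    while zpool status "$1" | grep -q "scrub in progress"; do
        sleep 60
    done
}

# swap_out POOL PROVIDER DISK
# Scrub first so that any checksum errors are known before removal, show
# the pool status for review, then offline, detach and spin down the disk.
swap_out() {
    pool="$1"; prov="$2"; disk="$3"
    run_cmd zpool scrub "$pool" &&
    wait_for_scrub "$pool" &&
    run_cmd zpool status -x "$pool" &&
    run_cmd zpool offline "$pool" "$prov" &&
    run_cmd geli detach "$prov" &&
    run_cmd camcontrol standby "$disk"
}

# swap_in POOL PROVIDER
# Run after the kernel has reported the newly inserted device.
swap_in() {
    run_cmd geli attach "$2" &&
    run_cmd zpool online "$1" "$2"
}
```

A dry run such as "swap_out backup gpt/backup62-1.eli da3" (da3 being a
made-up device name) prints the six steps in order.  Comparing the CKSUM
column before removal and again after the post-resilver scrub would show
whether the errors already exist before the disk leaves the machine.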
>>>>
>>>
>> --
>> Karl Denninger
>> karl@denninger.net
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"

--
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?e90494e9-9d6d-ce19-05db-3ebb06d00766>
