Date: Mon, 25 Apr 2016 01:00:45 -0400
From: "Michael B. Eichorn" <ike@michaeleichorn.com>
To: freebsd-fs@freebsd.org
Subject: GELI + Zpool Scrub Results in GELI Device Destruction (and Later a Corrupt Pool)
Message-ID: <1461560445.22294.53.camel@michaeleichorn.com>

I just ran into something rather unexpected. I have a pool consisting of a mirrored pair of GELI-encrypted partitions on WD Red 3TB disks. The machine is running 10.3-RELEASE. The root zpool was set up with GELI encryption from the installer; the pool that is acting up was set up per the handbook.

See the timeline below for what happened. TL;DR: zpool scrub destroyed the eli devices, and my attempt to recreate the eli device earned me a ZFS-8000-8A critical error (corrupted data).

All of the errors reported by zpool status -v are in metadata, not regular files, but as I now have permanent metadata errors I am looking for guidance on:

1) Is it safe to keep running the pool as-is for a day or two, or am I risking data corruption?

2) It would be much faster to copy the data to another pool, then recreate this pool and copy the data back, than to restore from backups. Am I looking at any potential data loss if I do this?

3) What information would be useful to generate for the PR? The error is reproducible, so what should be tried before I nuke the pool?

Thanks,
Ike

-- TIMELINE --

I had just noticed that I had failed to enable the zpool scrub periodic on this machine, so I began to run zpool scrub by hand. It succeeded for the root pool, which is also GELI encrypted, but when I ran it against my primary data pool I encountered:

Apr 24 23:18:23 terra kernel: GEOM_ELI: Device ada3p1.eli destroyed.
Apr 24 23:18:23 terra kernel: GEOM_ELI: Detached ada3p1.eli on last close.
Apr 24 23:18:23 terra kernel: GEOM_ELI: Device ada2p1.eli destroyed.
Apr 24 23:18:23 terra kernel: GEOM_ELI: Detached ada2p1.eli on last close.

And the scrub failed to initialize (the command never returned to the shell).

I then performed a reboot, which succeeded and brought everything up as normal. I then attempted to scrub the pool again. This time I only lost one of the partitions:

Apr 24 23:37:34 terra kernel: GEOM_ELI: Device ada2p1.eli destroyed.
Apr 24 23:37:34 terra kernel: GEOM_ELI: Detached ada2p1.eli on last close.

I then performed a geli attach and zpool online, which onlined the disk that was offline and offlined the disk that was online (EEEK!):

Apr 24 23:38:28 terra kernel: GEOM_ELI: Device ada2p1.eli created.
Apr 24 23:38:28 terra kernel: GEOM_ELI: Encryption: AES-XTS 256
Apr 24 23:38:28 terra kernel: GEOM_ELI: Crypto: hardware
Apr 24 23:41:05 terra kernel: GEOM_ELI: Device ada3p1.eli destroyed.
Apr 24 23:41:05 terra kernel: GEOM_ELI: Detached ada3p1.eli on last close.
Apr 24 23:41:05 terra devd: Executing 'logger -p kern.notice -t ZFS 'vdev state changed, pool_guid=5890893416839487107 vdev_guid=17504861086892353515''
Apr 24 23:41:05 terra ZFS: vdev state changed, pool_guid=5890893416839487107 vdev_guid=17504861086892353515

I immediately rebooted and both disks came back and resilvered, with permanent metadata errors.

-- END TIMELINE --
