From owner-freebsd-stable@freebsd.org Tue Apr 30 14:15:53 2019
Subject: Re: ZFS...
From: Michelle Sullivan <michelle@sorbs.net>
Date: Wed, 01 May 2019 00:15:47 +1000
To: Alan Somers
Cc: Karl Denninger, FreeBSD
Message-id: <9F250929-BAA1-4A18-9025-06F3EC13CD42@sorbs.net>

This issue is definitely related to sudden unexpected loss of power during resilver... not ECC/non-ECC issues.
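For anyone who hits the same failure mode, the usual first-response commands look something like this - "storage" is a placeholder pool name and these are stock zpool(8)/zdb(8) invocations, so treat it as a sketch rather than a transcript of my exact session:

    # Dry run: ask ZFS whether discarding the last few transaction groups
    # would make the pool importable again (-n means nothing is written).
    zpool import -F -n storage

    # Import read-only so nothing further is written while data is copied off.
    zpool import -o readonly=on storage

    # Walk the metaslabs and dump spacemap detail; a corrupt spacemap
    # typically surfaces here as an error on a specific metaslab.
    zdb -mm storage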
Michelle Sullivan
http://www.mhix.org/
Sent from my iPad

> On 01 May 2019, at 00:12, Alan Somers wrote:
>
>> On Tue, Apr 30, 2019 at 8:05 AM Michelle Sullivan wrote:
>>
>> Michelle Sullivan
>> http://www.mhix.org/
>> Sent from my iPad
>>
>>>> On 01 May 2019, at 00:01, Alan Somers wrote:
>>>>
>>>> On Tue, Apr 30, 2019 at 7:30 AM Michelle Sullivan wrote:
>>>>
>>>> Karl Denninger wrote:
>>>>> On 4/30/2019 05:14, Michelle Sullivan wrote:
>>>>>>>> On 30 Apr 2019, at 19:50, Xin LI wrote:
>>>>>>>> On Tue, Apr 30, 2019 at 5:08 PM Michelle Sullivan wrote:
>>>>>>>> but in my recent experience two issues colliding at the same time results in disaster
>>>>>>> Do we know exactly what kind of corruption happened to your pool? If you see it twice in a row, it might suggest a software bug that should be investigated.
>>>>>>>
>>>>>>> All I know is it's a checksum error on a metaslab (122), and from what I can gather it's the spacemap that is corrupt... but I am no expert. I don't believe it's a software fault as such, because this was caused by a hard outage (damaged UPSes) whilst resilvering a single (but completely failed) drive. ...and after the first outage a second occurred (same as the first but more damaging to the power hardware)... the host itself was not damaged, nor were the drives or controller.
>>>>> .....
>>>>>>> Note that ZFS stores multiple copies of its essential metadata, and in my experience with my old, consumer-grade crappy hardware (non-ECC RAM, with several faults, a single hard drive pool: bad enough to crash almost monthly and damage my data from time to time),
>>>>>> This was a top-end consumer-grade motherboard with non-ECC RAM that had been running for 8+ years without fault (except for hard drive platter failures). Uptime would have been years if it weren't for patching.
>>>>> Yuck.
>>>>>
>>>>> I'm sorry, but that may well be what nailed you.
>>>>>
>>>>> ECC is not just about the random cosmic ray.  It also saves your bacon
>>>>> when there are power glitches.
>>>>
>>>> No. Sorry, no. If the data has only half made it to disk, ECC isn't going to save
>>>> you at all... it's all about power on the drives to complete the write.
>>>
>>> ECC RAM isn't about saving the last few seconds' worth of data from
>>> before a power crash.  It's about not corrupting the data that gets
>>> written long before a crash.  If you have non-ECC RAM, then a cosmic
>>> ray/alpha ray/row hammer attack/bad luck can corrupt data after it's
>>> been checksummed but before it gets DMAed to disk.  Then the disk will
>>> contain corrupt data and you won't know it until you try to read it
>>> back.
>>
>> I know this... unless I misread Karl's message, he implied the ECC would have saved the corruption in the crash... which is patently false... I think you'll agree.
>
> I don't think that's what Karl meant.  I think he meant that the
> non-ECC RAM could've caused latent corruption that was only detected
> when the crash forced a reboot and resilver.
>
>> Michelle
>>
>>> -Alan
>>>
>>>>> Unfortunately, however, there is also cache memory on most modern hard
>>>>> drives; most of the time (unless you explicitly shut it off) it's on for
>>>>> write caching, and it'll nail you too.  Oh, and it's never, in my
>>>>> experience, ECC.
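(On the drive write cache point: on FreeBSD it can at least be inspected and turned off for ATA disks. A rough sketch, assuming ada(4) devices - this is the stock tunable documented in ada(4), nothing exotic:)

    # 1 = enable write caching, 0 = disable, -1 = leave the drive's default
    sysctl kern.cam.ada.write_cache

    # Disable it for all ada disks at boot (set in /boot/loader.conf)
    echo 'kern.cam.ada.write_cache="0"' >> /boot/loader.conf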
>>> Fortunately, ZFS never sends non-checksummed data to the hard drive.
>>> So an error in the hard drive's cache RAM will usually get detected by
>>> the ZFS checksum.
>>>
>>>> No comment on that - you're right in the first part; I can't comment on whether
>>>> there are drives with ECC.
>>>>
>>>>> In addition, however - and this is something I learned a LONG time ago
>>>>> (think Z-80 processors!) - as in so many very important things,
>>>>> "two is one and one is none."
>>>>>
>>>>> In other words, without a backup you WILL lose data eventually, and it
>>>>> WILL be important.
>>>>>
>>>>> Raidz2 is very nice, but as the name implies, you have two
>>>>> redundancies.  If you take three errors, or if, God forbid, you *write*
>>>>> a block that has a bad checksum in it because it got scrambled while in
>>>>> RAM, you're dead if that happens in the wrong place.
>>>>
>>>> Or, in my case, you write partial data, therefore invalidating the checksum...
>>>>
>>>>>> Yeah.. unlike UFS, which has to get really, really hosed before you're restoring from backup with nothing recoverable, it seems ZFS can get hosed where issues occur in just the wrong bit... but mostly it is recoverable (and my experience has been some nasty shit that always ended up being recoverable.)
>>>>>>
>>>>>> Michelle
>>>>> Oh, that is definitely NOT true.... again, from hard experience,
>>>>> including (but not limited to) on FreeBSD.
>>>>>
>>>>> My experience is that ZFS is materially more resilient, but there is no
>>>>> such thing as "can never be corrupted by any set of events."
>>>>
>>>> The latter part is true - and my blog and my current situation are not
>>>> limited to or aimed at FreeBSD specifically; FreeBSD is my experience.
>>>> The former part... it has been very resilient, but I think (based on
>>>> this certain set of events) it is easily corruptible and I have just
>>>> been lucky.  You just have to hit a certain write to activate the issue,
>>>> and whilst that write and issue might be very, very difficult (read: hit
>>>> and miss) to hit in normal everyday scenarios, it can and will
>>>> eventually happen.
>>>>
>>>>> Backup
>>>>> strategies for moderately large (e.g. many terabytes) to very large
>>>>> (e.g. petabytes and beyond) pools get quite complex, but they're also very
>>>>> necessary.
>>>>
>>>> And therein lies the problem.  If you don't have a backup solution costing
>>>> many tens of thousands of dollars, you're either:
>>>>
>>>> 1/ down for a looooong time.
>>>> 2/ losing all data and starting again...
>>>>
>>>> ..and that's the problem... with UFS you can recover most (in most
>>>> situations), and provided the *data* is there, uncorrupted by the fault,
>>>> you can get it all off with various tools even if it is a complete
>>>> mess.... here I am with the data that is apparently OK, but the
>>>> metadata is corrupt (and note: as I had stopped writing to the drive
>>>> when it started resilvering, the data - all of it - should be intact...
>>>> even if a mess.)
>>>>
>>>> Michelle
>>>>
>>>> --
>>>> Michelle Sullivan
>>>> http://www.mhix.org/
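P.S. On the backup point: replication doesn't have to be a tens-of-thousands-of-dollars product; plain send/receive over ssh covers a lot of cases. A bare-bones sketch, with "storage", "backuphost" and "backuppool" all placeholders:

    # Seed the remote copy from a full recursive snapshot
    zfs snapshot -r storage@backup-1
    zfs send -R storage@backup-1 | ssh backuphost zfs receive -du backuppool

    # Thereafter, send only the changes since the previous snapshot
    zfs snapshot -r storage@backup-2
    zfs send -R -i storage@backup-1 storage@backup-2 | ssh backuphost zfs receive -du backuppool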