Date: Tue, 30 Apr 2019 08:33:47 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-stable@freebsd.org
Subject: Re: ZFS...
Message-ID: <f868b452-40e9-f2c8-cdee-dde5e53a214c@denninger.net>
In-Reply-To: <17B373DA-4AFC-4D25-B776-0D0DED98B320@sorbs.net>
References: <30506b3d-64fb-b327-94ae-d9da522f3a48@sorbs.net>
 <CAOtMX2gf3AZr1-QOX_6yYQoqE-H%2B8MjOWc=eK1tcwt5M3dCzdw@mail.gmail.com>
 <56833732-2945-4BD3-95A6-7AF55AB87674@sorbs.net>
 <3d0f6436-f3d7-6fee-ed81-a24d44223f2f@netfence.it>
 <17B373DA-4AFC-4D25-B776-0D0DED98B320@sorbs.net>

On 4/30/2019 03:09, Michelle Sullivan wrote:
> Consider..
>
> If one triggers such a fault on a production server, how can one justify transferring from backup multiple terabytes (or even petabytes now) of data to repair an unmountable/faulted array.... because all backup solutions I know currently would take days if not weeks to restore the sort of store ZFS is touted with supporting.
Had it happen on a production server a few years back with ZFS. The
*hardware* went insane (disk adapter) and scribbled on *all* of the vdevs.
The machine crashed and would not come back up -- at all. I insist on
(and had) emergency boot media physically in the box (a USB key) in any
production machine, and it quickly became obvious that all of the
vdevs were corrupted beyond repair. There was no rational option other
than to restore.
It was definitely not a pleasant experience, but this is why, when you
get into systems and data store sizes where it's a five-alarm pain in
the neck, you must figure out some sort of strategy that covers you 99%
of the time without a large amount of downtime involved, and in the 1%
case accept said downtime. In this particular circumstance the customer
originally didn't want to spend on a doubled and
transaction-level-protected on-site (in the same DC) redundancy setup,
so restore, as opposed to fail-over/promote and then restore and build
a new "redundant" box where the old "primary" resided, was the
most-viable option. Time to recover essential functions was ~8 hours
(and over 24 hours for everything to be restored).
Incidentally that's not the first time I've had a disk adapter failure
on a production machine in my career as a systems dude; it was, in fact,
the *third* such failure. Then again I've been doing this stuff since
the 1980s and learned long ago that if it can break it eventually will,
and that Murphy is a real b******.
The answer to your question, Michelle, is that when restore times get
into "seriously disruptive" amounts of time (e.g. hours, days or worse,
depending on the application involved and how critical it is) you spend
the time and money to have redundancy in multiple places and via paths
that do not destroy the redundant copies when things go wrong, and you
spend the engineering time to figure out what those potential faults are
and how to design such that a fault which can destroy the data set does
not propagate to the redundant copies before it is detected.
--
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
