Date: Sun, 05 Oct 2014 17:50:40 +0200
From: InterNetX - Juergen Gotteswinter <jg@internetx.com>
To: Dmitry Morozovsky <marck@rinet.ru>, Mikolaj Golub <to.my.trociny@gmail.com>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>, Matt Churchyard <matt.churchyard@userve.net>
Subject: Re: HAST with broken HDD
Message-ID: <543168D0.2000705@internetx.com>
In-Reply-To: <alpine.BSF.2.00.1410051846480.72273@woozle.rinet.ru>
References: <542BC135.1070906@Skynet.be> <542BDDB3.8080805@internetx.com>
 <CA%2BdUSypO8xTR3sh_KSL9c9FLxbGH%2BbTR9-gPdcCVd%2Bt0UgUF-g@mail.gmail.com>
 <542BF853.3040604@internetx.com>
 <CA%2BdUSyp4vMB_qUeqHgXNz2FiQbWzh8MjOEFYw%2BURcN4gUq69nw@mail.gmail.com>
 <542C019E.2080702@internetx.com>
 <CA%2BdUSyoEcPdJ1hdR3k1vNROFG7p1kN0HB5S2a_0gYhiV75OLAw@mail.gmail.com>
 <542C0710.3020402@internetx.com>
 <CA%2BdUSyr9OK9SvN3wX-O4DeriLBP-EEuAA8TTSYwdGfcR1asdtQ@mail.gmail.com>
 <97aab72e19d640ebb65c754c858043cc@SERVER.ad.usd-group.com>
 <20141003175439.GA7664@gmail.com>
 <alpine.BSF.2.00.1410051846480.72273@woozle.rinet.ru>

On 05.10.2014 at 16:50, Dmitry Morozovsky wrote:
> On Fri, 3 Oct 2014, Mikolaj Golub wrote:
>
>> Disk errors are recorded to syslog. Error counters are also displayed
>> in `hastctl list' output. There is snmp_hast(3) in base -- a module
>> for bsnmp to retrieve these statistics via the SNMP protocol (traps are
>> not supported, though).
>>
>> For notifications, hastd can be configured to execute an arbitrary
>> command on various HAST events (see the description of `exec' in
>> hast.conf(5)). Unfortunately, it does not currently have hooks for I/O
>> error events. It might be worth adding them, though. The problem is
>> that this may generate too many events, so some throttling is
>> needed.
>
> And, I think, it should be noted that some kind of error coalescing or
> similar is needed before going from the "warning" state (there are some
> read errors, but otherwise the disk is usable, and it would be too much
> hassle to switch to the remote component completely) to the "error" state
> (the component is unusable and needs to be replaced ASAP; drop it from
> the HAST pair, and switch over if needed).
>
> An error such as "device lost" is, of course, fatal from the very
> beginning; but -- how should we interpret, say, sporadic controller
> resets with the disk coming back and catching up on syncing again?
>

Hi Dmitry,

since HAST is not so different from DRBD, why not take their approach to
error handling as a template? DRBD has worked well and been rock solid
for years; it is a well-established solution. HAST has the potential to
become that as well, with some improvements.

Just my 2 cents :)
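
To make the "warning" versus "error" distinction discussed above a bit more concrete, here is a minimal sketch of error coalescing with throttled notifications. It is only an illustration: the class name `ErrorCoalescer`, its parameters, and the thresholds are hypothetical, and it does not reflect how hastd or DRBD actually handle I/O errors. The assumption is that non-fatal errors are counted in a sliding time window, that a fatal error such as "device lost" escalates immediately, and that an event is reported only when the state changes.

```python
from collections import deque


class ErrorCoalescer:
    """Hypothetical helper, not part of hastd: coalesces I/O errors into states."""

    def __init__(self, window=60.0, error_threshold=5):
        self.window = window                    # sliding window length in seconds (assumed)
        self.error_threshold = error_threshold  # errors within the window that mean "error"
        self.events = deque()                   # timestamps of recent non-fatal errors
        self.state = "ok"

    def record(self, now, fatal=False):
        """Record one I/O error at time `now`.

        Returns the new state if it changed, otherwise None, so a caller
        only sends a notification on a state transition (throttling).
        """
        if self.state == "error":
            return None  # "error" is sticky until the component is replaced

        if fatal:
            # e.g. "device lost": fatal from the very beginning
            new_state = "error"
        else:
            self.events.append(now)
            # forget errors that fell out of the sliding window
            while self.events and now - self.events[0] > self.window:
                self.events.popleft()
            if len(self.events) >= self.error_threshold:
                new_state = "error"
            else:
                new_state = "warning"

        if new_state != self.state:
            self.state = new_state
            return new_state
        return None
```

With such a scheme, sporadic errors spread far apart in time would keep the component in "warning", while a burst of errors inside one window would escalate it to "error". How sporadic controller resets should be weighted is exactly the open question raised above, and the sketch does not answer it.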