Date: Thu, 25 Oct 2018 16:02:31 +0000
From: bugzilla-noreply@freebsd.org
To: geom@FreeBSD.org
Subject: [Bug 232671] [gmirror] gmirror fails to recover from degraded mirror sets in some circumstances
Message-ID: <bug-232671-14739-zS9D7eEtiI@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-232671-14739@https.bugs.freebsd.org/bugzilla/>
References: <bug-232671-14739@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232671

--- Comment #3 from Conrad Meyer <cem@freebsd.org> ---
(In reply to Mark Johnston from comment #2)

Yep, I did this code inspection on CURRENT from yesterday-ish, so that revision
was present.

I'm not sure I want us to flip-flop between STARTING and RUNNING in such a
case; it seems like both (1) we are allowed to remain in STARTING indefinitely
by just returning (as long as we can expect some future event to potentially
transition us to RUNNING), and (2) we have enough information at STARTING time
to know that RUNNING will fail. I.e., I'd like to be slightly more
conservative about when we transition to RUNNING.

As far as a particular code change for the root cause, adding a check for
`if (ndisks == 0) return;` right before the 'if (dirty == 0) {' check seems
like it *might* be sufficient to fix the correctness issue here (although not
the admin-introspection issue(s)). After all, there is no point launching a
gmirror with only broken and synchronizing disks ;-).

Additionally, for administrability I'd like to record some information on the
mirror softc about *why* the state is what it is. (Possibly at least two
formatted string buffers -- why we last transitioned, and why we haven't yet
transitioned to the next logical state. If either is not relevant, "n/a" would
be ok.) That way, when we time out or whatever, that is discoverable (and
ideally printed to console).

It might also make sense to do a similar thing for g_mirror_disks. It'd also
be good to add the gmirror disk id to almost all of these log messages, since
daNN devices can be enumerated in a different order between boots, and that was
super confusing for this sighting.

Certainly adding more test cases would be a good idea along with this
revision, thanks for the pointer.

I can't promise any time to work on this right now, sorry.
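
The two ideas in the comment above (stay in STARTING when there are no usable disks, and record why the state is what it is) can be illustrated with a small userland model. This is only a sketch under stated assumptions: `struct mirror_model`, `try_start()`, the enum names, and the buffer sizes are all hypothetical and are not the real gmirror softc or kernel functions. "Usable" is assumed here to mean a disk that is neither broken nor synchronizing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; NOT the real gmirror definitions. */
enum dev_state { DEV_STARTING, DEV_RUNNING };
enum disk_state { DISK_ACTIVE, DISK_SYNCHRONIZING, DISK_BROKEN };

struct mirror_model {
	enum dev_state state;
	char last_transition[128];	/* why we last changed state */
	char blocked_reason[128];	/* why we have not advanced yet */
};

static void
mirror_model_init(struct mirror_model *m)
{
	assert(m != NULL);
	m->state = DEV_STARTING;
	snprintf(m->last_transition, sizeof(m->last_transition), "n/a");
	snprintf(m->blocked_reason, sizeof(m->blocked_reason), "n/a");
}

/* Count disks that are neither broken nor synchronizing (assumed meaning). */
static size_t
count_usable(const enum disk_state *disks, size_t n)
{
	size_t i, usable = 0;

	for (i = 0; i < n; i++)
		if (disks[i] == DISK_ACTIVE)
			usable++;
	return (usable);
}

/* Try to move from STARTING to RUNNING; stay put if no disk is usable. */
static void
try_start(struct mirror_model *m, const enum disk_state *disks, size_t n)
{
	size_t ndisks;

	assert(m != NULL);
	if (m->state != DEV_STARTING)
		return;

	ndisks = count_usable(disks, n);
	if (ndisks == 0) {
		/* Remain in STARTING and record why, so an admin can see it. */
		snprintf(m->blocked_reason, sizeof(m->blocked_reason),
		    "no usable disks among %zu known", n);
		return;
	}

	m->state = DEV_RUNNING;
	snprintf(m->last_transition, sizeof(m->last_transition),
	    "STARTING -> RUNNING with %zu usable disk(s)", ndisks);
	snprintf(m->blocked_reason, sizeof(m->blocked_reason), "n/a");
}
```

In this model, calling `try_start()` with only broken and synchronizing disks leaves the device in `DEV_STARTING` and fills `blocked_reason`; a later call with at least one active disk moves it to `DEV_RUNNING` and resets `blocked_reason` to "n/a". Whether an early return alone is sufficient in the real driver is exactly the open question raised in the comment.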