From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 232671] [gmirror] gmirror fails to recover from degraded mirror sets in some circumstances
Date: Thu, 25 Oct 2018 01:02:51 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232671

            Bug ID: 232671
           Summary: [gmirror] gmirror fails to recover from degraded mirror
                    sets in some circumstances
           Product: Base System
           Version: CURRENT
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: cem@freebsd.org

Here is the example scenario:

1. I start with a GMIRROR with two ACTIVE disks called "root0".

2.
   I essentially disconnect one of the disks:

    (pass1:pmspcbsd0:0:1:0): CAM_REMOVE_DEVICE
    da2 at pmspcbsd0 bus 0 scbus0 target 1 lun 0
    da2: s/n NHGDJ73Y detached
    g_access(977): provider da2 has error
    GEOM_MIRROR[0]: Request failed (error=6). da2p5[READ(offset=1836285952, length=16384)]
    GEOM_MIRROR[x]: Disk da2p5 state changed from ACTIVE to DISCONNECTED (device root0).
    ...
    GEOM_MIRROR[x]: Device root0: provider da2p5 disconnected.
    GEOM_MIRROR[x]: Consumer da2p5 destroyed.
    ...

3. I add a new hot-spare mirror device to the mirrorset.

    GEOM_MIRROR[1]: Adding disk da16p3 to root0.
    GEOM_MIRROR[1]: Disk da16p3 state changed from NONE to NEW (device root0).

4. GMIRROR begins synchronizing from the remaining live provider to the NEW
   one. As a result, the mirrorset's generation and/or sync id is bumped:

    GEOM_MIRROR[1]: Device root0: provider da16p3 detected.
    GEOM_MIRROR[1]: Disk da16p3 state changed from NEW to SYNCHRONIZING (device root0).
    GEOM_MIRROR[0]: Device root0: rebuilding provider da16p3.

5. The SCSI bus is rescanned and da2 comes back:

    da2 at pmspcbsd0 bus 0 scbus0 target 1 lun 0

6. GEOM_MIRROR rejects it because it has a stale generation or sync id:

    GEOM_MIRROR[1]: Adding disk da2p5 to root0.
    GEOM_MIRROR[x]: Component da2p5 (device root0) broken, skipping.
    GEOM_MIRROR[0]: Cannot add disk da2p5 to root0 (error=22).

7. At this point, the mirrorset has two disks (da15p3, ACTIVE, and da16p3,
   SYNCHRONIZING). The machine is rebooted before synchronization completes.

8. At boot, before mounting root, GEOM happens to detect the mirror disks in
   the following order:

   i.   da2p5 (the stale mirror that was ejected in (2))
   ii.  da16p3 (the mirror that is partially synchronized)
   iii. da15p3 (the only good / "ACTIVE" mirror in the set)

    GEOM_MIRROR[1]: Creating device root0 (id=1633884690).
    GEOM_MIRROR[1]: Device root0 created (2 components, id=1633884690).
    GEOM_MIRROR[1]: root_mount_hold 0xfffff8003f496160
    GEOM_MIRROR[1]: Adding disk da2p5 to root0.
    GEOM_MIRROR[1]: Disk da2p5 state changed from NONE to NEW (device root0).
    GEOM_MIRROR[1]: Device root0: provider da2p5 detected.
    GEOM_MIRROR[1]: Adding disk da16p3 to root0.
    GEOM_MIRROR[1]: Disk da16p3 state changed from NONE to NEW (device root0).
    GEOM_MIRROR[1]: Device root0: provider da16p3 detected.
    GEOM_MIRROR[0]: Component da2p5 (device root0) broken, skipping.
    << the bug is here, if not earlier >>
    GEOM_MIRROR[1]: Device root0 state changed from STARTING to RUNNING.
    GEOM_MIRROR[1]: Disk da16p3 state changed from NEW to SYNCHRONIZING (device root0).
    GEOM_MIRROR[1]: root_mount_rel[2352] 0xfffff8003f496160
    GEOM_MIRROR[1]: Adding disk da15p3 to root0.

9. Unfortunately, at the marked location above, GMIRROR sees the two broken
   mirrors, and decides to transition the mirror set into RUNNING.

   i. g_mirror_update_device(force=false) is called as a side effect of
      da16p3 transitioning to NEW.

   ii. We have the right number of mirrors, even though they are all broken:

    g_mirror_update_device(struct g_mirror_softc *sc, bool force)
    {
    ...
            switch (sc->sc_state) {
            case G_MIRROR_DEVICE_STATE_STARTING:
                {
    ...
                    /*
                     * Are we ready? We are, if all disks are connected or
                     * if we have any disks and 'force' is true.
                     */
                    ndisks = g_mirror_ndisks(sc, -1);
                    if (sc->sc_ndisks == ndisks || (force && ndisks > 0)) {
                            ;

   iii. We don't see any "dirty" mirrors because the logic ignores stale
        generations and disks mid-synchronization:

                    dirty = ndisks = 0;
                    pdisk = NULL;
                    LIST_FOREACH(disk, &sc->sc_disks, d_next) {
                            if (disk->d_sync.ds_syncid != syncid)
                                    continue;
                            if ((disk->d_flags &
                                G_MIRROR_DISK_FLAG_SYNCHRONIZING) != 0) {
                                    continue;

   iv. We interpret this as meaning we have a clean mirror set!

                    if (dirty == 0) {
                            /* No dirty disks at all, great. */

   v. And jump to RUNNING.

                    state = G_MIRROR_DEVICE_STATE_RUNNING;
                    G_MIRROR_DEBUG(1, "Device %s state changed from %s to %s.",
                        sc->sc_name, g_mirror_device_state2str(sc->sc_state),
                        g_mirror_device_state2str(state));
                    sc->sc_state = state;

10. Something triggers an event, which causes g_mirror_update_device() to be
    invoked again. The sc is in the RUNNING state:

            case G_MIRROR_DEVICE_STATE_RUNNING:
                    if (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) == 0 &&
                        g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_NEW) == 0) {
                            /*
                             * No usable disks, so destroy the device.
                             */
                            sc->sc_flags |= G_MIRROR_DEVICE_FLAG_DESTROY;
                            break;

11. And the gmirror destroys itself, even though we had a valid mirror we
    could have recovered from.
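
Steps 9 through 11 can be replayed outside the kernel with a small
standalone model. This is only a sketch, not the g_mirror code: every
model_* name and MODEL_* constant is hypothetical, staleness is reduced to a
single sync id (the report says "generation or sync id"), the reference sync
id is assumed to be the highest one among the detected disks, and the part of
the dirty-count loop that the excerpt elides is assumed to count disks
carrying a dirty flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's g_mirror types. */
enum model_disk_state {
	MODEL_DISK_NEW, MODEL_DISK_ACTIVE, MODEL_DISK_SYNCHRONIZING,
	MODEL_DISK_DROPPED
};
enum model_dev_state {
	MODEL_DEV_STARTING, MODEL_DEV_RUNNING, MODEL_DEV_DESTROYED
};

struct model_disk {
	const char *name;
	unsigned int syncid;	/* assumed: a stale disk carries a lower value */
	bool synchronizing;	/* models G_MIRROR_DISK_FLAG_SYNCHRONIZING */
	bool dirty;		/* assumed per-disk dirty flag */
	enum model_disk_state state;
};

/* Assumption: the reference sync id is the highest one among the disks. */
static unsigned int
model_max_syncid(const struct model_disk *disks, size_t n)
{
	unsigned int syncid = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		if (disks[i].syncid > syncid)
			syncid = disks[i].syncid;
	return (syncid);
}

/* Step 9.ii: ready when every expected disk is connected, or when forced. */
static bool
model_ready(size_t expected, size_t connected, bool force)
{
	return (expected == connected || (force && connected > 0));
}

/* Step 9.iii: stale and synchronizing disks are skipped, never counted. */
static int
model_count_dirty(const struct model_disk *disks, size_t n, unsigned int syncid)
{
	int dirty = 0;

	for (size_t i = 0; i < n; i++) {
		if (disks[i].syncid != syncid)
			continue;
		if (disks[i].synchronizing)
			continue;
		if (disks[i].dirty)
			dirty++;
	}
	return (dirty);
}

/* Step 10: the device is destroyed when no disk is ACTIVE or NEW. */
static bool
model_no_usable_disks(const struct model_disk *disks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (disks[i].state == MODEL_DISK_ACTIVE ||
		    disks[i].state == MODEL_DISK_NEW)
			return (false);
	return (true);
}

/* Replays steps 9 to 11 for the n disks detected before the ready check. */
static enum model_dev_state
model_replay_boot(struct model_disk *disks, size_t n, size_t expected)
{
	unsigned int syncid;

	if (n == 0 || !model_ready(expected, n, false))
		return (MODEL_DEV_STARTING);
	syncid = model_max_syncid(disks, n);
	if (model_count_dirty(disks, n, syncid) != 0)
		return (MODEL_DEV_STARTING);	/* dirty handling not modeled */
	/* Device goes RUNNING; disks settle as in the boot log of step 8. */
	for (size_t i = 0; i < n; i++) {
		if (disks[i].syncid != syncid)
			disks[i].state = MODEL_DISK_DROPPED;
		else if (disks[i].synchronizing)
			disks[i].state = MODEL_DISK_SYNCHRONIZING;
		else
			disks[i].state = MODEL_DISK_ACTIVE;
	}
	return (model_no_usable_disks(disks, n) ?
	    MODEL_DEV_DESTROYED : MODEL_DEV_RUNNING);
}
```

With made-up sync ids, feeding the model da2p5 (sync id 1) and da16p3 (sync
id 2, synchronizing) with two expected components gives MODEL_DEV_DESTROYED,
which matches step 11. Replacing da2p5 with a current, non-synchronizing
da15p3 gives MODEL_DEV_RUNNING, so in this model the outcome depends only on
which two components are detected first.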