Date:      Tue, 05 May 2020 01:44:39 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 246207] [geom] geli livelocks during panic
Message-ID:  <bug-246207-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=246207

            Bug ID: 246207
           Summary: [geom] geli livelocks during panic
           Product: Base System
           Version: 12.1-STABLE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: asomers@FreeBSD.org

Some geli-using machines I administer occasionally panic.  When they do, they
sometimes dump core but often don't.  When they don't, they simply hang after
printing the stack trace, but before printing the uptime.

I've traced the problem to geli's shutdown_pre_sync event handler.  It tries to
destroy each geli device.  We can't simply skip that step if a panic is
underway; erasing the keys is necessary to prevent warm-boot attacks.  The
problem lies in the following lines.

g_eli_destroy:
        sc->sc_flags |= G_ELI_FLAG_DESTROY;
        wakeup(sc);
        /*
         * Wait for kernel threads self destruction.
         */
        while (!LIST_EMPTY(&sc->sc_workers)) {
                msleep(&sc->sc_workers, &sc->sc_queue_mtx, PRIBIO,
                    "geli:destroy", 0);
        }

_sleep:
        if (SCHEDULER_STOPPED_TD(td)) {
                if (lock != NULL && priority & PDROP)
                        class->lc_unlock(lock);
                return (0);
        }

As you can see, if the scheduler is stopped for the current thread (which it
will be during a panic), then msleep returns immediately without sleeping,
causing g_eli_destroy to loop indefinitely.  The obvious solution, which I
haven't yet tested, would be to skip that section in g_eli_destroy when the
scheduler is stopped.  What I don't understand is why g_eli_destroy _ever_
works during a panic.  Perhaps it has something to do with the allocation of
worker threads among cores?  Perhaps it only succeeds when all worker threads
happen to be on different cores?  I find that unlikely though, because these
servers have thousands of worker threads.
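
The failure mode and the proposed guard can be sketched in userland.  This is
a simplified model under stated assumptions, not kernel code and not a tested
patch: every model_* name is hypothetical, the worker list is reduced to a
counter, and a spin cap is added only so that the demonstration terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the geli softc; sc_workers becomes a counter. */
struct model_softc {
	int nworkers;
};

/* Hypothetical stand-in for the SCHEDULER_STOPPED_TD(td) check in _sleep. */
static bool model_scheduler_stopped;

/*
 * Models the _sleep excerpt: with the scheduler stopped, return at once.
 * Otherwise pretend that one worker exits while the caller sleeps.
 */
static int
model_msleep(struct model_softc *sc)
{
	assert(sc != NULL);
	if (model_scheduler_stopped)
		return (0);
	if (sc->nworkers > 0)
		sc->nworkers--;
	return (0);
}

/*
 * Models the wait loop in g_eli_destroy.  Returns the number of loop
 * iterations; max_spins exists only to keep this model from hanging.
 */
static long
model_wait_unguarded(struct model_softc *sc, long max_spins)
{
	long spins = 0;

	while (sc->nworkers > 0 && spins < max_spins) {
		model_msleep(sc);
		spins++;
	}
	return (spins);
}

/* Same loop, skipped entirely when the scheduler is stopped. */
static long
model_wait_guarded(struct model_softc *sc, long max_spins)
{
	if (model_scheduler_stopped)
		return (0);
	return (model_wait_unguarded(sc, max_spins));
}
```

With model_scheduler_stopped set, the unguarded loop runs until it hits the
spin cap and nworkers never changes, while the guarded variant returns
immediately.  With it clear, both variants drain the workers normally.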

-- 
You are receiving this mail because:
You are the assignee for the bug.


