Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 23 Apr 2018 14:00:24 -0400
From:      Mark Johnston <markj@freebsd.org>
To:        "Jonathan T. Looney" <jtl@freebsd.org>
Cc:        cem@freebsd.org, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r332860 - head/sys/kern
Message-ID:  <20180423180024.GC84833@raichu>
In-Reply-To: <CADrOrmvAxuoadBM==1EEbJc4PAPwtd-vPE4Tg-pM86CvwQnnwA@mail.gmail.com>
References:  <201804211705.w3LH50Dk056339@repo.freebsd.org> <CAG6CVpVAKAou%2B=9dv%2Bef8txmykR%2BMCpbmvteJua7ErhXvM7rCg@mail.gmail.com> <CADrOrmsmfjuj3_rGoVydF9ahvfuCunNyrHRs1VxKZWr=cUmLhQ@mail.gmail.com> <20180422171106.GB84833@raichu> <CADrOrmvAxuoadBM==1EEbJc4PAPwtd-vPE4Tg-pM86CvwQnnwA@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, Apr 23, 2018 at 11:12:32AM -0400, Jonathan T. Looney wrote:
> Hi Mark,
> 
> Let me start by saying that I appreciate your well-reasoned response. (I
> think) I understand your reasoning, I appreciate your well-explained
> argument, and I respect your opinion. I just wanted to make that clear up
> front.
> 
> On Sun, Apr 22, 2018 at 1:11 PM, Mark Johnston <markj@freebsd.org> wrote:
> >
> > > All too often, my ability to debug assertion violations is hindered
> because
> > > the system trips over yet another assertion while dumping the core. If
> we
> > > skip the assertion, nothing bad happens. (The post-panic debugging code
> > > already needs to deal with systems that are inconsistent, and it does a
> > > pretty good job at it.)
> >
> > I think we make a decent effort to fix such problems as they arise, but
> > you are declaring defeat on behalf of everyone. Did you make some effort
> > to fix or report these issues before resorting to the more drastic
> > measure taken here?
> 
> We try to report or fix them as they arise. However, you don't know there
> is a problem until you actually run into it. And, you don't run into the
> problem until you can't get a core dump due to the assertion.
> 
> (And, with elusive problems, it isn't always easy to duplicate them. So,
> fixing the assertion is sometimes "too late".)

Sure, this is true. But unless it's a problem in practice it's obviously
preferable to keep assertions enabled. Kernel dumping itself is a
fundamentally unreliable mechanism, but it works well enough to be
useful. I basically never see problems with post-panic assertion
failures, and I test the kernel dump code a fair bit. Isilon exercises
that code quite a lot as well without any problems that I'm aware of,
and I can't think of any reports of such assertion failures that weren't
quickly fixed. So I'm wondering what problems exist in your specific
environment that we might instead address surgically.

(I could very well be wrong about how widespread post-panic assertion
failures are. We've had problems of this sort before, e.g., with the
updated DRM graphics drivers, where the code to grab the console after a
panic didn't work properly. There, the bandaid was to just disable that
specific mechanism.)

> > > On the other hand, I really am not sure what you are worried might
> happen
> > > if we skip checking assertions after we've already panic'd. As far as I
> can
> > > tell, the likely worst case is that we hit a true panic of some kind. In
> > > that case, we're no worse off than before.
> > >
> > > I think the one obvious exception is when we're purposely trying to
> > > validate the post-panic debugging code. In that case, you can change the
> > > sysctl/tunable to enable troubleshooting.
> >
> > What about a user whose test system panics and fails to dump? With
> > assertions enabled, a developer has a better chance of spotting the
> > problem. Now we need at least one extra round trip to the user to
> > diagnose the problem, which may not be readily reproducible in the first
> > place.
> 
> That's true. However, this is equally true in the other direction: Prior to
> this change, when a user tripped over an assertion and was unable to get a
> coredump due to a second post-panic assertion, it took (at least) another
> round-trip to get a coredump.
> 
> First, without the capability to ignore assertions after a panic
> (introduced by this commit), you would need to fix the actual assertion to
> enable the user to get a coredump. At minimum, I think this change has
> value in that case. This change gives you a mechanism to get a coredump
> without requiring that you fix the assertion and get the user to recompile
> with the patch.
> 
> But, moreover, if we change the default back to panic'ing on a second
> assertion, we will hamper our ability to get usable reports about elusive
> bugs. If we leave the default "as is", it won't take an extra round-trip to
> tell the user how to get a coredump. If we change the default (or, perhaps
> more correctly, "restore the prior default"), we will still need a second
> round-trip to get coredumps. That makes it tough to chase elusive bugs.

I agree with what you're saying. I'm thinking more of the long-term
effects of this change and am concerned that it increases the potential
for actual bugs to appear in the kernel dump code paths. Those types of
bugs are often quite tricky to track down from a single instance, and
can cause dumps to fail. If that starts to happen, we're basically back
to where we started.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180423180024.GC84833>