Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 15 Dec 2012 22:00:37 -0700
From:      Ian Lepore <freebsd@damnhippie.dyndns.org>
To:        Peter Wemm <peter@wemm.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Peter Jeremy <peter@rulingia.com>
Subject:   Re: svn commit: r244112 - head/sys/kern
Message-ID:  <1355634037.1198.115.camel@revolution.hippie.lan>
In-Reply-To: <CAGE5yCofnCKfJ8kMKrV8fmckPt_WOXc9PGnh3zuVUSGO-%2BrCRQ@mail.gmail.com>
References:  <201212110708.qBB78EWx025288@svn.freebsd.org> <201212121046.43706.jhb@freebsd.org> <CAJ-Vmo=U04GX%2BZyKuzXLwV%2BPpzU6_dm5BCmL=DWfsmhTVAR%2BsA@mail.gmail.com> <201212121658.49048.jhb@freebsd.org> <50C90567.8080406@FreeBSD.org> <50C909BD.9090709@mu.org> <50C91B32.4080904@FreeBSD.org> <20121215205202.GF1411@garage.freebsd.pl> <20121216040717.GG35245@server.rulingia.com> <CAGE5yCofnCKfJ8kMKrV8fmckPt_WOXc9PGnh3zuVUSGO-%2BrCRQ@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 2012-12-15 at 20:36 -0800, Peter Wemm wrote:
> On Sat, Dec 15, 2012 at 8:07 PM, Peter Jeremy <peter@rulingia.com> wrote:
> > [pruning the CC list]
> > On 2012-Dec-15 21:52:03 +0100, Pawel Jakub Dawidek <pjd@FreeBSD.org> wrote:
> >>On Wed, Dec 12, 2012 at 04:02:58PM -0800, Navdeep Parhar wrote:
> >>> A KASSERT() really is for a condition that should never happen.
> >
> > and can't be practically recovered from.
> >
> >>I have sort of mixed feelings about this change, but in reality we have
> >>three cases:
> >>
> >>1. Fatal conditions that shouldn't happen, but may happen for some
> >>   reason and we definiately want to stop running (corrupted file system
> >>   metadata that can mess up the file system more badly). For those we have
> >>   direct panic(9) calls.
> >>
> >>2. Fatal conditions that cannot happen and for those we have KASSERT(9).
> >
> > I don't see the difference between these two cases.  In both of them,
> > the system has entered a state that the designer/programmer didn't
> > envisage and can't recover from.  The best option is to abort as
> > quickly to minimise further corruption.
> 
> Actually it was a pretty good description of the situation as it
> existed for the last N years (where N > 10).
> 
> Back in the early days we had #ifdef DIAGNOSTIC which was a general
> catch-all bucket to put anything remotely related to diagnostic code.
> The problem was that it gathered so much debugging code that so much
> was changed by all the added code that it was usually the case that a
> DIAGNOSTIC kernel added new problems, if it even booted at all.  It
> had everything from extra functions that existed just for ddb's
> benefit through extra assertions.
> 
> There was also the problem that there was no knob to compile in things
> you could call from ddb to check the state of things as it was all in
> one lump.  That lead to the great DIAGNOSTIC -> INVARIANTS /
> INVARIANT_SUPPORT split.  INVARIANTS was for extra assertions, and
> INVARIANT_SUPPORT was ddb callable functions and code for INVARIANTS
> assertions to call.
> 
> panic remained for fatal conditions, while INVARIANTS (and KASSERT)
> were introduced to add additional non-invasive sanity checks to detect
> things that were going off into the weeds before too big a mess was
> made.
> 
> The point of KASSERT was that it was something you could turn on
> during development and/or or while trying to get a look at a problem
> sooner in DDB before things got too messed up.  It was an early
> warning to get you into ddb faster/sooner.
> 
> We were not supposed to need it on during production use because it
> was for the kinds of problems that would be detected later via other
> means.  eg: a panic or a trap.
> 
> eg: one of the earlier uses was checking priorities before beginning a
> context switch.  If you had an invalid priority state, you could get
> fairly deep into mi_switch() before a run queue slot was unable to be
> found.  But by the time that happened, the context switch code had
> changed things a lot before it reached that fatal condition.  So we'd
> add a KASSERT at the start so we could get into DDB before we'd
> started messing with run queues and had a better chance at seeing how
> we got into the mess.
> 
> I must say I'm rather concerned about the direction of recent changes
> in this area.

To me, that all reads like bigtime support of what Alfred is up to.  The
messages emitted by KASSERT checks give you information about a problem
before it becomes obscured.  In development it can also abort on the
spot, because a developer can do something useful with that abort.

On the other hand, a production server or device that suddenly hangs, or
worse keeps running but badly in some way, might be much easier to deal
with if there was a logged message that let you know that some check
failed and it would have aborted if the abort option were turned on.

The question here isn't whether aborting or continuing beyond that point
is a good idea.  Some developer already made that choice by coding a
KASSERT() instead of a panic().  The developer decided that a production
machine should try to keep running at that point.

The only question here is whether to let people to choose to get more
information when they've also choosen not to immediately abort on a
failed check.  They still can't choose to override the decisions made by
the developer about which situations must abort and which can try to
continue.  

Why not let folks ask for more info, if they can afford the cost of the
extra checks?  So far the only really sane response I've seen in this
thread was "tools, not policy".

-- Ian





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1355634037.1198.115.camel>