Date: Fri, 29 Aug 2008 11:44:54 -0400
From: John Baldwin <jhb@freebsd.org>
To: freebsd-current@freebsd.org
Cc: Kirk Strauser <kirk@strauser.com>
Subject: Re: System, diagnose thyself: auto-documentation for crashes
Message-ID: <200808291144.54193.jhb@freebsd.org>
In-Reply-To: <BDDFB834-C15F-4E48-B1D1-B644940FBE42@strauser.com>
References: <BDDFB834-C15F-4E48-B1D1-B644940FBE42@strauser.com>
On Friday 29 August 2008 11:13:57 am Kirk Strauser wrote:
> I was having flaky system problems that were driving me to
> distraction. Yesterday, I finally got a panic message with an
> instruction pointer, used addr2line to see that the failure was in
> uma_zfree_internal, searched Google, and learned that it was probably
> due to bad RAM. Half an hour later, memtest86 found the defective
> stick and the problem was solved.
>
> This led me to thinking, though: the OS already had all the
> information needed to figure out where the problem was. If there had
> been an explanation inside that function definition, FreeBSD could
> have automatically gone to the file, searched for that explanation,
> and told me why my system had probably crashed.
>
> I propose that we:
>
> 1) Settle on a standard comment format for metainformation. There are
> already standards like Doxygen if we didn't want to home-roll something.
>
> 2) Write a program that takes an instruction pointer and outputs the
> comment for the associated function.
>
> 3) Modify /etc/rc.d/savecore to run the program from #2.
>
> For instance, suppose the comments in sys/vm/uma_core.c looked like:
>
> /*
>  * Frees an item to an INTERNAL zone or allocates a free bucket
>  *
>  * Arguments:
>  *     zone   The zone to free to
>  *     item   The item we're freeing
>  *     udata  User supplied data for the dtor
>  *     skip   Skip dtors and finis
>  *
>  * Failure:
>  *     Failures in this function are commonly due to defective RAM.
>  */
> static void
> uma_zfree_internal(uma_zone_t zone, void *item, void *udata,
>     enum zfreeskip skip, int flags)
> {
>     ...
> }
>
> If I'd seen that failure message in my syslog, I would have avoided a
> few days of teeth gnashing. What do you think? I think something
> like this could be extremely useful. Benefits:
>
> - There would be zero impact on performance because it would only
> touch comments and not any running code whatsoever.
> - It would require minimal work.
> - It could be done incrementally. Document known common failure
> points and add others with time.
> - It wouldn't affect any other systems.
See /usr/sbin/crashinfo for a start. I have patches to enable it
from /etc/rc.d/savecore after saving a crash dump (still need to test them
though).
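
For illustration only, step 2 of the quoted proposal could be sketched
roughly as below. This is not how crashinfo works. It assumes the
instruction pointer has already been resolved to a function name (for
example with addr2line, as in the quoted message), that the source file is
already in memory, and that comments follow the hypothetical "Failure:"
layout from the quoted example. The helper name failure_note is made up
for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the "Failure:" section of the block comment
 * that directly precedes the definition of func_name in src into out,
 * joining its lines with single spaces.
 * Returns 0 on success, -1 if the function or the section is not found.
 * Assumes BSD style, where the function name starts its own line.
 */
static int
failure_note(const char *src, const char *func_name, char *out, size_t outlen)
{
	const char *def, *p, *open, *close, *sect, *end;
	size_t namelen, used, len;

	assert(src != NULL && func_name != NULL && out != NULL && outlen > 0);
	out[0] = '\0';
	namelen = strlen(func_name);

	/* Find "name(" at the start of a line: the definition, not a call. */
	def = NULL;
	for (p = src; (p = strstr(p, func_name)) != NULL; p += namelen) {
		if ((p == src || p[-1] == '\n') && p[namelen] == '(') {
			def = p;
			break;
		}
	}
	if (def == NULL)
		return (-1);

	/* Find the last comment opener before the definition. */
	open = NULL;
	for (p = src; (p = strstr(p, "/*")) != NULL && p < def; p += 2)
		open = p;
	if (open == NULL)
		return (-1);
	close = strstr(open + 2, "*/");
	if (close == NULL || close > def)
		return (-1);

	/* Reject comments separated from the definition by other code. */
	for (p = close + 2; p < def; p++)
		if (*p == ';' || *p == '}')
			return (-1);

	/* Locate the section heading and skip to the line after it. */
	sect = strstr(open, "Failure:");
	if (sect == NULL || sect > close)
		return (-1);
	sect = strchr(sect, '\n');
	if (sect == NULL || sect > close)
		return (-1);
	sect++;

	used = 0;
	while (sect < close) {
		end = memchr(sect, '\n', (size_t)(close - sect));
		if (end == NULL)
			end = close;
		/* Strip the comment leader: blanks, one '*', blanks. */
		while (sect < end && (*sect == ' ' || *sect == '\t'))
			sect++;
		if (sect < end && *sect == '*')
			sect++;
		while (sect < end && (*sect == ' ' || *sect == '\t'))
			sect++;
		if (sect == end && used > 0)
			break;		/* a blank line ends the section */
		if (sect < end) {
			len = (size_t)(end - sect);
			if (used + len + 2 > outlen)
				return (-1);	/* output buffer too small */
			if (used > 0)
				out[used++] = ' ';
			memcpy(out + used, sect, len);
			used += len;
			out[used] = '\0';
		}
		sect = end + 1;
	}
	return (used > 0 ? 0 : -1);
}
```

A caller would read the source file into a buffer, pass the function name
reported for the faulting address, and log the returned text if the call
succeeds.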
--
John Baldwin
