Date: Sat, 16 Jan 1999 07:33:03 +0100
From: Eivind Eklund
To: Archie Cobbs
Cc: freebsd-current@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: Automated debug sanity checkers
Message-ID: <19990116073302.B6405@bitbox.follo.net>
References: <199901160512.VAA07999@bubba.whistle.com>
In-Reply-To: <199901160512.VAA07999@bubba.whistle.com>; from Archie Cobbs on Fri, Jan 15, 1999 at 09:12:07PM -0800

On Fri, Jan 15, 1999 at 09:12:07PM -0800, Archie Cobbs wrote:
> I was thinking about the DIAGNOSTICS replacement macros and
> had a random thought...
>
> Suppose you're sitting in front of a ddb (or better yet gdb) prompt
> because your kernel has just crashed due to who knows what reason.
> What do you do to debug this? You start looking at variables,
> memory, etc for anything funny going on.
>
> For example, several times we've spent hours going through a crash
> dump to find, for example, that a process was on two queues, or
> some mbuf was mangled, etc.
>
> The thought is that it would be really easy to help automate this
> process, by doing the following:
>
> 1. Define a new kernel option INCLUDE_SANITY_CHECKS (or whatever)

INVARIANT_SUPPORT.  Hey, I just happen to remember that somebody added
this a couple of days ago - hmm, could it have been me? :-)

> 2. When this is defined, all the various FreeBSD kernel
>    submodules (VM, networking, device drivers, etc) would
>    include a function that exhaustively runs sanity checks --
>    ie, validations that all the assumptions in the code are true --
>    for that particular submodule. This means checking all queues,
>    flags, whatever.

Ie, invariants.

> 4. The function is linked into a linker set SANITY_SET(...) or whatever

I've not thought of that - that may be a good idea.

> Then by simply calling this function from the debugger you can
> much more quickly narrow down on the problem (and hopefully fix
> it before you get tired and go to sleep :-)
>
> Moreover, since the function is running post-mortem, it can do
> very detailed checks that would otherwise take way too long.
> E.g., check every mbuf, every queue entry, check the filesystem,
> etc. Basically a "fsck" for the kernel memory.

You do not want to call this only post-mortem.  You often want to
selectively use this while the kernel is running.

Example: At one point (a year and a half or so ago), I was debugging
the tty driver in bisdn.  For some reason, it was crashing in various
ways at various times, with no sane reason - just garbage data.  I
spent quite a bit of time looking at this, finding no reason for the
faults - they "just happened", taking on average perhaps 4 hours under
load to trigger.

As I was getting more and more frustrated with attempting to shotgun
debug this, I went back to my normal mode of development - I wrote
invariants for all data structures in the vicinity.
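
A minimal sketch of what an invariant check for a data structure could
look like, in C.  The queue layout and the names queue_invariant() and
queue_append() are hypothetical, chosen for illustration; they are not
taken from the FreeBSD or bisdn sources, and the real clist structures
look different.

```c
#include <stddef.h>

/* Hypothetical doubly linked queue, for illustration only. */
struct node {
    struct node *next;
    struct node *prev;
};

struct queue {
    struct node *head;
    struct node *tail;
    size_t count;           /* number of nodes the queue claims to hold */
};

/*
 * Return 1 if the queue is internally consistent, 0 otherwise.
 * The walk is bounded by q->count, so a corrupted (cyclic) list
 * cannot make the check loop forever.
 */
static int
queue_invariant(const struct queue *q)
{
    const struct node *n, *prev = NULL;
    size_t seen = 0;

    if (q->count == 0)
        return (q->head == NULL && q->tail == NULL);
    if (q->head == NULL || q->tail == NULL)
        return (0);

    for (n = q->head; n != NULL; prev = n, n = n->next) {
        if (seen == q->count)   /* more nodes than claimed, or a cycle */
            return (0);
        seen++;
        if (n->prev != prev)    /* back link must mirror forward link */
            return (0);
    }
    /* The last node reached must be the recorded tail. */
    return (seen == q->count && prev == q->tail);
}

/* Append a node at the tail; used here only to build example queues. */
static void
queue_append(struct queue *q, struct node *n)
{
    n->next = NULL;
    n->prev = q->tail;
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    q->count++;
}
```

Calling a check like this at every point that touches the structure,
and panicking when it returns 0, is what moves the failure close to
the code that caused the corruption.
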
When I added an invariant for the clist structures (and checks of it
all over the place), I found that my "crash" (now an "invariant
incorrect" panic) time went down to two minutes - and that it was
always the same way, with the same stack backtrace, instead of
crashing at various random points.

The reason for the bug turned out to be that both I and the
implementor of the driver had missed the change of spls from levels in
BSD4.4 to masks in FreeBSD.  After I had seen the invariant failure, I
could see that something was being interrupted between two spls - and
after 3 minutes of reading the FreeBSD manpage and three lines of
changes I had something that worked.

That driver had been non-functional for at least three releases of
bisdn (and the userland code to handle it was not even there, which I
expect was due to this).  I further expect that somebody had tried
pretty hard to debug it, as they had spent the time to actually write
it.  The fact that I (who at that point had little experience with the
FreeBSD kernel) was able to debug that in a couple of hours where
others had used more time and failed before me shows some of the power
of invariants for finding obscure bugs.

I would like to have invariants available for all significant data
structures, and I'm planning to write them up as I get time for it.

> Is this something that people would be motivated enough to make
> as "official" FreeBSD kernel good housekeeping policy?

I suspect a large number of us will use it, making it likely it will
sort of maintain itself.

Eivind.