Date: Wed, 10 Dec 1997 03:54:50 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: henrich@crh.cl.msu.edu (Charles Henrich)
Cc: eivind@yes.no, perhaps@yes.no, freebsd-current@freebsd.org
Subject: Re: VM system info
Message-ID: <199712100354.UAA07489@usr06.primenet.com>
In-Reply-To: <19971209145549.60899@crh.cl.msu.edu> from "Charles Henrich" at Dec 9, 97 02:55:49 pm
> > There are four ways to cope: (1) Ignore error; return OK, even though the
> > function failed to do its job. (2) Return error code (3) Throw an
> > exception of some sort, e.g. longjmp(). (4) panic(), a la assert().
>
> That depends greatly on the situation. There is also a (5) that says
> take all given known information and continue onward, while logging
> the error. In some cases it's obviously not possible where a routine
> is designed to have no return value.

[ ... ]

> I'm not arguing the trap all errors as soon as possible piece, I'm arguing in
> what you do when you detect one. To shutdown the machine is the worst
> solution.

I've recently identified (but not isolated) a bug in the FreeBSD
network code that can apparently spam the kernel stack of any process
currently in the kernel.

I have yet to track this down because all I can see is the side
effect, not the effect that results in the spamming.

Another engineer has identified the most probable place that the
spam occurred, simply because there's no place else that even looks
vaguely like it could result in what I'm seeing:

o	In select(), selscan() got a page not present error when
	accessing obits[ 0]. This is not an error I can "ignore
	and log".

	The select() was initiated by syslogd for input on its TCP
	(fd=3) and UDP (fd=4) ports.

o	Apparently, something is spamming the contents of the kernel
	stack. You can see this by going into kdb and examining the
	*ibits[3], *obits[3], and atv values and noting something
	that looks like a sockaddr with the following attributes:

	o	A sa_len of 0x20
	o	A sa_family of 0xff
	o	The MAC addr of a remote machine
	o	The MAC addr of the local machine
	o	A protocol value of 0800 (IP)

o	There is (apparently) only one place in the kernel (a
	dereference of *eh members, where eh is an mdata(m...) of
	an mbuf) where this data could have originated.

The only fruitful approach is to check for a *eh < 0xf0000000, using
an assert that panics, so the processor stops earlier in the problem.
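A minimal user-space sketch of that kind of check, written as an illustration rather than the actual kernel change. Every name in it (SKETCH_KERNEL_BASE, sketch_panic, sketch_is_kernel_addr, sketch_check_eh) is hypothetical, the 0xf0000000 bound is simply the value mentioned above, and abort() stands in for the kernel's panic():

```c
#include <assert.h>   /* only needed when exercising the helpers with assert() */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * ASSUMPTION: kernel virtual addresses start at 0xf0000000, the bound
 * named in the message. The real value depends on the platform.
 */
#define SKETCH_KERNEL_BASE ((uintptr_t)0xf0000000u)

/* Hypothetical stand-in for the kernel's panic(): report, then stop. */
static void
sketch_panic(const char *what, uintptr_t addr)
{
	fprintf(stderr, "panic: %s (addr=%#lx)\n", what, (unsigned long)addr);
	abort();
}

/* Returns 1 if addr is at or above the assumed kernel base, else 0. */
static int
sketch_is_kernel_addr(uintptr_t addr)
{
	return addr >= SKETCH_KERNEL_BASE;
}

/*
 * Assert-style guard, meant to sit immediately before the header
 * pointer is dereferenced, so a bad pointer stops execution at the
 * point of damage instead of somewhere unrelated later on.
 */
static void
sketch_check_eh(uintptr_t eh_addr)
{
	if (!sketch_is_kernel_addr(eh_addr))
		sketch_panic("ether header pointer below kernel base", eh_addr);
}
```

Calling sketch_check_eh() with an in-range address returns silently; an out-of-range one stops the program at the call site, which is the point of the exercise: the failure is reported where the bad pointer appears, not where a victim later trips over the damage.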
This particular problem could result in random "non-fatal" corruption
of data in *your* kernel. It's probably responsible for many
"impossible" situation type crashes (hint: random kernel stack stomping
of a victim process's stack is not a good thing).

If you can think of a way *other* than an assert to find this problem,
I'm open to suggestions.

> Let's think for a moment about the case if you're the computer system
> on a F-15 fighter jet, the last thing the pilot wants to see is
> "Panic, system halted" as he spirals to his death instead of the
> software attempting to cope as best as possible.

Probably he would be less happy with "missile launched" as he's landing
on a friendly aircraft carrier because of some cascade failure.

BTW, to handle: You do a fast reset from ROM and hope the error
doesn't occur again.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.