Date: Sun, 25 Sep 2005 12:08:27 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: jason@hudson-trading.com
Cc: freebsd-hackers@freebsd.org, mikep@hudson-trading.com, freebsd-amd64@freebsd.org, Rob Watt <rob@hudson-trading.com>
Subject: Re: freebsd-5.4-stable panics
Message-ID: <20050925115912.H11229@fledge.watson.org>
In-Reply-To: <da4a53d805092310237d732554@mail.gmail.com>
References: <da4a53d805092310237d732554@mail.gmail.com>

On Fri, 23 Sep 2005, Jason Carroll wrote:

> There seem to be 2 types of crashes we see with pretty different stack
> traces.  What I'll call a type 1 crash, I believe, is often caused by
> one of the triggers I mention above.  A type 2 crash appears to happen
> spontaneously after the machine has been running for a while.
>
> I poked around using kgdb in a core file from a type 2 crash, and it
> appeared the system hung closing sockets (specifically cleaning up
> multicast state i think) while cleaning up one of our multicast
> applications (note the trace through sys_exit).  There's no reason this
> application should have been exiting unless it encountered some kind of
> error.
>
> I'm attaching:
> dmesg.txt
> kernel-conf.txt (kernel config file)
> type1-core.txt (a kgdb bt from a type1/triggered crash)
> type2-core.txt (a kgdb bt from a type2/spontaneous crash)
>
> I'm happy to dig for more information, recompile with different options,
> apply patches, or do anything else that might help get this problem
> diagnosed and fixed!

Hi there Jason!

Sounds nasty. It's possible the two panics are related, especially if they involve a race in the multicast code, which could result in treading on other kernel memory, potentially leading to the thread-related panic. My leaning would be that they are unrelated, but since we may be able to eliminate the multicast one (see below), that would be a good starting point.

In the 6.x branch, quite a bit of work has been done to improve locking in the multicast code, and several important races have been fixed relating to IP multicast. These races tended to turn up in the following sorts of situations:

(1) Multi-threaded applications changing the multicast properties, such as membership, of a particular socket in parallel.

(2) Changes to multicast membership during high multicast I/O load on the socket. For example, adding or deleting multicast groups on a socket on CPU 0 while a packet is delivered to the same socket on CPU 1.

(3) Removal of real or synthetic interfaces involved in active multicast, such as removal of pccards, vlans, etc. during multicast I/O, or with sockets bound to the interfaces.

These changes are not currently scheduled for a backport to 5.x, because they change the kernel network device driver API and ABI, requiring changes to and recompiling of third party device drivers. A subset could be backported, subject to some limitations, but it would be good to confirm whether these changes actually affect the problems you're seeing before working through that.

All the changes should appear in the most recent snapshot, BETA5. Make sure to turn off extra kernel debugging features, such as WITNESS, INVARIANTS, and user space malloc debugging, if you start running into performance problems -- they have a big performance impact, although they can be quite helpful in testing. Normally we turn these off during the release candidate portion of the release cycle.

There are some other known stability nits in 6.x which are being worked on, but in general the network stack stability is higher in 6.x than 5.x when it comes to multicast due to the work I reference above. If you run into any stability problems relating to the file system, set debug.mpsafevfs=0 in loader.conf -- there are a few bugs relating to running out of disk space or hitting quota limits that are fixed in HEAD, but the fixes are not yet backported to 6.x.

Thanks,

Robert N M Watson