Date:      Sun, 25 Sep 2005 12:08:27 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        jason@hudson-trading.com
Cc:        freebsd-hackers@freebsd.org, mikep@hudson-trading.com, freebsd-amd64@freebsd.org, Rob Watt <rob@hudson-trading.com>
Subject:   Re: freebsd-5.4-stable panics
Message-ID:  <20050925115912.H11229@fledge.watson.org>
In-Reply-To: <da4a53d805092310237d732554@mail.gmail.com>
References:  <da4a53d805092310237d732554@mail.gmail.com>

On Fri, 23 Sep 2005, Jason Carroll wrote:
> There seem to be 2 types of crashes we see with pretty different stack 
> traces.  What I'll call a type 1 crash, I believe, is often caused by 
> one of the triggers I mention above.  A type 2 crash appears to happen 
> spontaneously after the machine has been running for a while.
>
> I poked around using kgdb in a core file from a type 2 crash, and it 
> appeared the system hung closing sockets (specifically cleaning up 
> multicast state i think) while cleaning up one of our multicast 
> applications (note the trace through sys_exit).  There's no reason this 
> application should have been exiting unless it encountered some kind of 
> error.
>
> I'm attaching:
> dmesg.txt
> kernel-conf.txt (kernel config file)
> type1-core.txt (a kgdb bt from a type1/triggered crash)
> type2-core.txt (a kgdb bt from a type2/spontaneous crash)
>
> I'm happy to dig for more information, recompile with different options, 
> apply patches, or do anything else that might help get this problem 
> diagnosed and fixed!

Hi there Jason!

Sounds nasty.  It's possible the two panics are related, especially if 
they involve a race in the multicast code, which could result in treading 
on other kernel memory, potentially leading to the thread-related panic. 
My leaning would be that they are unrelated, but since we may be able to 
eliminate the multicast one (see below), that would be a good starting 
point.

In the 6.x branch, quite a bit of work has been done to improve locking in 
the multicast code, and several important races have been fixed relating 
to IP multicast.  These races tended to turn up in the following sorts of 
situations:

(1) Multi-threaded applications changing the multicast properties, such
     as membership, of a particular socket in parallel.

(2) Changes to multicast membership during high multicast I/O load on the
     socket.  For example, adding or deleting multicast groups on a socket
     on CPU 0 while a packet is delivered to the same socket on CPU 1.

(3) Removal of real or synthetic interfaces involved in active multicast,
     such as removal of pccards, vlans, etc., during multicast I/O, or with
     sockets bound to the interfaces.

These changes are not currently scheduled for a backport to 5.x, because 
they change the kernel network device driver API and ABI, requiring 
changes to and recompiling of third-party device drivers.  A subset could 
be backported, subject to some limitations, but it would be good to 
confirm whether these changes actually affect the problems you're seeing 
before working through that. All the changes should appear in the most 
recent snapshot, BETA5.  Make sure to turn off extra kernel debugging 
features, such as WITNESS, INVARIANTS, and user space malloc debugging, if 
you start running into performance problems -- they have a big performance 
impact, although they can be quite helpful in testing.  Normally we turn 
these off during the release candidate portion of the release cycle.

There are some other known stability nits in 6.x which are being worked 
on, but in general the network stack stability is higher in 6.x than 5.x 
when it comes to multicast due to the work I reference above.  If you run 
into any stability problems relating to the file system, set 
debug.mpsafevfs=0 in loader.conf -- there are a few bug fixes relating to 
running out of disk space or hitting quota limits that are fixed in HEAD, 
but not yet backported to 6.x.

Thanks,

Robert N M Watson


