From: Giorgos Keramidas
To: Julian Elischer
Cc: freebsd-current@freebsd.org, Robert Watson
Subject: Re: panic in rt_check_fib()
Date: Sun, 14 Sep 2008 15:56:12 +0300
Message-ID: <87tzcij383.fsf@kobe.laptop>

On Sat, 13 Sep 2008 23:28:51 -0700, Julian Elischer wrote:
> To recap on this, I rewrote this function a couple of weeks ago because I
> couldn't keep track of what was going on, and I thought it might
> have some bad edge cases. A couple of days later Giorgos contacted me
> saying that he had a fairly reproducible situation
> where this was triggered and it appeared to be an edge case in
> this function that allowed it to try to lock the same lock twice.
>
> I immediately thought "ah-hah!" I may have a solution to this,
> and gave him a copy of my new function and indeed it DOES fix that
> panic. However, after deleting and recreating interfaces a few hundred
> times without crashing in rt_check_fib() it then fails somewhere else
> (actually it leaks some resources and eventually networking stops).
>
> I'm not convinced that is a problem with the new or old rt_check() but
> it did stop me from just committing the new code.
>
> In rereading the way the function (did and still does) work it
> occurred to me that there was a large flaw in the way it worked.
>
> It dropped the lock on one route while it went off and did something
> else that might block. On returning it blindly re-grabbed that lock,
> completely ignoring the fact that the route might not even be valid any
> more (or any of several other things that may have changed while
> it was away, maybe sleeping).
>
> The code Giorgos is referring to is a patch I suggested to him to
> fix this oversight and not the one that I originally tested and
> had suggested to fix the edge case.
>
> I do however ask that some other people look at this patch!

Exactly.
Thanks for summarizing this so well :)

I have started a kernel with your latest patch (from the quoted message
above), and I can't panic my kernel with the script that did it in a
semi-reliable manner before:

% root@kobe:/root# while true ; do \
%     sh home.sh > /dev/null 2>&1 ; \
%     vmstat -z | sed -n -e 1p -e /rt/p ; \
%     sleep 1 ; \
% done
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   19,   77,       43,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   20,   76,       47,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   21,   75,       51,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   23,   73,       55,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   24,   72,       59,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   25,   71,       62,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   26,   70,       65,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   27,   69,       69,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   29,   67,       73,        0
% ITEM      SIZE  LIMIT  USED  FREE  REQUESTS  FAILURES
% rtentry:   120,     0,   30,   66,       76,        0
% ^C
% root@kobe:/root# sh home.sh

rtentries seem to be going up every time I cycle through the script,
which essentially brings down both wireless and wired interfaces and
then brings up the wired interface of my laptop.

The core of the script is currently:

    # network interface options
    export ifconfig_re0="inet 192.168.1.10/24"
    export defaultrouter='192.168.1.1'

    echo '## Stopping network interfaces.'
    /etc/rc.d/netif stop re0 && ifconfig re0 delete
    /etc/rc.d/netif stop iwn0 && ifconfig iwn0 delete

    echo '## Bringing up network interface.'
    /etc/rc.d/netif start re0

    echo "## Reloading firewall rules."
    /etc/rc.d/pf reload

    # The default route may be pointing to another interface.  Find out
    # the IP address of the default gateway, delete it and point to the
    # default gateway configured as ${defaultrouter}.
    if [ -n "${defaultrouter}" ]; then
        echo '## Setting default router.'
        _oldrouter=`netstat -rn | grep default | awk '{print $2}'`
        if [ -n "${_oldrouter}" ]; then
            route delete default "${_oldrouter}"
            unset _oldrouter
        fi
        route add default "$defaultrouter"
    fi

With your version of rt_check_fib() I have no panics so far.  This
doesn't mean we don't have a bug elsewhere, or that it will not panic
tomorrow, but it's nice that things seem a bit more stable now.  The old
version of rt_check_fib() used to panic about one third of the time I
ran my 'home.sh' script...

Now an interesting question is: Is it `normal' that the USED rtentry
objects keep going up at every interface restart and are (at least at
first glance) not reclaimed as fast as they are acquired?
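
To put a number on that last question, something along these lines could
log how much USED changes per restart instead of eyeballing the vmstat
output.  This is only a rough sketch: the helper names rtentry_used and
watch_rtentry are made up for illustration, and it assumes the column
order shown in the vmstat -z output earlier in this message (ITEM, SIZE,
LIMIT, USED, ...) and that home.sh is in the current directory.

```sh
#!/bin/sh

# Hypothetical helper: read `vmstat -z` output on stdin and print the
# USED column of the rtentry zone.  Assumes the column order
# ITEM SIZE LIMIT USED FREE REQUESTS FAILURES.
rtentry_used() {
    # Split fields on ':' or ',' followed by any number of spaces.
    awk -F'[:,] *' '$1 == "rtentry" { print $4 + 0 }'
}

# Hypothetical helper: run the restart script N times and report how
# the USED count changes after each run.
watch_rtentry() {
    count=$1
    prev=$(vmstat -z | rtentry_used)
    i=0
    while [ "$i" -lt "$count" ]; do
        sh home.sh > /dev/null 2>&1
        sleep 1
        cur=$(vmstat -z | rtentry_used)
        echo "run $((i + 1)): USED=${cur} delta=$((cur - prev))"
        prev=$cur
        i=$((i + 1))
    done
}
```

Running, say, `watch_rtentry 20` as root and then leaving the machine
idle for a while before checking `vmstat -z | rtentry_used` again would
show whether the count ever comes back down, which is what one would
expect if the entries are only reclaimed lazily rather than leaked.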