From owner-freebsd-current@FreeBSD.ORG  Sun Sep 14 12:56:53 2008
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 27D66106564A;
	Sun, 14 Sep 2008 12:56:53 +0000 (UTC)
	(envelope-from keramida@freebsd.org)
Received: from igloo.linux.gr (igloo.linux.gr [62.1.205.36])
	by mx1.freebsd.org (Postfix) with ESMTP id 924CB8FC0C;
	Sun, 14 Sep 2008 12:56:52 +0000 (UTC)
	(envelope-from keramida@freebsd.org)
Received: from kobe.laptop (adsl172-222.kln.forthnet.gr [62.1.21.222])
	(authenticated bits=128)
	by igloo.linux.gr (8.14.3/8.14.3/Debian-5) with ESMTP id m8ECuQ6v006475
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT);
	Sun, 14 Sep 2008 15:56:40 +0300
Received: from kobe.laptop (kobe.laptop [127.0.0.1])
	by kobe.laptop (8.14.3/8.14.3) with ESMTP id m8ECuQpO009757;
	Sun, 14 Sep 2008 15:56:26 +0300 (EEST)
	(envelope-from keramida@freebsd.org)
Received: (from keramida@localhost)
	by kobe.laptop (8.14.3/8.14.3/Submit) id m8ECuDAv009736;
	Sun, 14 Sep 2008 15:56:13 +0300 (EEST)
	(envelope-from keramida@freebsd.org)
From: Giorgos Keramidas <keramida@freebsd.org>
To: Julian Elischer <julian@elischer.org>
References: <87prnjh80z.fsf@kobe.laptop>
	<alpine.BSF.1.10.0809131105280.55411@fledge.watson.org>
	<48CC14AD.4090708@elischer.org> <874p4ju8t3.fsf@kobe.laptop>
	<87zlmbstv1.fsf@kobe.laptop> <48CCAF23.1010605@elischer.org>
Date: Sun, 14 Sep 2008 15:56:12 +0300
Message-ID: <87tzcij383.fsf@kobe.laptop>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-MailScanner-ID: m8ECuQ6v006475
X-Hellug-MailScanner: Found to be clean
X-Hellug-MailScanner-SpamCheck: not spam, SpamAssassin (not cached,
	score=-4.55, required 5, autolearn=not spam, ALL_TRUSTED -1.80,
	AWL -0.15, BAYES_00 -2.60)
X-Hellug-MailScanner-From: keramida@freebsd.org
X-Spam-Status: No
Cc: freebsd-current@freebsd.org, Robert Watson <rwatson@freebsd.org>
Subject: Re: panic in rt_check_fib()
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 14 Sep 2008 12:56:53 -0000

On Sat, 13 Sep 2008 23:28:51 -0700, Julian Elischer <julian@elischer.org> wrote:
> To recap on this, I rewrote this function a couple of weeks ago because I
> couldn't keep track of what was going on, and I thought it might
> have some bad edge cases.  A couple of days later Giorgos contacted me
> saying that he had a fairly reproducible situation
> where this was triggered and it appeared to be an edge case in
> this function that allowed it to try to lock the same lock twice.
>
> I immediately thought "ah-hah!" I may have a solution to this,
> and gave him a copy of my new function and indeed it DOES fix that
> panic.  However, after deleting and recreating interfaces a few hundred
> times without crashing in rt_check_fib() it then fails somewhere else
> (actually it leaks some resources and eventually networking stops).
>
> I'm not convinced that is a problem with the new or old rt_check() but
> it did stop me from just committing the new code.
>
> In rereading the way the function worked (and still works), it
> occurred to me that there was a large flaw in the way it worked.
>
> It dropped the lock on one route while it went off and did something
> else that might block.  On returning it blindly re-grabbed that lock,
> completely ignoring the fact that the route might not even be valid any
> more (or any of several other things that may have changed while
> it was away, maybe sleeping).
>
> The code Giorgos is referring to is a patch I suggested to him to
> fix this oversight and not the one that I originally tested and
> had suggested to fix the edge case.
>
> I do however ask that some other people look at this patch!

Exactly.  Thanks for summarizing this so well :)

I have booted a kernel with your latest patch (from the quoted message
above), and I can no longer panic it with the script that used to do so
in a semi-reliable manner:

% root@kobe:/root# while true ; do \
%         sh home.sh > /dev/null 2>&1 ; \
%         vmstat -z | sed -n -e 1p -e /rt/p ; \
%         sleep 1 ; \
%     done
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       19,       77,       43,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       20,       76,       47,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       21,       75,       51,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       23,       73,       55,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       24,       72,       59,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       25,       71,       62,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       26,       70,       65,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       27,       69,       69,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       29,       67,       73,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       30,       66,       76,        0
% ^C
% root@kobe:/root# sh home.sh

rtentries seem to be going up every time I cycle through the script,
which essentially brings down both wireless and wired interfaces and
then brings up the wired interface of my laptop.  The core of the script
is currently:

  # network interface options
  export ifconfig_re0="inet 192.168.1.10/24"
  export defaultrouter='192.168.1.1'

  echo '## Stopping network interfaces.'
  /etc/rc.d/netif stop re0  && ifconfig re0  delete
  /etc/rc.d/netif stop iwn0 && ifconfig iwn0 delete

  echo '## Bringing up network interface.'
  /etc/rc.d/netif start re0

  echo "## Reloading firewall rules."
  /etc/rc.d/pf reload

  # The default route may be pointing to another interface.  Find out
  # the IP address of the default gateway, delete it and point to the
  # default gateway configured as ${defaultrouter}.
  if [ -n "${defaultrouter}" ]; then
          echo '## Setting default router.'
          _oldrouter=`netstat -rn | grep default | awk '{print $2}'`
          if [ -n "${_oldrouter}" ]; then
                  route delete default "${_oldrouter}"
                  unset _oldrouter
          fi
          route add default "$defaultrouter"
  fi

With your version of rt_check_fib() I have no panics so far.  This
doesn't mean we don't have a bug elsewhere, or that it will not panic
tomorrow, but it's nice that things seem a bit more stable now.  The old
version of rt_check_fib() used to panic about one third of the time I
ran my 'home.sh' script...

Now an interesting question is: Is it `normal' that the USED rtentry
objects keep going up at every interface restart and are (at least at
first glance) not reclaimed as fast as they are acquired?
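
To put a number on that growth, something like the following could be
piped after the loop shown earlier.  It is only a rough sketch: the
function name rt_growth is made up for illustration, and it assumes the
`vmstat -z' line for the zone looks exactly like the `rtentry:' lines in
the transcript above (name, then SIZE, LIMIT, USED, ... separated by
commas).  It prints the USED count for each sample, followed by the
change since the previous sample.

```shell
# Hypothetical helper: read one or more `vmstat -z' dumps on stdin and,
# for every "rtentry:" line, print the USED count and (from the second
# sample onwards) the difference from the previous sample.
rt_growth() {
    awk -F'[:,]' '
        $1 == "rtentry" {
            used = $4 + 0               # 4th field is USED; "+ 0" strips padding
            if (seen)
                print used, used - prev
            else
                print used
            prev = used
            seen = 1
        }'
}

# Possible usage (assumes home.sh is the script quoted in this message):
#   while true ; do
#       sh home.sh > /dev/null 2>&1
#       vmstat -z
#       sleep 1
#   done | rt_growth
```

A steady positive delta on every cycle would suggest entries are not
being released; a count that levels off after a while would point to
delayed reclamation instead.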