Date:      Sun, 20 Feb 2011 14:58:55 +0100
From:      Luigi Rizzo <rizzo@iet.unipi.it>
To:        Pawel Tyll <ptyll@nitronet.pl>
Cc:        Brandon Gooch <jamesbrandongooch@gmail.com>, freebsd-ipfw@freebsd.org, Jack Vogel <jfvogel@gmail.com>, freebsd-net@freebsd.org
Subject:   problem analysys (Re: [Panic] Dummynet/IPFW related recurring crash.)
Message-ID:  <20110220135855.GA4794@onelab2.iet.unipi.it>
In-Reply-To: <1145317277.20110220045434@nitronet.pl>
References:  <410175608.20110220013900@nitronet.pl> <AANLkTimWkWYj=HB5BTM0nwYWgNia-Wq4bYHsRB=OjVP7@mail.gmail.com> <AANLkTi=CLDFGxLQ8rdq3hg0KN9aYZA_YDwDWbqk5gcz2@mail.gmail.com> <1145317277.20110220045434@nitronet.pl>

On Sun, Feb 20, 2011 at 04:54:34AM +0100, Pawel Tyll wrote:
> > I've never seen a trace like this, and know absolutely nothing about dummynet, sorry.
> > If it is in some way em's fault, then making sure you have the latest code would be
> > a good idea. I have a test driver that is under selective test, it does affect the code
> > path that you seem to be in, so it might be worth a try. If you want to try it early
> > just pipe up and I'll send it.
> I'm less and less sure that it has anything to do with em. I'd like to
> hear Luigi's take on all this. That being said, I'll gladly try the
> new driver -- if I'm right, I'll drop under 7 day reboot threshold
> later into the year anyway, so I really need a permanent solution of
> some kind. Apparently the next crash always comes sooner than the previous
> one, which coincides with growing traffic.
> 

i fully welcome pawel's (or anyone else's) bug reports, and
consider them a contribution to improving the system and not
a way to get free consulting, so there is no need for him to apologize. In
fact, i even welcome direct emails if people feel i missed some
reports which i should read.

At the same time, everyone should understand that some bugs are hard
and time-consuming to track down, and so when the presentation
suggests that the problem falls in this category, even developers with
deep knowledge of the subsystems involved may step back because of
lack of time (and this would not be fixable even if monetary incentives
were involved). Conversely, there are cases where somehow one can
quickly identify a problem and a fix, and you see it coming out either
as a commit to the source tree, or as a patch by email. I have done this
myself many times, and have seen the same from many other developers.
 
The way a problem is presented has a big impact on how it gets handled:
in this specific case the poster is pointing out a possible culprit
(which may be helpful or misleading), and gives no hint on other
things that may be relevant: number of interfaces, vlans, tunnels, taps,
bpf, etc.? Any other significant activity on the machine, such as interfaces
going up or down, routing daemons, etc.? Amount of traffic?
Without further details, I can only trust the statements in the report,
and this determines how i classify the bug and decide whether i have time
or ideas to pursue it.

The bug in this case seems to fall in the 'hard, irreproducible' category:
panics *always* need many, many days to happen on machines under heavy load,
no panics on similar machines under lighter load.
With a description like this, i am afraid, i can't even start looking
at the problem because i have no chance to reproduce it.

Now let's forget what is in the bug report and dig into the
backtrace at http://www.freebsd.org/cgi/query-pr.cgi?pr=152360
assuming that the information there is reliable (which we cannot tell
for sure, as the stack could be corrupt). Note that this is the kind of
analysis that I would expect the poster to do, because it does not
require a huge amount of time and is part of the "fair" sharing of
responsibilities to get a bug fixed in a cooperative environment.
The panic seems to occur in

    at /usr/src/sys/amd64/amd64/exception.S:223
    #7 0xffffffff80698abf in in_localaddr (in=Variable "in" is not available.
    ) at /usr/src/sys/netinet/in.c:115

which is a piece of code that scans the list of interfaces.
The argument to in_localaddr() is an ipv4 addr passed by value,
so the argument itself cannot be what faults, even if its value were bogus.
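
To make the reasoning above concrete, here is a minimal userland sketch
of a function that takes an address by value and scans a list. It is
NOT the code in netinet/in.c: struct ifaddr_entry and is_local_addr()
are hypothetical names made up for illustration, and the real list
type and matching rules may well differ. The point is only that the
argument is a plain integer, while every pointer dereference happens
on the list entries.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of an interface address list. */
struct ifaddr_entry {
    uint32_t addr;               /* interface address, host byte order */
    uint32_t mask;               /* subnet mask, host byte order */
    struct ifaddr_entry *next;   /* next entry; NULL terminates the list */
};

/*
 * Return 1 if `in` falls within the subnet of any entry, 0 otherwise.
 *
 * `in` is an integer copied by value: whatever it contains, using it is
 * only arithmetic. The only memory accesses through pointers are on
 * `ia`, so a fault in a loop like this implicates the list, not `in`.
 */
int is_local_addr(const struct ifaddr_entry *head, uint32_t in)
{
    const struct ifaddr_entry *ia;

    for (ia = head; ia != NULL; ia = ia->next) {
        if ((in & ia->mask) == (ia->addr & ia->mask))
            return 1;
    }
    return 0;
}
```

If an entry's next pointer were overwritten with garbage, the loop
would follow it and fault on the next dereference of ia, no matter
which address was passed in.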

This seems to suggest that the problem is elsewhere -- perhaps
some piece of code is manipulating the IN_IFADDR list without
locking, causing it to become corrupt?
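
As an illustration of what "manipulating the list with locking" means,
here is a small userland sketch in which every insert, removal and
scan takes the same lock. Everything in it is an assumption made for
the example: struct addr_list, struct addr_node and the addr_list_*()
functions are hypothetical, and the lock is a simple C11 atomic_flag
spinlock. The kernel has its own locking primitives, which this does
not try to reproduce.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node; not the kernel's own structure. */
struct addr_node {
    uint32_t addr;
    struct addr_node *next;
};

/* Hypothetical list head guarded by a simple spinlock. */
struct addr_list {
    atomic_flag lock;
    struct addr_node *head;
};

static void list_lock(struct addr_list *l)
{
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

static void list_unlock(struct addr_list *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Insert at the head while holding the lock. */
void addr_list_insert(struct addr_list *l, struct addr_node *n)
{
    list_lock(l);
    n->next = l->head;
    l->head = n;
    list_unlock(l);
}

/* Unlink n if present; returns 1 if it was removed, 0 if not found. */
int addr_list_remove(struct addr_list *l, struct addr_node *n)
{
    struct addr_node **pp;
    int found = 0;

    list_lock(l);
    for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            n->next = NULL;
            found = 1;
            break;
        }
    }
    list_unlock(l);
    return found;
}

/* Scan under the same lock, so no writer can unlink an entry mid-walk. */
int addr_list_contains(struct addr_list *l, uint32_t addr)
{
    const struct addr_node *n;
    int found = 0;

    list_lock(l);
    for (n = l->head; n != NULL; n = n->next) {
        if (n->addr == addr) {
            found = 1;
            break;
        }
    }
    list_unlock(l);
    return found;
}
```

A code path that called the equivalent of addr_list_remove() without
taking the lock could unlink or free an entry while another thread is
in the middle of the scan, which is one way a reader could end up
following a stale pointer.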

	cheers
	luigi
-----------------------------------------+-------------------------------
  Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
  TEL      +39-050-2211611               . via Diotisalvi 2
  Mobile   +39-338-6809875               . 56122 PISA (Italy)
-----------------------------------------+-------------------------------