Date: Sun, 20 Feb 2011 14:58:55 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Pawel Tyll <ptyll@nitronet.pl>
Cc: Brandon Gooch <jamesbrandongooch@gmail.com>, freebsd-ipfw@freebsd.org, Jack Vogel <jfvogel@gmail.com>, freebsd-net@freebsd.org
Subject: problem analysys (Re: [Panic] Dummynet/IPFW related recurring crash.)
Message-ID: <20110220135855.GA4794@onelab2.iet.unipi.it>
In-Reply-To: <1145317277.20110220045434@nitronet.pl>
References: <410175608.20110220013900@nitronet.pl>
    <AANLkTimWkWYj=HB5BTM0nwYWgNia-Wq4bYHsRB=OjVP7@mail.gmail.com>
    <AANLkTi=CLDFGxLQ8rdq3hg0KN9aYZA_YDwDWbqk5gcz2@mail.gmail.com>
    <1145317277.20110220045434@nitronet.pl>
On Sun, Feb 20, 2011 at 04:54:34AM +0100, Pawel Tyll wrote:
> > I've never seen a trace like this, and know absolutely nothing about dummynet, sorry.
> > If it is in some way em's fault, then making sure you have the latest code would be
> > a good idea. I have a test driver that is under selective test, it does affect the code
> > path that you seem to be in, so it might be worth a try. If you want to try it early
> > just pipe up and I'll send it.
> I'm less and less sure that it has anything to do with em. I'd like to
> hear Luigi's take on all this. That being said, I'll gladly try the
> new driver -- if I'm right, I'll drop under 7 day reboot threshold
> later into the year anyway, so I really need a permanent solution of
> some kind. Apparently next crash always comes sooner than the previous
> one, which coincides with growing traffic.
>

I fully welcome Pawel's (or anyone else's) bug reports, and consider
them a contribution to improving the system and not a way to get free
consulting, so there is no need for him to apologize. In fact, I even
welcome direct emails if people feel I missed some reports which I
should read.

At the same time, everyone should understand that some bugs are hard
and time-consuming to track down. When the presentation suggests that
a problem falls in this category, even developers with deep knowledge
of the subsystems involved may step back because of lack of time (and
this would not be fixable even if monetary incentives were involved).

Conversely, there are cases where one can quickly identify a problem
and a fix, and you see it coming out either as a commit to the source
tree or as a patch by email. I have done this myself many times, and
have seen the same from many other developers.

The way a problem is presented has a big impact on how it gets handled:
in this specific case the poster is pointing out a possible culprit
(which may be helpful or misleading), and gives no hint on other things
that may be relevant:

- number of interfaces, vlans, tunnels, taps, bpf etc ?
- any significant other activity on the machine, such as interfaces
  going up or down, routing daemons etc ?
- amount of traffic ?

Without further details, I can only trust the statements in the report,
and this determines how I classify the bug and decide whether I have
time or ideas to pursue it.

The bug in this case seems to fall in the 'hard, irreproducible'
category: panics *always* need many many days to happen on machines
under heavy load, no panics on similar machines under lighter load.
With a description like this, I am afraid, I can't even start looking
at the problem because I have no chance to reproduce it.

Now let's forget what is in the bug report and dig into the backtrace at

    http://www.freebsd.org/cgi/query-pr.cgi?pr=152360

assuming that the information there is reliable (which we cannot tell
for sure, as the stack could be corrupt).
Note that this is the kind of analysis that I would expect the poster
to make, because it does not require a huge amount of time and is part
of the "fair" sharing of responsibilities to get a bug fixed in a
cooperative environment.

The panic seems to occur in

        at /usr/src/sys/amd64/amd64/exception.S:223
    #7  0xffffffff80698abf in in_localaddr (in=Variable "in" is not available.
    ) at /usr/src/sys/netinet/in.c:115

which is a piece of code that scans the list of interfaces.
The argument to in_localaddr() is an ipv4 addr passed by value, so the
function never dereferences it: even a bogus argument could not cause
the fault by itself.

This seems to suggest that the problem is elsewhere -- perhaps
some piece of code is manipulating the IN_IFADDR list without
locking, causing it to become corrupt ?

cheers
luigi

-----------------------------------------+-------------------------------
  Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
  TEL      +39-050-2211611               . via Diotisalvi 2
  Mobile   +39-338-6809875               . 56122 PISA (Italy)
-----------------------------------------+-------------------------------