From owner-freebsd-net@FreeBSD.ORG Fri Oct 14 08:49:45 2011 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0EEA2106564A; Fri, 14 Oct 2011 08:49:45 +0000 (UTC) (envelope-from s.dave.jones@gmail.com) Received: from mail-vw0-f54.google.com (mail-vw0-f54.google.com [209.85.212.54]) by mx1.freebsd.org (Postfix) with ESMTP id 525568FC1A; Fri, 14 Oct 2011 08:49:44 +0000 (UTC) Received: by vws11 with SMTP id 11so969809vws.13 for ; Fri, 14 Oct 2011 01:49:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=I5ioUnKLu8HSczsHc3fuY99puN9Mk00LMkLPPG/NMIo=; b=Yu5BLKa+B4sPYFiIF3e5DltMRDaqBgSNalovrBQoXGv9qb1NcwbaJo6rewCJ+APFXt rQbTNCKYBqhsHRJv0OVvxGI+DjlBXUnZbbrSeRcU980Tns4JM2GDn/lNAzqn0SU0p0Td ZeGt6g9RbTmSoZTu8VaE369VHVR/+TQnkJUZ8= MIME-Version: 1.0 Received: by 10.52.94.97 with SMTP id db1mr7618000vdb.67.1318582183627; Fri, 14 Oct 2011 01:49:43 -0700 (PDT) Received: by 10.52.186.170 with HTTP; Fri, 14 Oct 2011 01:49:43 -0700 (PDT) In-Reply-To: <86fwiy91ty.fsf@in138.ua3> References: <8662kcigif.fsf@kopusha.home.net> <86y5x0ooik.fsf@in138.ua3> <86fwiy91ty.fsf@in138.ua3> Date: Fri, 14 Oct 2011 16:49:43 +0800 Message-ID: From: dave jones To: Mikolaj Golub Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Adrian Chadd , "K. Macy" , "freebsd-net@freebsd.org" , bz@freebsd.org, Robert Watson , Arnaud Lacombe Subject: Re: Kernel panic on FreeBSD 9.0-beta2 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Oct 2011 08:49:45 -0000 2011/10/12 Mikolaj Golub wrote: > > On Wed, 12 Oct 2011 09:53:34 +0800 dave jones wrote: > > =A0dj> On Fri, Oct 7, 2011 at 9:12 AM, dave jones =A0wrote: > =A0>> 2011/10/4 Mikolaj Golub : > =A0>>> > =A0>>> On Sat, 1 Oct 2011 14:15:45 +0800 dave jones wrote: > =A0>>> > =A0>>> =A0dj> On Fri, Sep 30, 2011 at 9:41 PM, Robert Watson wrote: > =A0>>> =A0>> > =A0>>> =A0>> On Wed, 28 Sep 2011, Mikolaj Golub wrote: > =A0>>> =A0>> > =A0>>> =A0>>> On Mon, 26 Sep 2011 16:12:55 +0200 K. Macy wrote: > =A0>>> =A0>>> > =A0>>> =A0>>> KM> Sorry, didn't look at the images (limited bw), I've see= n something KM> > =A0>>> =A0>>> like this before in timewait. This "can't happen" with UDP = so will be KM> > =A0>>> =A0>>> interested in learning more about the bug. > =A0>>> =A0>>> > =A0>>> =A0>>> The panic can be easily triggered by this: > =A0>>> =A0>> > =A0>>> =A0>> Hi: > =A0>>> =A0>> > =A0>>> =A0>> Just catching up on this thread. =A0I think the analysis her= e is generally > =A0>>> =A0>> right: in 9.0, you're much more likely to see an inpcb with = its in_socket > =A0>>> =A0>> pointer cleared in the hash list than in prior releases, and > =A0>>> =A0>> in_pcbbind_setup() trips over this. > =A0>>> =A0>> > =A0>>> =A0>> However, at least on first glance (and from the perspective = of invariants > =A0>>> =A0>> here), I think the bug is actualy that in_pcbbind_setup() is= asking > =A0>>> =A0>> in_pcblookup_local() for an inpcb and then access the return= ed inpcb's > =A0>>> =A0>> in_socket pointer without acquiring a lock on the inpcb. =A0= Structurally, it > =A0>>> =A0>> can't acquire this lock for lock order reasons -- it already= holds the lock > =A0>>> =A0>> on its own inpcb. =A0Therefore, we should only access fields= that are safe to > =A0>>> =A0>> follow in an inpcb when you hold a reference via the hash lo= ck and not a > =A0>>> =A0>> lock on the inpcb itself, which appears generally OK (+/-) f= or all the > =A0>>> =A0>> fields in that clause but the t->inp_socket->so_options dere= ference. > =A0>>> =A0>> > =A0>>> =A0>> A preferred fix would cache the SO_REUSEPORT flag in an inpc= b-layer field, > =A0>>> =A0>> such as inp_flags2, giving us access to its value without ha= ving to walk > =A0>>> =A0>> into the attached (or not) socket. > =A0>>> =A0>> > =A0>>> =A0>> This raises another structural question, which is whether we= need a new > =A0>>> =A0>> inp_foo flags field that is protected explicitly by the hash= lock, and not > =A0>>> =A0>> by the inpcb lock, which could hold fields relevant to addre= ss binding. =A0I > =A0>>> =A0>> don't think we need to solve that problem in this context, a= s a slightly > =A0>>> =A0>> race on SO_REUSEPORT is likely acceptable. > =A0>>> =A0>> > =A0>>> =A0>> The suggested fix does perform the desired function of expli= citly detaching > =A0>>> =A0>> the inpcb from the hash list before the socket is disconnect= ed from the > =A0>>> =A0>> inpcb. However, it's incomplete in that the invariant that's= being broken is > =A0>>> =A0>> also relied on for other protocols (such as raw sockets). = =A0The correct > =A0>>> =A0>> invariant is that inp_socket is safe to follow unconditional= ly if an inpcb > =A0>>> =A0>> is locked and INP_DROPPED isn't set -- the bug is in "locked= " not in > =A0>>> =A0>> "INP_DROPPED", which is why I think this is the wrong fix, e= ven though it > =A0>>> =A0>> prevents a panic :-). > =A0>>> > =A0>>> =A0dj> Hello Robert, > =A0>>> > =A0>>> =A0dj> Thank you for taking your valuable time to find out the pro= blem. > =A0>>> =A0dj> Since I don't have idea about network internals, would you = have a patch > =A0>>> =A0dj> about this? I'd be glad to test it, thanks again. > =A0>>> > =A0>>> Here is the patch that implements what Robert suggests. > =A0>>> > =A0>>> Dave, could you test it? > =A0>> > =A0>> Sure. Thanks for cooking the patch. > =A0>> Machines have been running two days now without panic. > > Thank you for testing it. > > =A0dj> Is there any plan to commit your fix? Thank you. > =A0dj> I'd upgrade to 9.0-release from beta-2 once it's released. > > I have an upgraded version of the patch, which is under review now. I hav= e > been waiting for the response before asking you to test it, but it would = be > great if you try it not waiting :-). > > As it was pointed by Robert the previous version introduced a regression: > SO_REUSEPORT was ignored if setsockopt was called after bind (the old cac= hed > value was still used). So the updated version fixes this and also contain= s > several other fixes, the most important among them is that it fixes the p= anic > for IPv6 bind case too. Thanks for cooking the patch. Machines have been running up days without any panic. > -- > Mikolaj Golub Regards, Dave.