From owner-freebsd-net@FreeBSD.ORG  Fri Sep 30 13:41:14 2011
Date: Fri, 30 Sep 2011 14:41:13 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Mikolaj Golub <trociny@freebsd.org>
In-Reply-To: <8662kcigif.fsf@kopusha.home.net>
Message-ID: <alpine.BSF.2.00.1109301432570.65269@fledge.watson.org>
References: <CANf5e8aG4go4M_vsRExUsJB_sjaN5x-QK-TCDAhSH64JSo0mdQ@mail.gmail.com>
	<CACqU3MXStMMEoppvDtZS6hV4WGttbdJiF8E-ORwJ+QSmnTy-Yg@mail.gmail.com>
	<CACqU3MV-t4Va6VWUoXy1Y9FYnNJTUw1X+E7ik-2+tMVuVOV3RA@mail.gmail.com>
	<CAJ-Vmom-177OkdUXjz+ZLqbaqn=p+uTGypiVuMqdeXgdOgb4hQ@mail.gmail.com>
	<CAHM0Q_Mmn3z1V6AtZHQMpgbdY7oQqOChiNt=8NJrZQDnravb7A@mail.gmail.com>
	<CACqU3MU9ZZtOsdBOa+F3SqUaYgO+Eo0v1ACjY0S4rY4fRQyv5Q@mail.gmail.com>
	<CAHM0Q_PZD9_0ZkELZ5XL8Ebh8eD-uFuSjXWKKVpGDeM_JDaqMA@mail.gmail.com>
	<8662kcigif.fsf@kopusha.home.net>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>,
	Adrian Chadd <adrian@freebsd.org>, "K. Macy" <kmacy@freebsd.org>,
	Arnaud Lacombe <lacombar@gmail.com>, dave jones <s.dave.jones@gmail.com>
Subject: Re: Kernel panic on FreeBSD 9.0-beta2


On Wed, 28 Sep 2011, Mikolaj Golub wrote:

> On Mon, 26 Sep 2011 16:12:55 +0200 K. Macy wrote:
>
> KM> Sorry, didn't look at the images (limited bw), I've seen something
> KM> like this before in timewait. This "can't happen" with UDP so will be
> KM> interested in learning more about the bug.
>
> The panic can be easily triggered by this:

Hi:

Just catching up on this thread.  I think the analysis here is generally
right: in 9.0, you're much more likely to see an inpcb with its inp_socket
pointer cleared in the hash list than in prior releases, and
in_pcbbind_setup() trips over this.

However, at least on first glance (and from the perspective of invariants
here), I think the bug is actually that in_pcbbind_setup() asks
in_pcblookup_local() for an inpcb and then accesses the returned inpcb's
inp_socket pointer without acquiring a lock on that inpcb.  Structurally, it
can't acquire this lock for lock order reasons -- it already holds the lock on
its own inpcb.  Therefore, it should only access fields that are safe to
follow when holding a reference via the hash lock and not a lock on the inpcb
itself, which appears generally OK (+/-) for all the fields in that clause
except the t->inp_socket->so_options dereference.

A preferred fix would cache the SO_REUSEPORT flag in an inpcb-layer field, 
such as inp_flags2, giving us access to its value without having to walk into 
the attached (or not) socket.
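
To make the idea concrete, here is a rough userspace sketch, not kernel code.
The struct layouts, the *_MODEL constants, and both helper names are stand-ins
invented for illustration; in particular, the inp_flags2 bit is hypothetical.
The point is only that the bind-time check reads inpcb-layer state and never
follows inp_socket:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in constants: the inp_flags2 bit is a hypothetical name and value. */
#define SO_REUSEPORT_MODEL   0x0200
#define INP_REUSEPORT_MODEL  0x0008

struct socket_model {
	int so_options;
};

struct inpcb_model {
	struct socket_model *inp_socket;  /* may be NULL once detached */
	int inp_flags2;
};

/*
 * Mirror the socket option into the inpcb.  Assumed to run where the option
 * is changed, with the inpcb locked and the socket still attached.
 */
static void
inp_cache_reuseport(struct inpcb_model *inp)
{
	assert(inp->inp_socket != NULL);
	if (inp->inp_socket->so_options & SO_REUSEPORT_MODEL)
		inp->inp_flags2 |= INP_REUSEPORT_MODEL;
	else
		inp->inp_flags2 &= ~INP_REUSEPORT_MODEL;
}

/*
 * Bind-time conflict check on an inpcb found via the hash list: consults
 * only the cached flag, so a cleared inp_socket cannot be dereferenced.
 */
static bool
inp_allows_reuseport(const struct inpcb_model *t)
{
	return (t->inp_flags2 & INP_REUSEPORT_MODEL) != 0;
}
```

With something of this shape, in_pcbbind_setup() would test the cached flag on
the inpcb returned by the lookup rather than t->inp_socket->so_options.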

This raises another structural question, which is whether we need a new 
inp_foo flags field that is protected explicitly by the hash lock, and not by 
the inpcb lock, which could hold fields relevant to address binding.  I don't 
think we need to solve that problem in this context, as a slight race on
SO_REUSEPORT is likely acceptable.

The suggested fix does perform the desired function of explicitly detaching 
the inpcb from the hash list before the socket is disconnected from the inpcb. 
However, it's incomplete in that the invariant that's being broken is also 
relied on for other protocols (such as raw sockets).  The correct invariant is 
that inp_socket is safe to follow unconditionally if an inpcb is locked and 
INP_DROPPED isn't set -- the bug is in "locked" not in "INP_DROPPED", which is 
why I think this is the wrong fix, even though it prevents a panic :-).
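
As a rough illustration of that invariant, again a userspace sketch with
made-up names (the "locked" field stands in for holding the inpcb lock, and
INP_DROPPED_SKETCH for INP_DROPPED; neither helper exists in the tree):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define INP_DROPPED_SKETCH 0x1  /* stand-in for INP_DROPPED */

struct sock_sketch {
	int so_options;
};

struct inp_sketch {
	bool locked;                     /* models "caller holds the inpcb lock" */
	int inp_flags;
	struct sock_sketch *inp_socket;
};

/* The invariant: inp_socket may be followed only if locked and not dropped. */
static bool
inp_socket_safe(const struct inp_sketch *inp)
{
	return inp->locked && (inp->inp_flags & INP_DROPPED_SKETCH) == 0;
}

/*
 * Read so_options only when the invariant holds; otherwise return the
 * caller-supplied fallback without touching inp_socket.
 */
static int
inp_get_so_options(const struct inp_sketch *inp, int fallback)
{
	if (!inp_socket_safe(inp))
		return fallback;
	assert(inp->inp_socket != NULL);
	return inp->inp_socket->so_options;
}
```

The in_pcbbind_setup() path fails the "locked" half of this test for the inpcb
it gets back from the lookup, regardless of what INP_DROPPED says.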

Robert