From owner-freebsd-net@FreeBSD.ORG Tue Oct 4 13:49:30 2011 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6445E1065670; Tue, 4 Oct 2011 13:49:30 +0000 (UTC) (envelope-from to.my.trociny@gmail.com) Received: from mail-ey0-f182.google.com (mail-ey0-f182.google.com [209.85.215.182]) by mx1.freebsd.org (Postfix) with ESMTP id 568DE8FC15; Tue, 4 Oct 2011 13:49:29 +0000 (UTC) Received: by eyg7 with SMTP id 7so671275eyg.13 for ; Tue, 04 Oct 2011 06:49:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=from:to:cc:subject:organization:references:sender:date:in-reply-to :message-id:user-agent:mime-version:content-type; bh=+EkGuB1UoP8wjwElWNSbF4XWzHfJ2Q4tba2j5wNmID8=; b=ByqzUMZ6YhR7Rqdg38CtexzHAqD9wJtHBFlvOHMRvAyvTIok6JAw4su1aoOIurgL4S /S0Nbbompz/KDY6fsLoJoCbfNH7ngvqmzYTHjR70tOobomEPvPy8D6TQsv5wzgxMXNiX IhV8GWc9FCRCPoIYDLjju09HyaeX8SFPsOsHQ= Received: by 10.223.26.210 with SMTP id f18mr1797655fac.51.1317736168342; Tue, 04 Oct 2011 06:49:28 -0700 (PDT) Received: from localhost ([94.27.39.186]) by mx.google.com with ESMTPS id n12sm25556725fan.9.2011.10.04.06.49.24 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 04 Oct 2011 06:49:25 -0700 (PDT) From: Mikolaj Golub To: dave jones Organization: TOA Ukraine References: <8662kcigif.fsf@kopusha.home.net> Sender: Mikolaj Golub Date: Tue, 04 Oct 2011 16:49:23 +0300 In-Reply-To: (dave jones's message of "Sat, 1 Oct 2011 14:15:45 +0800") Message-ID: <86y5x0ooik.fsf@in138.ua3> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Cc: "freebsd-net@freebsd.org" , Adrian Chadd , Robert Watson , "K. Macy" , Arnaud Lacombe Subject: Re: Kernel panic on FreeBSD 9.0-beta2 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 04 Oct 2011 13:49:30 -0000 --=-=-= Content-Type: text/plain; charset=koi8-r Content-Transfer-Encoding: 8bit On Sat, 1 Oct 2011 14:15:45 +0800 dave jones wrote: dj> On Fri, Sep 30, 2011 at 9:41 PM, Robert Watson wrote: >> >> On Wed, 28 Sep 2011, Mikolaj Golub wrote: >> >>> On Mon, 26 Sep 2011 16:12:55 +0200 K. Macy wrote: >>> >>> KM> Sorry, didn't look at the images (limited bw), I've seen something KM> >>> like this before in timewait. This "can't happen" with UDP so will be KM> >>> interested in learning more about the bug. >>> >>> The panic can be easily triggered by this: >> >> Hi: >> >> Just catching up on this thread. šI think the analysis here is generally >> right: in 9.0, you're much more likely to see an inpcb with its in_socket >> pointer cleared in the hash list than in prior releases, and >> in_pcbbind_setup() trips over this. >> >> However, at least on first glance (and from the perspective of invariants >> here), I think the bug is actualy that in_pcbbind_setup() is asking >> in_pcblookup_local() for an inpcb and then access the returned inpcb's >> in_socket pointer without acquiring a lock on the inpcb. šStructurally, it >> can't acquire this lock for lock order reasons -- it already holds the lock >> on its own inpcb. šTherefore, we should only access fields that are safe to >> follow in an inpcb when you hold a reference via the hash lock and not a >> lock on the inpcb itself, which appears generally OK (+/-) for all the >> fields in that clause but the t->inp_socket->so_options dereference. >> >> A preferred fix would cache the SO_REUSEPORT flag in an inpcb-layer field, >> such as inp_flags2, giving us access to its value without having to walk >> into the attached (or not) socket. >> >> This raises another structural question, which is whether we need a new >> inp_foo flags field that is protected explicitly by the hash lock, and not >> by the inpcb lock, which could hold fields relevant to address binding. šI >> don't think we need to solve that problem in this context, as a slightly >> race on SO_REUSEPORT is likely acceptable. >> >> The suggested fix does perform the desired function of explicitly detaching >> the inpcb from the hash list before the socket is disconnected from the >> inpcb. However, it's incomplete in that the invariant that's being broken is >> also relied on for other protocols (such as raw sockets). šThe correct >> invariant is that inp_socket is safe to follow unconditionally if an inpcb >> is locked and INP_DROPPED isn't set -- the bug is in "locked" not in >> "INP_DROPPED", which is why I think this is the wrong fix, even though it >> prevents a panic :-). dj> Hello Robert, dj> Thank you for taking your valuable time to find out the problem. dj> Since I don't have idea about network internals, would you have a patch dj> about this? I'd be glad to test it, thanks again. Here is the patch that implements what Robert suggests. Dave, could you test it? >> Robert dj> Best regards, dj> Dave. -- Mikolaj Golub --=-=-= Content-Type: text/x-diff Content-Disposition: attachment; filename=in_pcb.REUSEPORT.patch Index: sys/netinet/in_pcb.c =================================================================== --- sys/netinet/in_pcb.c (revision 225885) +++ sys/netinet/in_pcb.c (working copy) @@ -575,8 +575,7 @@ in_pcbbind_setup(struct inpcb *inp, struct sockadd ntohl(t->inp_faddr.s_addr) == INADDR_ANY) && (ntohl(sin->sin_addr.s_addr) != INADDR_ANY || ntohl(t->inp_laddr.s_addr) != INADDR_ANY || - (t->inp_socket->so_options & - SO_REUSEPORT) == 0) && + (t->inp_flags2 & INP_REUSEPORT) == 0) && (inp->inp_cred->cr_uid != t->inp_cred->cr_uid)) return (EADDRINUSE); @@ -595,14 +594,15 @@ in_pcbbind_setup(struct inpcb *inp, struct sockadd (reuseport & tw->tw_so_options) == 0) return (EADDRINUSE); } else if (t && - (reuseport & t->inp_socket->so_options) == 0) { + (reuseport == 0 || + (t->inp_flags2 & INP_REUSEPORT) == 0)) { #ifdef INET6 if (ntohl(sin->sin_addr.s_addr) != INADDR_ANY || ntohl(t->inp_laddr.s_addr) != INADDR_ANY || - INP_SOCKAF(so) == - INP_SOCKAF(t->inp_socket)) + (inp->inp_vflag & INP_IPV6PROTO) == 0 || + (t->inp_vflag & INP_IPV6PROTO) == 0) #endif return (EADDRINUSE); } @@ -1867,6 +1867,11 @@ in_pcbinshash_internal(struct inpcb *inp, int do_p KASSERT((inp->inp_flags & INP_INHASHLIST) == 0, ("in_pcbinshash: INP_INHASHLIST")); + if ((inp->inp_socket->so_options & SO_REUSEPORT) != 0 || + (IN_MULTICAST(ntohl(inp->inp_laddr.s_addr)) && + (inp->inp_socket->so_options & SO_REUSEADDR) != 0)) + inp->inp_flags2 |= INP_REUSEPORT; + #ifdef INET6 if (inp->inp_vflag & INP_IPV6) hashkey_faddr = inp->in6p_faddr.s6_addr32[3] /* XXX */; Index: sys/netinet/in_pcb.h =================================================================== --- sys/netinet/in_pcb.h (revision 225885) +++ sys/netinet/in_pcb.h (working copy) @@ -540,6 +540,7 @@ void inp_4tuple_get(struct inpcb *inp, uint32_t * #define INP_LLE_VALID 0x00000001 /* cached lle is valid */ #define INP_RT_VALID 0x00000002 /* cached rtentry is valid */ #define INP_PCBGROUPWILD 0x00000004 /* in pcbgroup wildcard list */ +#define INP_REUSEPORT 0x00000008 /* socket SO_REUSEPORT option is set */ /* * Flags passed to in_pcblookup*() functions. --=-=-=--