Date: Tue, 04 Oct 2011 16:49:23 +0300 From: Mikolaj Golub <trociny@freebsd.org> To: dave jones <s.dave.jones@gmail.com> Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, Adrian Chadd <adrian@freebsd.org>, Robert Watson <rwatson@freebsd.org>, "K. Macy" <kmacy@freebsd.org>, Arnaud Lacombe <lacombar@gmail.com> Subject: Re: Kernel panic on FreeBSD 9.0-beta2 Message-ID: <86y5x0ooik.fsf@in138.ua3> In-Reply-To: <CANf5e8ab=mUw-AJuRXZy1T6%2BZcryxjKfuCOsakPPfqatuA3HdA@mail.gmail.com> (dave jones's message of "Sat, 1 Oct 2011 14:15:45 %2B0800") References: <CANf5e8aG4go4M_vsRExUsJB_sjaN5x-QK-TCDAhSH64JSo0mdQ@mail.gmail.com> <CACqU3MXStMMEoppvDtZS6hV4WGttbdJiF8E-ORwJ%2BQSmnTy-Yg@mail.gmail.com> <CACqU3MV-t4Va6VWUoXy1Y9FYnNJTUw1X%2BE7ik-2%2BtMVuVOV3RA@mail.gmail.com> <CAJ-Vmom-177OkdUXjz%2BZLqbaqn=p%2BuTGypiVuMqdeXgdOgb4hQ@mail.gmail.com> <CAHM0Q_Mmn3z1V6AtZHQMpgbdY7oQqOChiNt=8NJrZQDnravb7A@mail.gmail.com> <CACqU3MU9ZZtOsdBOa%2BF3SqUaYgO%2BEo0v1ACjY0S4rY4fRQyv5Q@mail.gmail.com> <CAHM0Q_PZD9_0ZkELZ5XL8Ebh8eD-uFuSjXWKKVpGDeM_JDaqMA@mail.gmail.com> <8662kcigif.fsf@kopusha.home.net> <alpine.BSF.2.00.1109301432570.65269@fledge.watson.org> <CANf5e8ab=mUw-AJuRXZy1T6%2BZcryxjKfuCOsakPPfqatuA3HdA@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
--=-=-=
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 8bit
On Sat, 1 Oct 2011 14:15:45 +0800 dave jones wrote:
dj> On Fri, Sep 30, 2011 at 9:41 PM, Robert Watson wrote:
>>
>> On Wed, 28 Sep 2011, Mikolaj Golub wrote:
>>
>>> On Mon, 26 Sep 2011 16:12:55 +0200 K. Macy wrote:
>>>
>>> KM> Sorry, didn't look at the images (limited bw), I've seen something KM>
>>> like this before in timewait. This "can't happen" with UDP so will be KM>
>>> interested in learning more about the bug.
>>>
>>> The panic can be easily triggered by this:
>>
>> Hi:
>>
>> Just catching up on this thread. šI think the analysis here is generally
>> right: in 9.0, you're much more likely to see an inpcb with its in_socket
>> pointer cleared in the hash list than in prior releases, and
>> in_pcbbind_setup() trips over this.
>>
>> However, at least on first glance (and from the perspective of invariants
>> here), I think the bug is actualy that in_pcbbind_setup() is asking
>> in_pcblookup_local() for an inpcb and then access the returned inpcb's
>> in_socket pointer without acquiring a lock on the inpcb. šStructurally, it
>> can't acquire this lock for lock order reasons -- it already holds the lock
>> on its own inpcb. šTherefore, we should only access fields that are safe to
>> follow in an inpcb when you hold a reference via the hash lock and not a
>> lock on the inpcb itself, which appears generally OK (+/-) for all the
>> fields in that clause but the t->inp_socket->so_options dereference.
>>
>> A preferred fix would cache the SO_REUSEPORT flag in an inpcb-layer field,
>> such as inp_flags2, giving us access to its value without having to walk
>> into the attached (or not) socket.
>>
>> This raises another structural question, which is whether we need a new
>> inp_foo flags field that is protected explicitly by the hash lock, and not
>> by the inpcb lock, which could hold fields relevant to address binding. šI
>> don't think we need to solve that problem in this context, as a slightly
>> race on SO_REUSEPORT is likely acceptable.
>>
>> The suggested fix does perform the desired function of explicitly detaching
>> the inpcb from the hash list before the socket is disconnected from the
>> inpcb. However, it's incomplete in that the invariant that's being broken is
>> also relied on for other protocols (such as raw sockets). šThe correct
>> invariant is that inp_socket is safe to follow unconditionally if an inpcb
>> is locked and INP_DROPPED isn't set -- the bug is in "locked" not in
>> "INP_DROPPED", which is why I think this is the wrong fix, even though it
>> prevents a panic :-).
dj> Hello Robert,
dj> Thank you for taking your valuable time to find out the problem.
dj> Since I don't have idea about network internals, would you have a patch
dj> about this? I'd be glad to test it, thanks again.
Here is the patch that implements what Robert suggests.
Dave, could you test it?
>> Robert
dj> Best regards,
dj> Dave.
--
Mikolaj Golub
--=-=-=
Content-Type: text/x-diff
Content-Disposition: attachment; filename=in_pcb.REUSEPORT.patch
Index: sys/netinet/in_pcb.c
===================================================================
--- sys/netinet/in_pcb.c (revision 225885)
+++ sys/netinet/in_pcb.c (working copy)
@@ -575,8 +575,7 @@ in_pcbbind_setup(struct inpcb *inp, struct sockadd
ntohl(t->inp_faddr.s_addr) == INADDR_ANY) &&
(ntohl(sin->sin_addr.s_addr) != INADDR_ANY ||
ntohl(t->inp_laddr.s_addr) != INADDR_ANY ||
- (t->inp_socket->so_options &
- SO_REUSEPORT) == 0) &&
+ (t->inp_flags2 & INP_REUSEPORT) == 0) &&
(inp->inp_cred->cr_uid !=
t->inp_cred->cr_uid))
return (EADDRINUSE);
@@ -595,14 +594,15 @@ in_pcbbind_setup(struct inpcb *inp, struct sockadd
(reuseport & tw->tw_so_options) == 0)
return (EADDRINUSE);
} else if (t &&
- (reuseport & t->inp_socket->so_options) == 0) {
+ (reuseport == 0 ||
+ (t->inp_flags2 & INP_REUSEPORT) == 0)) {
#ifdef INET6
if (ntohl(sin->sin_addr.s_addr) !=
INADDR_ANY ||
ntohl(t->inp_laddr.s_addr) !=
INADDR_ANY ||
- INP_SOCKAF(so) ==
- INP_SOCKAF(t->inp_socket))
+ (inp->inp_vflag & INP_IPV6PROTO) == 0 ||
+ (t->inp_vflag & INP_IPV6PROTO) == 0)
#endif
return (EADDRINUSE);
}
@@ -1867,6 +1867,11 @@ in_pcbinshash_internal(struct inpcb *inp, int do_p
KASSERT((inp->inp_flags & INP_INHASHLIST) == 0,
("in_pcbinshash: INP_INHASHLIST"));
+ if ((inp->inp_socket->so_options & SO_REUSEPORT) != 0 ||
+ (IN_MULTICAST(ntohl(inp->inp_laddr.s_addr)) &&
+ (inp->inp_socket->so_options & SO_REUSEADDR) != 0))
+ inp->inp_flags2 |= INP_REUSEPORT;
+
#ifdef INET6
if (inp->inp_vflag & INP_IPV6)
hashkey_faddr = inp->in6p_faddr.s6_addr32[3] /* XXX */;
Index: sys/netinet/in_pcb.h
===================================================================
--- sys/netinet/in_pcb.h (revision 225885)
+++ sys/netinet/in_pcb.h (working copy)
@@ -540,6 +540,7 @@ void inp_4tuple_get(struct inpcb *inp, uint32_t *
#define INP_LLE_VALID 0x00000001 /* cached lle is valid */
#define INP_RT_VALID 0x00000002 /* cached rtentry is valid */
#define INP_PCBGROUPWILD 0x00000004 /* in pcbgroup wildcard list */
+#define INP_REUSEPORT 0x00000008 /* socket SO_REUSEPORT option is set */
/*
* Flags passed to in_pcblookup*() functions.
--=-=-=--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?86y5x0ooik.fsf>
