Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 30 May 2011 09:43:55 +0000 (UTC)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   svn commit: r222488 - in head/sys: contrib/pf/net netinet netinet/ipfw netinet6
Message-ID:  <201105300943.p4U9htjI070096@svn.freebsd.org>

next in thread | raw e-mail | index | archive | help
Author: rwatson
Date: Mon May 30 09:43:55 2011
New Revision: 222488
URL: http://svn.freebsd.org/changeset/base/222488

Log:
  Decompose the current single inpcbinfo lock into two locks:
  
  - The existing ipi_lock continues to protect the global inpcb list and
    inpcb counter.  This lock is now relegated to a small number of
    allocation and free operations, and occasional operations that walk
    all connections (including, awkwardly, certain UDP multicast receive
    operations -- something to revisit).
  
  - A new ipi_hash_lock protects the two inpcbinfo hash tables for
    looking up connections and bound sockets, manipulated using new
    INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
    the 4-tuple address space.
  
  Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
  connection locks, so may be acquired while manipulating a connection on
  which a lock is already held, avoiding the need to acquire the inpcbinfo
  lock preemptively when a binding change might later be required.  As a
  result, however, lookup operations necessarily go through a reference
  acquire while holding the lookup lock, later acquiring an inpcb lock --
  if required.
  
  A new function in_pcblookup() looks up connections, and accepts flags
  indicating how to return the inpcb.  Due to lock order changes, callers
  no longer need acquire locks before performing a lookup: the lookup
  routine will acquire the ipi_hash_lock as needed.  In the future, it will
  also be able to use alternative lookup and locking strategies
  transparently to callers, such as pcbgroup lookup.  New lookup flags are,
  supplementing the existing INPLOOKUP_WILDCARD flag:
  
    INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
    INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
  
  Callers must pass exactly one of these flags (for the time being).
  
  Some notes:
  
  - All protocols are updated to work within the new regime; especially,
    TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
    eliminated, and global hash lock hold times are dramatically reduced
    compared to previous locking.
  - The TCP syncache still relies on the pcbinfo lock, something that we
    may want to revisit.
  - Support for reverting to the FreeBSD 7.x locking strategy in TCP input
    is no longer available -- hash lookup locks are now held only very
    briefly during inpcb lookup, rather than for potentially extended
    periods.  However, the pcbinfo ipi_lock will still be acquired if a
    connection state might change such that a connection is added or
    removed.
  - Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
    due to maintaining their own hash tables.
  - The interface in6_pcblookup_hash_locked() is maintained, which allows
    callers to acquire hash locks and perform one or more lookups atomically
    with 4-tuple allocation: this is required only for TCPv6, as there is no
    in6_pcbconnect_setup(), which there should be.
  - UDPv6 locking remains significantly more conservative than UDPv4
    locking, which relates to source address selection.  This needs
    attention, as it likely significantly reduces parallelism in this code
    for multithreaded socket use (such as in BIND).
  - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
    somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
    is no longer sufficient.  A second check once the inpcb lock is held
    should do the trick, keeping the general case from requiring the inpcb
    lock for every inpcb visited.
  - This work reminds us that we need to revisit locking of the v4/v6 flags,
    which may be accessed lock-free both before and after this change.
  - Right now, a single lock name is used for the pcbhash lock -- this is
    undesirable, and probably another argument is required to take care of
    this (or a char array name field in the pcbinfo?).
  
  This is not an MFC candidate for 8.x due to its impact on lookup and
  locking semantics.  It's possible some of these issues could be worked
  around with compatibility wrappers, if necessary.
  
  Reviewed by:    bz
  Sponsored by:   Juniper Networks, Inc.

Modified:
  head/sys/contrib/pf/net/pf.c
  head/sys/netinet/in_pcb.c
  head/sys/netinet/in_pcb.h
  head/sys/netinet/ip_divert.c
  head/sys/netinet/ipfw/ip_fw2.c
  head/sys/netinet/raw_ip.c
  head/sys/netinet/siftr.c
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_subr.c
  head/sys/netinet/tcp_syncache.c
  head/sys/netinet/tcp_timer.c
  head/sys/netinet/tcp_usrreq.c
  head/sys/netinet/udp_usrreq.c
  head/sys/netinet6/in6_pcb.c
  head/sys/netinet6/in6_pcb.h
  head/sys/netinet6/in6_src.c
  head/sys/netinet6/udp6_usrreq.c

Modified: head/sys/contrib/pf/net/pf.c
==============================================================================
--- head/sys/contrib/pf/net/pf.c	Mon May 30 09:41:38 2011	(r222487)
+++ head/sys/contrib/pf/net/pf.c	Mon May 30 09:43:55 2011	(r222488)
@@ -3034,16 +3034,14 @@ pf_socket_lookup(int direction, struct p
 #ifdef INET
 	case AF_INET:
 #ifdef __FreeBSD__
-		INP_INFO_RLOCK(pi);	/* XXX LOR */
-		inp = in_pcblookup_hash(pi, saddr->v4, sport, daddr->v4,
-			dport, 0, NULL);
+		inp = in_pcblookup(pi, saddr->v4, sport, daddr->v4,
+			dport, INPLOOKUP_RLOCKPCB, NULL);
 		if (inp == NULL) {
-			inp = in_pcblookup_hash(pi, saddr->v4, sport,
-			   daddr->v4, dport, INPLOOKUP_WILDCARD, NULL);
-			if(inp == NULL) {
-				INP_INFO_RUNLOCK(pi);
+			inp = in_pcblookup(pi, saddr->v4, sport,
+			   daddr->v4, dport, INPLOOKUP_WILDCARD |
+			   INPLOOKUP_RLOCKPCB, NULL);
+			if (inp == NULL)
 				return (-1);
-			}
 		}
 #else
 		inp = in_pcbhashlookup(tb, saddr->v4, sport, daddr->v4, dport);
@@ -3058,16 +3056,14 @@ pf_socket_lookup(int direction, struct p
 #ifdef INET6
 	case AF_INET6:
 #ifdef __FreeBSD__
-		INP_INFO_RLOCK(pi);
-		inp = in6_pcblookup_hash(pi, &saddr->v6, sport,
-			&daddr->v6, dport, 0, NULL);
+		inp = in6_pcblookup(pi, &saddr->v6, sport,
+			&daddr->v6, dport, INPLOOKUP_RLOCKPCB, NULL);
 		if (inp == NULL) {
-			inp = in6_pcblookup_hash(pi, &saddr->v6, sport,
-			&daddr->v6, dport, INPLOOKUP_WILDCARD, NULL);
-			if (inp == NULL) {
-				INP_INFO_RUNLOCK(pi);
+			inp = in6_pcblookup(pi, &saddr->v6, sport,
+			    &daddr->v6, dport, INPLOOKUP_WILDCARD |
+			    INPLOOKUP_RLOCKPCB, NULL);
+			if (inp == NULL)
 				return (-1);
-			}
 		}
 #else
 		inp = in6_pcbhashlookup(tb, &saddr->v6, sport, &daddr->v6,
@@ -3085,9 +3081,10 @@ pf_socket_lookup(int direction, struct p
 		return (-1);
 	}
 #ifdef __FreeBSD__
+	INP_RLOCK_ASSERT(inp);
 	pd->lookup.uid = inp->inp_cred->cr_uid;
 	pd->lookup.gid = inp->inp_cred->cr_groups[0];
-	INP_INFO_RUNLOCK(pi);
+	INP_RUNLOCK(inp);
 #else
 	pd->lookup.uid = inp->inp_socket->so_euid;
 	pd->lookup.gid = inp->inp_socket->so_egid;

Modified: head/sys/netinet/in_pcb.c
==============================================================================
--- head/sys/netinet/in_pcb.c	Mon May 30 09:41:38 2011	(r222487)
+++ head/sys/netinet/in_pcb.c	Mon May 30 09:43:55 2011	(r222488)
@@ -127,6 +127,10 @@ static VNET_DEFINE(int, ipport_tcplastco
 
 #define	V_ipport_tcplastcount		VNET(ipport_tcplastcount)
 
+static struct inpcb	*in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo,
+			    struct in_addr faddr, u_int fport_arg,
+			    struct in_addr laddr, u_int lport_arg,
+			    int lookupflags, struct ifnet *ifp);
 static void	in_pcbremlists(struct inpcb *inp);
 
 #ifdef INET
@@ -212,11 +216,13 @@ in_pcbinfo_init(struct inpcbinfo *pcbinf
 {
 
 	INP_INFO_LOCK_INIT(pcbinfo, name);
+	INP_HASH_LOCK_INIT(pcbinfo, "pcbinfohash");	/* XXXRW: argument? */
 #ifdef VIMAGE
 	pcbinfo->ipi_vnet = curvnet;
 #endif
 	pcbinfo->ipi_listhead = listhead;
 	LIST_INIT(pcbinfo->ipi_listhead);
+	pcbinfo->ipi_count = 0;
 	pcbinfo->ipi_hashbase = hashinit(hash_nelements, M_PCB,
 	    &pcbinfo->ipi_hashmask);
 	pcbinfo->ipi_porthashbase = hashinit(porthash_nelements, M_PCB,
@@ -234,10 +240,14 @@ void
 in_pcbinfo_destroy(struct inpcbinfo *pcbinfo)
 {
 
+	KASSERT(pcbinfo->ipi_count == 0,
+	    ("%s: ipi_count = %u", __func__, pcbinfo->ipi_count));
+
 	hashdestroy(pcbinfo->ipi_hashbase, M_PCB, pcbinfo->ipi_hashmask);
 	hashdestroy(pcbinfo->ipi_porthashbase, M_PCB,
 	    pcbinfo->ipi_porthashmask);
 	uma_zdestroy(pcbinfo->ipi_zone);
+	INP_HASH_LOCK_DESTROY(pcbinfo);
 	INP_INFO_LOCK_DESTROY(pcbinfo);
 }
 
@@ -309,8 +319,8 @@ in_pcbbind(struct inpcb *inp, struct soc
 {
 	int anonport, error;
 
-	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
 	INP_WLOCK_ASSERT(inp);
+	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
 
 	if (inp->inp_lport != 0 || inp->inp_laddr.s_addr != INADDR_ANY)
 		return (EINVAL);
@@ -351,8 +361,8 @@ in_pcb_lport(struct inpcb *inp, struct i
 	 * Because no actual state changes occur here, a global write lock on
 	 * the pcbinfo isn't required.
 	 */
-	INP_INFO_LOCK_ASSERT(pcbinfo);
 	INP_LOCK_ASSERT(inp);
+	INP_HASH_LOCK_ASSERT(pcbinfo);
 
 	if (inp->inp_flags & INP_HIGHPORT) {
 		first = V_ipport_hifirstauto;	/* sysctl */
@@ -473,11 +483,10 @@ in_pcbbind_setup(struct inpcb *inp, stru
 	int error;
 
 	/*
-	 * Because no actual state changes occur here, a global write lock on
-	 * the pcbinfo isn't required.
+	 * No state changes, so read locks are sufficient here.
 	 */
-	INP_INFO_LOCK_ASSERT(pcbinfo);
 	INP_LOCK_ASSERT(inp);
+	INP_HASH_LOCK_ASSERT(pcbinfo);
 
 	if (TAILQ_EMPTY(&V_in_ifaddrhead)) /* XXX broken! */
 		return (EADDRNOTAVAIL);
@@ -618,8 +627,8 @@ in_pcbconnect(struct inpcb *inp, struct 
 	in_addr_t laddr, faddr;
 	int anonport, error;
 
-	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
 	INP_WLOCK_ASSERT(inp);
+	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
 
 	lport = inp->inp_lport;
 	laddr = inp->inp_laddr.s_addr;
@@ -907,8 +916,8 @@ in_pcbconnect_setup(struct inpcb *inp, s
 	 * Because a global state change doesn't actually occur here, a read
 	 * lock is sufficient.
 	 */
-	INP_INFO_LOCK_ASSERT(inp->inp_pcbinfo);
 	INP_LOCK_ASSERT(inp);
+	INP_HASH_LOCK_ASSERT(inp->inp_pcbinfo);
 
 	if (oinpp != NULL)
 		*oinpp = NULL;
@@ -983,8 +992,8 @@ in_pcbconnect_setup(struct inpcb *inp, s
 		if (error)
 			return (error);
 	}
-	oinp = in_pcblookup_hash(inp->inp_pcbinfo, faddr, fport, laddr, lport,
-	    0, NULL);
+	oinp = in_pcblookup_hash_locked(inp->inp_pcbinfo, faddr, fport,
+	    laddr, lport, 0, NULL);
 	if (oinp != NULL) {
 		if (oinpp != NULL)
 			*oinpp = oinp;
@@ -1007,8 +1016,8 @@ void
 in_pcbdisconnect(struct inpcb *inp)
 {
 
-	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
 	INP_WLOCK_ASSERT(inp);
+	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
 
 	inp->inp_faddr.s_addr = INADDR_ANY;
 	inp->inp_fport = 0;
@@ -1187,19 +1196,24 @@ void
 in_pcbdrop(struct inpcb *inp)
 {
 
-	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
 	INP_WLOCK_ASSERT(inp);
 
+	/*
+	 * XXXRW: Possibly we should protect the setting of INP_DROPPED with
+	 * the hash lock...?
+	 */
 	inp->inp_flags |= INP_DROPPED;
 	if (inp->inp_flags & INP_INHASHLIST) {
 		struct inpcbport *phd = inp->inp_phd;
 
+		INP_HASH_WLOCK(inp->inp_pcbinfo);
 		LIST_REMOVE(inp, inp_hash);
 		LIST_REMOVE(inp, inp_portlist);
 		if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
 			LIST_REMOVE(phd, phd_hash);
 			free(phd, M_PCB);
 		}
+		INP_HASH_WUNLOCK(inp->inp_pcbinfo);
 		inp->inp_flags &= ~INP_INHASHLIST;
 	}
 }
@@ -1328,7 +1342,8 @@ in_pcbpurgeif0(struct inpcbinfo *pcbinfo
 }
 
 /*
- * Lookup a PCB based on the local address and port.
+ * Lookup a PCB based on the local address and port.  Caller must hold the
+ * hash lock.  No inpcb locks or references are acquired.
  */
 #define INP_LOOKUP_MAPPED_PCB_COST	3
 struct inpcb *
@@ -1346,7 +1361,7 @@ in_pcblookup_local(struct inpcbinfo *pcb
 	KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
 	    ("%s: invalid lookup flags %d", __func__, lookupflags));
 
-	INP_INFO_LOCK_ASSERT(pcbinfo);
+	INP_HASH_LOCK_ASSERT(pcbinfo);
 
 	if ((lookupflags & INPLOOKUP_WILDCARD) == 0) {
 		struct inpcbhead *head;
@@ -1450,10 +1465,12 @@ in_pcblookup_local(struct inpcbinfo *pcb
 #undef INP_LOOKUP_MAPPED_PCB_COST
 
 /*
- * Lookup PCB in hash list.
+ * Lookup PCB in hash list, using pcbinfo tables.  This variation assumes
+ * that the caller has locked the hash list, and will not perform any further
+ * locking or reference operations on either the hash list or the connection.
  */
-struct inpcb *
-in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
+static struct inpcb *
+in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr,
     u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags,
     struct ifnet *ifp)
 {
@@ -1464,7 +1481,7 @@ in_pcblookup_hash(struct inpcbinfo *pcbi
 	KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
 	    ("%s: invalid lookup flags %d", __func__, lookupflags));
 
-	INP_INFO_LOCK_ASSERT(pcbinfo);
+	INP_HASH_LOCK_ASSERT(pcbinfo);
 
 	/*
 	 * First look for an exact match.
@@ -1574,6 +1591,56 @@ in_pcblookup_hash(struct inpcbinfo *pcbi
 
 	return (NULL);
 }
+
+/*
+ * Lookup PCB in hash list, using pcbinfo tables.  This variation locks the
+ * hash list lock, and will return the inpcb locked (i.e., requires
+ * INPLOOKUP_LOCKPCB).
+ */
+static struct inpcb *
+in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
+    u_int fport, struct in_addr laddr, u_int lport, int lookupflags,
+    struct ifnet *ifp)
+{
+	struct inpcb *inp;
+
+	INP_HASH_RLOCK(pcbinfo);
+	inp = in_pcblookup_hash_locked(pcbinfo, faddr, fport, laddr, lport,
+	    (lookupflags & ~(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)), ifp);
+	if (inp != NULL) {
+		in_pcbref(inp);
+		INP_HASH_RUNLOCK(pcbinfo);
+		if (lookupflags & INPLOOKUP_WLOCKPCB) {
+			INP_WLOCK(inp);
+			if (in_pcbrele_wlocked(inp))
+				return (NULL);
+		} else if (lookupflags & INPLOOKUP_RLOCKPCB) {
+			INP_RLOCK(inp);
+			if (in_pcbrele_rlocked(inp))
+				return (NULL);
+		} else
+			panic("%s: locking bug", __func__);
+	} else
+		INP_HASH_RUNLOCK(pcbinfo);
+	return (inp);
+}
+
+/*
+ * Public inpcb lookup routines, accepting a 4-tuple.
+ */
+struct inpcb *
+in_pcblookup(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport,
+    struct in_addr laddr, u_int lport, int lookupflags, struct ifnet *ifp)
+{
+
+	KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
+	    ("%s: invalid lookup flags %d", __func__, lookupflags));
+	KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
+	    ("%s: LOCKPCB not set", __func__));
+
+	return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
+	    lookupflags, ifp));
+}
 #endif /* INET */
 
 /*
@@ -1588,8 +1655,9 @@ in_pcbinshash(struct inpcb *inp)
 	struct inpcbport *phd;
 	u_int32_t hashkey_faddr;
 
-	INP_INFO_WLOCK_ASSERT(pcbinfo);
 	INP_WLOCK_ASSERT(inp);
+	INP_HASH_WLOCK_ASSERT(pcbinfo);
+
 	KASSERT((inp->inp_flags & INP_INHASHLIST) == 0,
 	    ("in_pcbinshash: INP_INHASHLIST"));
 
@@ -1645,8 +1713,9 @@ in_pcbrehash(struct inpcb *inp)
 	struct inpcbhead *head;
 	u_int32_t hashkey_faddr;
 
-	INP_INFO_WLOCK_ASSERT(pcbinfo);
 	INP_WLOCK_ASSERT(inp);
+	INP_HASH_WLOCK_ASSERT(pcbinfo);
+
 	KASSERT(inp->inp_flags & INP_INHASHLIST,
 	    ("in_pcbrehash: !INP_INHASHLIST"));
 
@@ -1679,12 +1748,14 @@ in_pcbremlists(struct inpcb *inp)
 	if (inp->inp_flags & INP_INHASHLIST) {
 		struct inpcbport *phd = inp->inp_phd;
 
+		INP_HASH_WLOCK(pcbinfo);
 		LIST_REMOVE(inp, inp_hash);
 		LIST_REMOVE(inp, inp_portlist);
 		if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
 			LIST_REMOVE(phd, phd_hash);
 			free(phd, M_PCB);
 		}
+		INP_HASH_WUNLOCK(pcbinfo);
 		inp->inp_flags &= ~INP_INHASHLIST;
 	}
 	LIST_REMOVE(inp, inp_list);

Modified: head/sys/netinet/in_pcb.h
==============================================================================
--- head/sys/netinet/in_pcb.h	Mon May 30 09:41:38 2011	(r222487)
+++ head/sys/netinet/in_pcb.h	Mon May 30 09:43:55 2011	(r222488)
@@ -268,22 +268,22 @@ struct inpcbport {
  * Global data structure for each high-level protocol (UDP, TCP, ...) in both
  * IPv4 and IPv6.  Holds inpcb lists and information for managing them.
  *
- * Each pcbinfo is protected by ipi_lock, covering mutable global fields (such
- * as the global pcb list) and hashed lookup tables.  The lock order is:
+ * Each pcbinfo is protected by two locks: ipi_lock and ipi_hash_lock,
+ * the former covering mutable global fields (such as the global pcb list),
+ * and the latter covering the hashed lookup tables.  The lock order is:
  *
- *    ipi_lock (before) inpcb locks
+ *    ipi_lock (before) inpcb locks (before) ipi_hash_lock
  *
  * Locking key:
  *
  * (c) Constant or nearly constant after initialisation
  * (g) Locked by ipi_lock
- * (h) Read using either ipi_lock or inpcb lock; write requires both.
+ * (h) Read using either ipi_hash_lock or inpcb lock; write requires both.
  * (x) Synchronisation properties poorly defined
  */
 struct inpcbinfo {
 	/*
-	 * Global lock protecting global inpcb list, inpcb count, hash tables,
-	 * etc.
+	 * Global lock protecting global inpcb list, inpcb count, etc.
 	 */
 	struct rwlock		 ipi_lock;
 
@@ -312,17 +312,22 @@ struct inpcbinfo {
 	struct	uma_zone	*ipi_zone;		/* (c) */
 
 	/*
+	 * Global lock protecting hash lookup tables.
+	 */
+	struct rwlock		 ipi_hash_lock;
+
+	/*
 	 * Global hash of inpcbs, hashed by local and foreign addresses and
 	 * port numbers.
 	 */
-	struct inpcbhead	*ipi_hashbase;		/* (g) */
-	u_long			 ipi_hashmask;		/* (g) */
+	struct inpcbhead	*ipi_hashbase;		/* (h) */
+	u_long			 ipi_hashmask;		/* (h) */
 
 	/*
 	 * Global hash of inpcbs, hashed by only local port number.
 	 */
-	struct inpcbporthead	*ipi_porthashbase;	/* (g) */
-	u_long			 ipi_porthashmask;	/* (g) */
+	struct inpcbporthead	*ipi_porthashbase;	/* (h) */
+	u_long			 ipi_porthashmask;	/* (h) */
 
 	/*
 	 * Pointer to network stack instance
@@ -406,6 +411,18 @@ void 	inp_4tuple_get(struct inpcb *inp, 
 #define INP_INFO_WLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_lock, RA_WLOCKED)
 #define INP_INFO_UNLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_lock, RA_UNLOCKED)
 
+#define	INP_HASH_LOCK_INIT(ipi, d) \
+	rw_init_flags(&(ipi)->ipi_hash_lock, (d), 0)
+#define	INP_HASH_LOCK_DESTROY(ipi)	rw_destroy(&(ipi)->ipi_hash_lock)
+#define	INP_HASH_RLOCK(ipi)		rw_rlock(&(ipi)->ipi_hash_lock)
+#define	INP_HASH_WLOCK(ipi)		rw_wlock(&(ipi)->ipi_hash_lock)
+#define	INP_HASH_RUNLOCK(ipi)		rw_runlock(&(ipi)->ipi_hash_lock)
+#define	INP_HASH_WUNLOCK(ipi)		rw_wunlock(&(ipi)->ipi_hash_lock)
+#define	INP_HASH_LOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_hash_lock, \
+					    RA_LOCKED)
+#define	INP_HASH_WLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_hash_lock, \
+					    RA_WLOCKED)
+
 #define INP_PCBHASH(faddr, lport, fport, mask) \
 	(((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
 #define INP_PCBPORTHASH(lport, mask) \
@@ -466,7 +483,16 @@ void 	inp_4tuple_get(struct inpcb *inp, 
 #define	INP_LLE_VALID		0x00000001 /* cached lle is valid */	
 #define	INP_RT_VALID		0x00000002 /* cached rtentry is valid */
 
-#define	INPLOOKUP_WILDCARD	1
+/*
+ * Flags passed to in_pcblookup*() functions.
+ */
+#define	INPLOOKUP_WILDCARD	0x00000001	/* Allow wildcard sockets. */
+#define	INPLOOKUP_RLOCKPCB	0x00000002	/* Return inpcb read-locked. */
+#define	INPLOOKUP_WLOCKPCB	0x00000004	/* Return inpcb write-locked. */
+
+#define	INPLOOKUP_MASK	(INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB | \
+			    INPLOOKUP_WLOCKPCB)
+
 #define	sotoinpcb(so)	((struct inpcb *)(so)->so_pcb)
 #define	sotoin6pcb(so)	sotoinpcb(so) /* for KAME src sync over BSD*'s */
 
@@ -527,7 +553,7 @@ struct inpcb *
 	in_pcblookup_local(struct inpcbinfo *,
 	    struct in_addr, u_short, int, struct ucred *);
 struct inpcb *
-	in_pcblookup_hash(struct inpcbinfo *, struct in_addr, u_int,
+	in_pcblookup(struct inpcbinfo *, struct in_addr, u_int,
 	    struct in_addr, u_int, int, struct ifnet *);
 void	in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr,
 	    int, struct inpcb *(*)(struct inpcb *, int));

Modified: head/sys/netinet/ip_divert.c
==============================================================================
--- head/sys/netinet/ip_divert.c	Mon May 30 09:41:38 2011	(r222487)
+++ head/sys/netinet/ip_divert.c	Mon May 30 09:43:55 2011	(r222488)
@@ -659,9 +659,9 @@ div_pcblist(SYSCTL_HANDLER_ARGS)
 	INP_INFO_WLOCK(&V_divcbinfo);
 	for (i = 0; i < n; i++) {
 		inp = inp_list[i];
-		INP_WLOCK(inp);
-		if (!in_pcbrele(inp))
-			INP_WUNLOCK(inp);
+		INP_RLOCK(inp);
+		if (!in_pcbrele_rlocked(inp))
+			INP_RUNLOCK(inp);
 	}
 	INP_INFO_WUNLOCK(&V_divcbinfo);
 

Modified: head/sys/netinet/ipfw/ip_fw2.c
==============================================================================
--- head/sys/netinet/ipfw/ip_fw2.c	Mon May 30 09:41:38 2011	(r222487)
+++ head/sys/netinet/ipfw/ip_fw2.c	Mon May 30 09:43:55 2011	(r222488)
@@ -657,7 +657,7 @@ check_uidgid(ipfw_insn_u32 *insn, int pr
 	    (struct bsd_ucred *)uc, ugid_lookupp, ((struct mbuf *)inp)->m_skb);
 #else  /* FreeBSD */
 	struct inpcbinfo *pi;
-	int wildcard;
+	int lookupflags;
 	struct inpcb *pcb;
 	int match;
 
@@ -682,30 +682,31 @@ check_uidgid(ipfw_insn_u32 *insn, int pr
 	if (*ugid_lookupp == -1)
 		return (0);
 	if (proto == IPPROTO_TCP) {
-		wildcard = 0;
+		lookupflags = 0;
 		pi = &V_tcbinfo;
 	} else if (proto == IPPROTO_UDP) {
-		wildcard = INPLOOKUP_WILDCARD;
+		lookupflags = INPLOOKUP_WILDCARD;
 		pi = &V_udbinfo;
 	} else
 		return 0;
+	lookupflags |= INPLOOKUP_RLOCKPCB;
 	match = 0;
 	if (*ugid_lookupp == 0) {
-		INP_INFO_RLOCK(pi);
 		pcb =  (oif) ?
-			in_pcblookup_hash(pi,
+			in_pcblookup(pi,
 				dst_ip, htons(dst_port),
 				src_ip, htons(src_port),
-				wildcard, oif) :
-			in_pcblookup_hash(pi,
+				lookupflags, oif) :
+			in_pcblookup(pi,
 				src_ip, htons(src_port),
 				dst_ip, htons(dst_port),
-				wildcard, NULL);
+				lookupflags, NULL);
 		if (pcb != NULL) {
+			INP_RLOCK_ASSERT(pcb);
 			*uc = crhold(pcb->inp_cred);
 			*ugid_lookupp = 1;
+			INP_RUNLOCK(pcb);
 		}
-		INP_INFO_RUNLOCK(pi);
 		if (*ugid_lookupp == 0) {
 			/*
 			 * We tried and failed, set the variable to -1
@@ -1827,21 +1828,32 @@ do {								\
 				else
 					break;
 
+				/*
+				 * XXXRW: so_user_cookie should almost
+				 * certainly be inp_user_cookie?
+				 */
+
 				/* For incomming packet, lookup up the 
 				inpcb using the src/dest ip/port tuple */
 				if (inp == NULL) {
-					INP_INFO_RLOCK(pi);
-					inp = in_pcblookup_hash(pi, 
+					inp = in_pcblookup(pi, 
 						src_ip, htons(src_port),
 						dst_ip, htons(dst_port),
-						0, NULL);
-					INP_INFO_RUNLOCK(pi);
-				}
-				
-				if (inp && inp->inp_socket) {
-					tablearg = inp->inp_socket->so_user_cookie;
-					if (tablearg)
-						match = 1;
+						INPLOOKUP_RLOCKPCB, NULL);
+					if (inp != NULL) {
+						tablearg =
+						    inp->inp_socket->so_user_cookie;
+						if (tablearg)
+							match = 1;
+						INP_RUNLOCK(inp);
+					}
+				} else {
+					if (inp->inp_socket) {
+						tablearg =
+						    inp->inp_socket->so_user_cookie;
+						if (tablearg)
+							match = 1;
+					}
 				}
 				break;
 			}

Modified: head/sys/netinet/raw_ip.c
==============================================================================
--- head/sys/netinet/raw_ip.c	Mon May 30 09:41:38 2011	(r222487)
+++ head/sys/netinet/raw_ip.c	Mon May 30 09:43:55 2011	(r222488)
@@ -226,7 +226,7 @@ rip_append(struct inpcb *last, struct ip
 {
 	int policyfail = 0;
 
-	INP_RLOCK_ASSERT(last);
+	INP_LOCK_ASSERT(last);
 
 #ifdef IPSEC
 	/* check AH/ESP integrity. */
@@ -834,16 +834,19 @@ rip_detach(struct socket *so)
 static void
 rip_dodisconnect(struct socket *so, struct inpcb *inp)
 {
+	struct inpcbinfo *pcbinfo;
 
-	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
-	INP_WLOCK_ASSERT(inp);
-
+	pcbinfo = inp->inp_pcbinfo;
+	INP_INFO_WLOCK(pcbinfo);
+	INP_WLOCK(inp);
 	rip_delhash(inp);
 	inp->inp_faddr.s_addr = INADDR_ANY;
 	rip_inshash(inp);
 	SOCK_LOCK(so);
 	so->so_state &= ~SS_ISCONNECTED;
 	SOCK_UNLOCK(so);
+	INP_WUNLOCK(inp);
+	INP_INFO_WUNLOCK(pcbinfo);
 }
 
 static void
@@ -854,11 +857,7 @@ rip_abort(struct socket *so)
 	inp = sotoinpcb(so);
 	KASSERT(inp != NULL, ("rip_abort: inp == NULL"));
 
-	INP_INFO_WLOCK(&V_ripcbinfo);
-	INP_WLOCK(inp);
 	rip_dodisconnect(so, inp);
-	INP_WUNLOCK(inp);
-	INP_INFO_WUNLOCK(&V_ripcbinfo);
 }
 
 static void
@@ -869,11 +868,7 @@ rip_close(struct socket *so)
 	inp = sotoinpcb(so);
 	KASSERT(inp != NULL, ("rip_close: inp == NULL"));
 
-	INP_INFO_WLOCK(&V_ripcbinfo);
-	INP_WLOCK(inp);
 	rip_dodisconnect(so, inp);
-	INP_WUNLOCK(inp);
-	INP_INFO_WUNLOCK(&V_ripcbinfo);
 }
 
 static int
@@ -887,11 +882,7 @@ rip_disconnect(struct socket *so)
 	inp = sotoinpcb(so);
 	KASSERT(inp != NULL, ("rip_disconnect: inp == NULL"));
 
-	INP_INFO_WLOCK(&V_ripcbinfo);
-	INP_WLOCK(inp);
 	rip_dodisconnect(so, inp);
-	INP_WUNLOCK(inp);
-	INP_INFO_WUNLOCK(&V_ripcbinfo);
 	return (0);
 }
 
@@ -1077,9 +1068,9 @@ rip_pcblist(SYSCTL_HANDLER_ARGS)
 	INP_INFO_WLOCK(&V_ripcbinfo);
 	for (i = 0; i < n; i++) {
 		inp = inp_list[i];
-		INP_WLOCK(inp);
-		if (!in_pcbrele(inp))
-			INP_WUNLOCK(inp);
+		INP_RLOCK(inp);
+		if (!in_pcbrele_rlocked(inp))
+			INP_RUNLOCK(inp);
 	}
 	INP_INFO_WUNLOCK(&V_ripcbinfo);
 

Modified: head/sys/netinet/siftr.c
==============================================================================
--- head/sys/netinet/siftr.c	Mon May 30 09:41:38 2011	(r222487)
+++ head/sys/netinet/siftr.c	Mon May 30 09:43:55 2011	(r222488)
@@ -696,17 +696,16 @@ siftr_findinpcb(int ipver, struct ip *ip
 
 	/* We need the tcbinfo lock. */
 	INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
-	INP_INFO_RLOCK(&V_tcbinfo);
 
 	if (dir == PFIL_IN)
 		inp = (ipver == INP_IPV4 ?
-		    in_pcblookup_hash(&V_tcbinfo, ip->ip_src, sport, ip->ip_dst,
-		    dport, 0, m->m_pkthdr.rcvif)
+		    in_pcblookup(&V_tcbinfo, ip->ip_src, sport, ip->ip_dst,
+		    dport, INPLOOKUP_RLOCKPCB, m->m_pkthdr.rcvif)
 		    :
 #ifdef SIFTR_IPV6
-		    in6_pcblookup_hash(&V_tcbinfo,
+		    in6_pcblookup(&V_tcbinfo,
 		    &((struct ip6_hdr *)ip)->ip6_src, sport,
-		    &((struct ip6_hdr *)ip)->ip6_dst, dport, 0,
+		    &((struct ip6_hdr *)ip)->ip6_dst, dport, INPLOOKUP_RLOCKPCB,
 		    m->m_pkthdr.rcvif)
 #else
 		    NULL
@@ -715,13 +714,13 @@ siftr_findinpcb(int ipver, struct ip *ip
 
 	else
 		inp = (ipver == INP_IPV4 ?
-		    in_pcblookup_hash(&V_tcbinfo, ip->ip_dst, dport, ip->ip_src,
-		    sport, 0, m->m_pkthdr.rcvif)
+		    in_pcblookup(&V_tcbinfo, ip->ip_dst, dport, ip->ip_src,
+		    sport, INPLOOKUP_RLOCKPCB, m->m_pkthdr.rcvif)
 		    :
 #ifdef SIFTR_IPV6
-		    in6_pcblookup_hash(&V_tcbinfo,
+		    in6_pcblookup(&V_tcbinfo,
 		    &((struct ip6_hdr *)ip)->ip6_dst, dport,
-		    &((struct ip6_hdr *)ip)->ip6_src, sport, 0,
+		    &((struct ip6_hdr *)ip)->ip6_src, sport, INPLOOKUP_RLOCKPCB,
 		    m->m_pkthdr.rcvif)
 #else
 		    NULL
@@ -734,12 +733,7 @@ siftr_findinpcb(int ipver, struct ip *ip
 			ss->nskip_in_inpcb++;
 		else
 			ss->nskip_out_inpcb++;
-	} else {
-		/* Acquire the inpcb lock. */
-		INP_UNLOCK_ASSERT(inp);
-		INP_RLOCK(inp);
 	}
-	INP_INFO_RUNLOCK(&V_tcbinfo);
 
 	return (inp);
 }

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Mon May 30 09:41:38 2011	(r222487)
+++ head/sys/netinet/tcp_input.c	Mon May 30 09:43:55 2011	(r222488)
@@ -5,6 +5,7 @@
  *	Swinburne University of Technology, Melbourne, Australia.
  * Copyright (c) 2009-2010 Lawrence Stewart <lstewart@freebsd.org>
  * Copyright (c) 2010 The FreeBSD Foundation
+ * Copyright (c) 2010-2011 Juniper Networks, Inc.
  * All rights reserved.
  *
  * Portions of this software were developed at the Centre for Advanced Internet
@@ -16,6 +17,9 @@
  * Internet Architectures, Swinburne University of Technology, Melbourne,
  * Australia by David Hayes under sponsorship from the FreeBSD Foundation.
  *
+ * Portions of this software were developed by Robert N. M. Watson under
+ * contract to Juniper Networks, Inc.
+ *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
@@ -197,10 +201,6 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
     &VNET_NAME(tcp_autorcvbuf_max), 0,
     "Max size of automatic receive buffer");
 
-int	tcp_read_locking = 1;
-SYSCTL_INT(_net_inet_tcp, OID_AUTO, read_locking, CTLFLAG_RW,
-    &tcp_read_locking, 0, "Enable read locking strategy");
-
 VNET_DEFINE(struct inpcbhead, tcb);
 #define	tcb6	tcb  /* for KAME src sync over BSD*'s */
 VNET_DEFINE(struct inpcbinfo, tcbinfo);
@@ -591,8 +591,7 @@ tcp_input(struct mbuf *m, int off0)
 	char *s = NULL;			/* address and port logging */
 	int ti_locked;
 #define	TI_UNLOCKED	1
-#define	TI_RLOCKED	2
-#define	TI_WLOCKED	3
+#define	TI_WLOCKED	2
 
 #ifdef TCPDEBUG
 	/*
@@ -756,30 +755,25 @@ tcp_input(struct mbuf *m, int off0)
 	drop_hdrlen = off0 + off;
 
 	/*
-	 * Locate pcb for segment, which requires a lock on tcbinfo.
-	 * Optimisticaly acquire a global read lock rather than a write lock
-	 * unless header flags necessarily imply a state change.  There are
-	 * two cases where we might discover later we need a write lock
-	 * despite the flags: ACKs moving a connection out of the syncache,
-	 * and ACKs for a connection in TIMEWAIT.
+	 * Locate pcb for segment; if we're likely to add or remove a
+	 * connection then first acquire pcbinfo lock.  There are two cases
+	 * where we might discover later we need a write lock despite the
+	 * flags: ACKs moving a connection out of the syncache, and ACKs for
+	 * a connection in TIMEWAIT.
 	 */
-	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0 ||
-	    tcp_read_locking == 0) {
+	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0) {
 		INP_INFO_WLOCK(&V_tcbinfo);
 		ti_locked = TI_WLOCKED;
-	} else {
-		INP_INFO_RLOCK(&V_tcbinfo);
-		ti_locked = TI_RLOCKED;
-	}
+	} else
+		ti_locked = TI_UNLOCKED;
 
 findpcb:
 #ifdef INVARIANTS
-	if (ti_locked == TI_RLOCKED)
-		INP_INFO_RLOCK_ASSERT(&V_tcbinfo);
-	else if (ti_locked == TI_WLOCKED)
+	if (ti_locked == TI_WLOCKED) {
 		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
-	else
-		panic("%s: findpcb ti_locked %d\n", __func__, ti_locked);
+	} else {
+		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
+	}
 #endif
 
 #ifdef INET
@@ -797,20 +791,18 @@ findpcb:
 		 * Transparently forwarded. Pretend to be the destination.
 		 * already got one like this?
 		 */
-		inp = in_pcblookup_hash(&V_tcbinfo,
-					ip->ip_src, th->th_sport,
-					ip->ip_dst, th->th_dport,
-					0, m->m_pkthdr.rcvif);
+		inp = in_pcblookup(&V_tcbinfo, ip->ip_src, th->th_sport,
+		    ip->ip_dst, th->th_dport, INPLOOKUP_WLOCKPCB,
+		    m->m_pkthdr.rcvif);
 		if (!inp) {
-			/* It's new.  Try to find the ambushing socket. */
-			inp = in_pcblookup_hash(&V_tcbinfo,
-						ip->ip_src, th->th_sport,
-						next_hop->sin_addr,
-						next_hop->sin_port ?
-						    ntohs(next_hop->sin_port) :
-						    th->th_dport,
-						INPLOOKUP_WILDCARD,
-						m->m_pkthdr.rcvif);
+			/*
+			 * It's new.  Try to find the ambushing socket.
+			 */
+			inp = in_pcblookup(&V_tcbinfo, ip->ip_src,
+			    th->th_sport, next_hop->sin_addr,
+			    next_hop->sin_port ? ntohs(next_hop->sin_port) :
+			    th->th_dport, INPLOOKUP_WILDCARD |
+			    INPLOOKUP_WLOCKPCB, m->m_pkthdr.rcvif);
 		}
 		/* Remove the tag from the packet.  We don't need it anymore. */
 		m_tag_delete(m, fwd_tag);
@@ -820,21 +812,19 @@ findpcb:
 	{
 #ifdef INET6
 		if (isipv6)
-			inp = in6_pcblookup_hash(&V_tcbinfo,
-						 &ip6->ip6_src, th->th_sport,
-						 &ip6->ip6_dst, th->th_dport,
-						 INPLOOKUP_WILDCARD,
-						 m->m_pkthdr.rcvif);
+			inp = in6_pcblookup(&V_tcbinfo, &ip6->ip6_src,
+			    th->th_sport, &ip6->ip6_dst, th->th_dport,
+			    INPLOOKUP_WILDCARD | INPLOOKUP_WLOCKPCB,
+			    m->m_pkthdr.rcvif);
 #endif
 #if defined(INET) && defined(INET6)
 		else
 #endif
 #ifdef INET
-			inp = in_pcblookup_hash(&V_tcbinfo,
-						ip->ip_src, th->th_sport,
-						ip->ip_dst, th->th_dport,
-						INPLOOKUP_WILDCARD,
-						m->m_pkthdr.rcvif);
+			inp = in_pcblookup(&V_tcbinfo, ip->ip_src,
+			    th->th_sport, ip->ip_dst, th->th_dport,
+			    INPLOOKUP_WILDCARD | INPLOOKUP_WLOCKPCB,
+			    m->m_pkthdr.rcvif);
 #endif
 	}
 
@@ -865,7 +855,7 @@ findpcb:
 		rstreason = BANDLIM_RST_CLOSEDPORT;
 		goto dropwithreset;
 	}
-	INP_WLOCK(inp);
+	INP_WLOCK_ASSERT(inp);
 	if (!(inp->inp_flags & INP_HW_FLOWID)
 	    && (m->m_flags & M_FLOWID)
 	    && ((inp->inp_socket == NULL)
@@ -906,28 +896,26 @@ findpcb:
 	 * legitimate new connection attempt the old INPCB gets removed and
 	 * we can try again to find a listening socket.
 	 *
-	 * At this point, due to earlier optimism, we may hold a read lock on
-	 * the inpcbinfo, rather than a write lock.  If so, we need to
-	 * upgrade, or if that fails, acquire a reference on the inpcb, drop
-	 * all locks, acquire a global write lock, and then re-acquire the
-	 * inpcb lock.  We may at that point discover that another thread has
-	 * tried to free the inpcb, in which case we need to loop back and
-	 * try to find a new inpcb to deliver to.
+	 * At this point, due to earlier optimism, we may hold only an inpcb
+	 * lock, and not the inpcbinfo write lock.  If so, we need to try to
+	 * acquire it, or if that fails, acquire a reference on the inpcb,
+	 * drop all locks, acquire a global write lock, and then re-acquire
+	 * the inpcb lock.  We may at that point discover that another thread
+	 * has tried to free the inpcb, in which case we need to loop back
+	 * and try to find a new inpcb to deliver to.
+	 *
+	 * XXXRW: It may be time to rethink timewait locking.
 	 */
 relocked:
 	if (inp->inp_flags & INP_TIMEWAIT) {
-		KASSERT(ti_locked == TI_RLOCKED || ti_locked == TI_WLOCKED,
-		    ("%s: INP_TIMEWAIT ti_locked %d", __func__, ti_locked));
-
-		if (ti_locked == TI_RLOCKED) {
-			if (INP_INFO_TRY_UPGRADE(&V_tcbinfo) == 0) {
+		if (ti_locked == TI_UNLOCKED) {
+			if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) {
 				in_pcbref(inp);
 				INP_WUNLOCK(inp);
-				INP_INFO_RUNLOCK(&V_tcbinfo);
 				INP_INFO_WLOCK(&V_tcbinfo);
 				ti_locked = TI_WLOCKED;
 				INP_WLOCK(inp);
-				if (in_pcbrele(inp)) {
+				if (in_pcbrele_wlocked(inp)) {
 					inp = NULL;
 					goto findpcb;
 				}
@@ -975,26 +963,24 @@ relocked:
 
 	/*
 	 * We've identified a valid inpcb, but it could be that we need an
-	 * inpcbinfo write lock and have only a read lock.  In this case,
-	 * attempt to upgrade/relock using the same strategy as the TIMEWAIT
-	 * case above.  If we relock, we have to jump back to 'relocked' as
-	 * the connection might now be in TIMEWAIT.
-	 */
-	if (tp->t_state != TCPS_ESTABLISHED ||
-	    (thflags & (TH_SYN | TH_FIN | TH_RST)) != 0 ||
-	    tcp_read_locking == 0) {
-		KASSERT(ti_locked == TI_RLOCKED || ti_locked == TI_WLOCKED,
-		    ("%s: upgrade check ti_locked %d", __func__, ti_locked));
-
-		if (ti_locked == TI_RLOCKED) {
-			if (INP_INFO_TRY_UPGRADE(&V_tcbinfo) == 0) {
+	 * inpcbinfo write lock but don't hold it.  In this case, attempt to
+	 * acquire using the same strategy as the TIMEWAIT case above.  If we
+	 * relock, we have to jump back to 'relocked' as the connection might
+	 * now be in TIMEWAIT.
+	 */
+#ifdef INVARIANTS
+	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0)
+		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
+#endif
+	if (tp->t_state != TCPS_ESTABLISHED) {
+		if (ti_locked == TI_UNLOCKED) {
+			if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) {
 				in_pcbref(inp);
 				INP_WUNLOCK(inp);
-				INP_INFO_RUNLOCK(&V_tcbinfo);
 				INP_INFO_WLOCK(&V_tcbinfo);
 				ti_locked = TI_WLOCKED;
 				INP_WLOCK(inp);
-				if (in_pcbrele(inp)) {
+				if (in_pcbrele_wlocked(inp)) {
 					inp = NULL;
 					goto findpcb;
 				}
@@ -1027,13 +1013,16 @@ relocked:
 	/*
 	 * When the socket is accepting connections (the INPCB is in LISTEN
 	 * state) we look into the SYN cache if this is a new connection
-	 * attempt or the completion of a previous one.
+	 * attempt or the completion of a previous one.  Because listen
+	 * sockets are never in TCPS_ESTABLISHED, the V_tcbinfo lock will be
+	 * held in this case.
 	 */
 	if (so->so_options & SO_ACCEPTCONN) {
 		struct in_conninfo inc;
 
 		KASSERT(tp->t_state == TCPS_LISTEN, ("%s: so accepting but "
 		    "tp not listening", __func__));
+		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 
 		bzero(&inc, sizeof(inc));
 #ifdef INET6
@@ -1371,13 +1360,17 @@ relocked:
 	return;
 
 dropwithreset:
-	if (ti_locked == TI_RLOCKED)
-		INP_INFO_RUNLOCK(&V_tcbinfo);
-	else if (ti_locked == TI_WLOCKED)
+	if (ti_locked == TI_WLOCKED) {
 		INP_INFO_WUNLOCK(&V_tcbinfo);
-	else
-		panic("%s: dropwithreset ti_locked %d", __func__, ti_locked);
-	ti_locked = TI_UNLOCKED;
+		ti_locked = TI_UNLOCKED;
+	}
+#ifdef INVARIANTS
+	else {
+		KASSERT(ti_locked == TI_UNLOCKED, ("%s: dropwithreset "
+		    "ti_locked: %d", __func__, ti_locked));
+		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
+	}
+#endif
 
 	if (inp != NULL) {
 		tcp_dropwithreset(m, th, tp, tlen, rstreason);
@@ -1388,13 +1381,17 @@ dropwithreset:
 	goto drop;
 
 dropunlock:
-	if (ti_locked == TI_RLOCKED)
-		INP_INFO_RUNLOCK(&V_tcbinfo);
-	else if (ti_locked == TI_WLOCKED)
+	if (ti_locked == TI_WLOCKED) {
 		INP_INFO_WUNLOCK(&V_tcbinfo);
-	else
-		panic("%s: dropunlock ti_locked %d", __func__, ti_locked);
-	ti_locked = TI_UNLOCKED;
+		ti_locked = TI_UNLOCKED;
+	}
+#ifdef INVARIANTS
+	else {
+		KASSERT(ti_locked == TI_UNLOCKED, ("%s: dropunlock "
+		    "ti_locked: %d", __func__, ti_locked));
+		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
+	}
+#endif
 
 	if (inp != NULL)
 		INP_WUNLOCK(inp);
@@ -1449,13 +1446,13 @@ tcp_do_segment(struct mbuf *m, struct tc
 		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 	} else {
 #ifdef INVARIANTS
-		if (ti_locked == TI_RLOCKED)
-			INP_INFO_RLOCK_ASSERT(&V_tcbinfo);

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201105300943.p4U9htjI070096>