Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 4 Jun 2011 16:30:43 +0200
From:      Kristof Provost <kristof@sigsegv.be>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r222488 - in head/sys: contrib/pf/net netinet netinet/ipfw netinet6
Message-ID:  <20110604143043.GE17764@nereid>
In-Reply-To: <201105300943.p4U9htjI070096@svn.freebsd.org>
References:  <201105300943.p4U9htjI070096@svn.freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Hi,

I'm seeing a panic when I start natd, which I suspect is related to this
commit:

panic: Lock pcbinfohash not exclusively locked @
/usr/src/sys/netinet/in_pcb.c:323

Backtrace:
panic() ...
_rw_assert() at _rw_assert+0x18d
in_pcbbind() at in_pcbbind+0xf5
div_bind() at div_bind+0xb9
sobind() at sobind()+0x79
kern_bind() at kern_bind+0xde
bind() at bind()+0x41
syscallenter() at syscallenter+0x1cb
syscall() at syscall+0x4c
Xfast_syscall() at Xfast_syscall+0xdd

div_bind probably also needs to surround the call to in_pcbbind with
INP_HASHW(UN)LOCK(...)

I'm currently running 222680. I've only now seen the issue, but I've also
just now activated INVARIANTS.

Regards,
Kristof

On 2011-05-30 09:43:55 (+0000), Robert Watson <rwatson@FreeBSD.org> wrote:
> Author: rwatson
> Date: Mon May 30 09:43:55 2011
> New Revision: 222488
> URL: http://svn.freebsd.org/changeset/base/222488
> 
> Log:
>   Decompose the current single inpcbinfo lock into two locks:
>   
>   - The existing ipi_lock continues to protect the global inpcb list and
>     inpcb counter.  This lock is now relegated to a small number of
>     allocation and free operations, and occasional operations that walk
>     all connections (including, awkwardly, certain UDP multicast receive
>     operations -- something to revisit).
>   
>   - A new ipi_hash_lock protects the two inpcbinfo hash tables for
>     looking up connections and bound sockets, manipulated using new
>     INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
>     the 4-tuple address space.
>   
>   Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
>   connection locks, so may be acquired while manipulating a connection on
>   which a lock is already held, avoiding the need to acquire the inpcbinfo
>   lock preemptively when a binding change might later be required.  As a
>   result, however, lookup operations necessarily go through a reference
>   acquire while holding the lookup lock, later acquiring an inpcb lock --
>   if required.
>   
>   A new function in_pcblookup() looks up connections, and accepts flags
>   indicating how to return the inpcb.  Due to lock order changes, callers
>   no longer need acquire locks before performing a lookup: the lookup
>   routine will acquire the ipi_hash_lock as needed.  In the future, it will
>   also be able to use alternative lookup and locking strategies
>   transparently to callers, such as pcbgroup lookup.  New lookup flags are,
>   supplementing the existing INPLOOKUP_WILDCARD flag:
>   
>     INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
>     INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
>   
>   Callers must pass exactly one of these flags (for the time being).
>   
>   Some notes:
>   
>   - All protocols are updated to work within the new regime; especially,
>     TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
>     eliminated, and global hash lock hold times are dramatically reduced
>     compared to previous locking.
>   - The TCP syncache still relies on the pcbinfo lock, something that we
>     may want to revisit.
>   - Support for reverting to the FreeBSD 7.x locking strategy in TCP input
>     is no longer available -- hash lookup locks are now held only very
>     briefly during inpcb lookup, rather than for potentially extended
>     periods.  However, the pcbinfo ipi_lock will still be acquired if a
>     connection state might change such that a connection is added or
>     removed.
>   - Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
>     due to maintaining their own hash tables.
>   - The interface in6_pcblookup_hash_locked() is maintained, which allows
>     callers to acquire hash locks and perform one or more lookups atomically
>     with 4-tuple allocation: this is required only for TCPv6, as there is no
>     in6_pcbconnect_setup(), which there should be.
>   - UDPv6 locking remains significantly more conservative than UDPv4
>     locking, which relates to source address selection.  This needs
>     attention, as it likely significantly reduces parallelism in this code
>     for multithreaded socket use (such as in BIND).
>   - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
>     somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
>     is no longer sufficient.  A second check once the inpcb lock is held
>     should do the trick, keeping the general case from requiring the inpcb
>     lock for every inpcb visited.
>   - This work reminds us that we need to revisit locking of the v4/v6 flags,
>     which may be accessed lock-free both before and after this change.
>   - Right now, a single lock name is used for the pcbhash lock -- this is
>     undesirable, and probably another argument is required to take care of
>     this (or a char array name field in the pcbinfo?).
>   
>   This is not an MFC candidate for 8.x due to its impact on lookup and
>   locking semantics.  It's possible some of these issues could be worked
>   around with compatibility wrappers, if necessary.
>   
>   Reviewed by:    bz
>   Sponsored by:   Juniper Networks, Inc.
> 
> Modified:
>   head/sys/contrib/pf/net/pf.c
>   head/sys/netinet/in_pcb.c
>   head/sys/netinet/in_pcb.h
>   head/sys/netinet/ip_divert.c
>   head/sys/netinet/ipfw/ip_fw2.c
>   head/sys/netinet/raw_ip.c
>   head/sys/netinet/siftr.c
>   head/sys/netinet/tcp_input.c
>   head/sys/netinet/tcp_subr.c
>   head/sys/netinet/tcp_syncache.c
>   head/sys/netinet/tcp_timer.c
>   head/sys/netinet/tcp_usrreq.c
>   head/sys/netinet/udp_usrreq.c
>   head/sys/netinet6/in6_pcb.c
>   head/sys/netinet6/in6_pcb.h
>   head/sys/netinet6/in6_src.c
>   head/sys/netinet6/udp6_usrreq.c
> 
> Modified: head/sys/contrib/pf/net/pf.c
> ==============================================================================
> --- head/sys/contrib/pf/net/pf.c	Mon May 30 09:41:38 2011	(r222487)
> +++ head/sys/contrib/pf/net/pf.c	Mon May 30 09:43:55 2011	(r222488)
> @@ -3034,16 +3034,14 @@ pf_socket_lookup(int direction, struct p
>  #ifdef INET
>  	case AF_INET:
>  #ifdef __FreeBSD__
> -		INP_INFO_RLOCK(pi);	/* XXX LOR */
> -		inp = in_pcblookup_hash(pi, saddr->v4, sport, daddr->v4,
> -			dport, 0, NULL);
> +		inp = in_pcblookup(pi, saddr->v4, sport, daddr->v4,
> +			dport, INPLOOKUP_RLOCKPCB, NULL);
>  		if (inp == NULL) {
> -			inp = in_pcblookup_hash(pi, saddr->v4, sport,
> -			   daddr->v4, dport, INPLOOKUP_WILDCARD, NULL);
> -			if(inp == NULL) {
> -				INP_INFO_RUNLOCK(pi);
> +			inp = in_pcblookup(pi, saddr->v4, sport,
> +			   daddr->v4, dport, INPLOOKUP_WILDCARD |
> +			   INPLOOKUP_RLOCKPCB, NULL);
> +			if (inp == NULL)
>  				return (-1);
> -			}
>  		}
>  #else
>  		inp = in_pcbhashlookup(tb, saddr->v4, sport, daddr->v4, dport);
> @@ -3058,16 +3056,14 @@ pf_socket_lookup(int direction, struct p
>  #ifdef INET6
>  	case AF_INET6:
>  #ifdef __FreeBSD__
> -		INP_INFO_RLOCK(pi);
> -		inp = in6_pcblookup_hash(pi, &saddr->v6, sport,
> -			&daddr->v6, dport, 0, NULL);
> +		inp = in6_pcblookup(pi, &saddr->v6, sport,
> +			&daddr->v6, dport, INPLOOKUP_RLOCKPCB, NULL);
>  		if (inp == NULL) {
> -			inp = in6_pcblookup_hash(pi, &saddr->v6, sport,
> -			&daddr->v6, dport, INPLOOKUP_WILDCARD, NULL);
> -			if (inp == NULL) {
> -				INP_INFO_RUNLOCK(pi);
> +			inp = in6_pcblookup(pi, &saddr->v6, sport,
> +			    &daddr->v6, dport, INPLOOKUP_WILDCARD |
> +			    INPLOOKUP_RLOCKPCB, NULL);
> +			if (inp == NULL)
>  				return (-1);
> -			}
>  		}
>  #else
>  		inp = in6_pcbhashlookup(tb, &saddr->v6, sport, &daddr->v6,
> @@ -3085,9 +3081,10 @@ pf_socket_lookup(int direction, struct p
>  		return (-1);
>  	}
>  #ifdef __FreeBSD__
> +	INP_RLOCK_ASSERT(inp);
>  	pd->lookup.uid = inp->inp_cred->cr_uid;
>  	pd->lookup.gid = inp->inp_cred->cr_groups[0];
> -	INP_INFO_RUNLOCK(pi);
> +	INP_RUNLOCK(inp);
>  #else
>  	pd->lookup.uid = inp->inp_socket->so_euid;
>  	pd->lookup.gid = inp->inp_socket->so_egid;
> 
> Modified: head/sys/netinet/in_pcb.c
> ==============================================================================
> --- head/sys/netinet/in_pcb.c	Mon May 30 09:41:38 2011	(r222487)
> +++ head/sys/netinet/in_pcb.c	Mon May 30 09:43:55 2011	(r222488)
> @@ -127,6 +127,10 @@ static VNET_DEFINE(int, ipport_tcplastco
>  
>  #define	V_ipport_tcplastcount		VNET(ipport_tcplastcount)
>  
> +static struct inpcb	*in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo,
> +			    struct in_addr faddr, u_int fport_arg,
> +			    struct in_addr laddr, u_int lport_arg,
> +			    int lookupflags, struct ifnet *ifp);
>  static void	in_pcbremlists(struct inpcb *inp);
>  
>  #ifdef INET
> @@ -212,11 +216,13 @@ in_pcbinfo_init(struct inpcbinfo *pcbinf
>  {
>  
>  	INP_INFO_LOCK_INIT(pcbinfo, name);
> +	INP_HASH_LOCK_INIT(pcbinfo, "pcbinfohash");	/* XXXRW: argument? */
>  #ifdef VIMAGE
>  	pcbinfo->ipi_vnet = curvnet;
>  #endif
>  	pcbinfo->ipi_listhead = listhead;
>  	LIST_INIT(pcbinfo->ipi_listhead);
> +	pcbinfo->ipi_count = 0;
>  	pcbinfo->ipi_hashbase = hashinit(hash_nelements, M_PCB,
>  	    &pcbinfo->ipi_hashmask);
>  	pcbinfo->ipi_porthashbase = hashinit(porthash_nelements, M_PCB,
> @@ -234,10 +240,14 @@ void
>  in_pcbinfo_destroy(struct inpcbinfo *pcbinfo)
>  {
>  
> +	KASSERT(pcbinfo->ipi_count == 0,
> +	    ("%s: ipi_count = %u", __func__, pcbinfo->ipi_count));
> +
>  	hashdestroy(pcbinfo->ipi_hashbase, M_PCB, pcbinfo->ipi_hashmask);
>  	hashdestroy(pcbinfo->ipi_porthashbase, M_PCB,
>  	    pcbinfo->ipi_porthashmask);
>  	uma_zdestroy(pcbinfo->ipi_zone);
> +	INP_HASH_LOCK_DESTROY(pcbinfo);
>  	INP_INFO_LOCK_DESTROY(pcbinfo);
>  }
>  
> @@ -309,8 +319,8 @@ in_pcbbind(struct inpcb *inp, struct soc
>  {
>  	int anonport, error;
>  
> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>  	INP_WLOCK_ASSERT(inp);
> +	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
>  
>  	if (inp->inp_lport != 0 || inp->inp_laddr.s_addr != INADDR_ANY)
>  		return (EINVAL);
> @@ -351,8 +361,8 @@ in_pcb_lport(struct inpcb *inp, struct i
>  	 * Because no actual state changes occur here, a global write lock on
>  	 * the pcbinfo isn't required.
>  	 */
> -	INP_INFO_LOCK_ASSERT(pcbinfo);
>  	INP_LOCK_ASSERT(inp);
> +	INP_HASH_LOCK_ASSERT(pcbinfo);
>  
>  	if (inp->inp_flags & INP_HIGHPORT) {
>  		first = V_ipport_hifirstauto;	/* sysctl */
> @@ -473,11 +483,10 @@ in_pcbbind_setup(struct inpcb *inp, stru
>  	int error;
>  
>  	/*
> -	 * Because no actual state changes occur here, a global write lock on
> -	 * the pcbinfo isn't required.
> +	 * No state changes, so read locks are sufficient here.
>  	 */
> -	INP_INFO_LOCK_ASSERT(pcbinfo);
>  	INP_LOCK_ASSERT(inp);
> +	INP_HASH_LOCK_ASSERT(pcbinfo);
>  
>  	if (TAILQ_EMPTY(&V_in_ifaddrhead)) /* XXX broken! */
>  		return (EADDRNOTAVAIL);
> @@ -618,8 +627,8 @@ in_pcbconnect(struct inpcb *inp, struct 
>  	in_addr_t laddr, faddr;
>  	int anonport, error;
>  
> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>  	INP_WLOCK_ASSERT(inp);
> +	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
>  
>  	lport = inp->inp_lport;
>  	laddr = inp->inp_laddr.s_addr;
> @@ -907,8 +916,8 @@ in_pcbconnect_setup(struct inpcb *inp, s
>  	 * Because a global state change doesn't actually occur here, a read
>  	 * lock is sufficient.
>  	 */
> -	INP_INFO_LOCK_ASSERT(inp->inp_pcbinfo);
>  	INP_LOCK_ASSERT(inp);
> +	INP_HASH_LOCK_ASSERT(inp->inp_pcbinfo);
>  
>  	if (oinpp != NULL)
>  		*oinpp = NULL;
> @@ -983,8 +992,8 @@ in_pcbconnect_setup(struct inpcb *inp, s
>  		if (error)
>  			return (error);
>  	}
> -	oinp = in_pcblookup_hash(inp->inp_pcbinfo, faddr, fport, laddr, lport,
> -	    0, NULL);
> +	oinp = in_pcblookup_hash_locked(inp->inp_pcbinfo, faddr, fport,
> +	    laddr, lport, 0, NULL);
>  	if (oinp != NULL) {
>  		if (oinpp != NULL)
>  			*oinpp = oinp;
> @@ -1007,8 +1016,8 @@ void
>  in_pcbdisconnect(struct inpcb *inp)
>  {
>  
> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>  	INP_WLOCK_ASSERT(inp);
> +	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
>  
>  	inp->inp_faddr.s_addr = INADDR_ANY;
>  	inp->inp_fport = 0;
> @@ -1187,19 +1196,24 @@ void
>  in_pcbdrop(struct inpcb *inp)
>  {
>  
> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>  	INP_WLOCK_ASSERT(inp);
>  
> +	/*
> +	 * XXXRW: Possibly we should protect the setting of INP_DROPPED with
> +	 * the hash lock...?
> +	 */
>  	inp->inp_flags |= INP_DROPPED;
>  	if (inp->inp_flags & INP_INHASHLIST) {
>  		struct inpcbport *phd = inp->inp_phd;
>  
> +		INP_HASH_WLOCK(inp->inp_pcbinfo);
>  		LIST_REMOVE(inp, inp_hash);
>  		LIST_REMOVE(inp, inp_portlist);
>  		if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
>  			LIST_REMOVE(phd, phd_hash);
>  			free(phd, M_PCB);
>  		}
> +		INP_HASH_WUNLOCK(inp->inp_pcbinfo);
>  		inp->inp_flags &= ~INP_INHASHLIST;
>  	}
>  }
> @@ -1328,7 +1342,8 @@ in_pcbpurgeif0(struct inpcbinfo *pcbinfo
>  }
>  
>  /*
> - * Lookup a PCB based on the local address and port.
> + * Lookup a PCB based on the local address and port.  Caller must hold the
> + * hash lock.  No inpcb locks or references are acquired.
>   */
>  #define INP_LOOKUP_MAPPED_PCB_COST	3
>  struct inpcb *
> @@ -1346,7 +1361,7 @@ in_pcblookup_local(struct inpcbinfo *pcb
>  	KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
>  	    ("%s: invalid lookup flags %d", __func__, lookupflags));
>  
> -	INP_INFO_LOCK_ASSERT(pcbinfo);
> +	INP_HASH_LOCK_ASSERT(pcbinfo);
>  
>  	if ((lookupflags & INPLOOKUP_WILDCARD) == 0) {
>  		struct inpcbhead *head;
> @@ -1450,10 +1465,12 @@ in_pcblookup_local(struct inpcbinfo *pcb
>  #undef INP_LOOKUP_MAPPED_PCB_COST
>  
>  /*
> - * Lookup PCB in hash list.
> + * Lookup PCB in hash list, using pcbinfo tables.  This variation assumes
> + * that the caller has locked the hash list, and will not perform any further
> + * locking or reference operations on either the hash list or the connection.
>   */
> -struct inpcb *
> -in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
> +static struct inpcb *
> +in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr,
>      u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags,
>      struct ifnet *ifp)
>  {
> @@ -1464,7 +1481,7 @@ in_pcblookup_hash(struct inpcbinfo *pcbi
>  	KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
>  	    ("%s: invalid lookup flags %d", __func__, lookupflags));
>  
> -	INP_INFO_LOCK_ASSERT(pcbinfo);
> +	INP_HASH_LOCK_ASSERT(pcbinfo);
>  
>  	/*
>  	 * First look for an exact match.
> @@ -1574,6 +1591,56 @@ in_pcblookup_hash(struct inpcbinfo *pcbi
>  
>  	return (NULL);
>  }
> +
> +/*
> + * Lookup PCB in hash list, using pcbinfo tables.  This variation locks the
> + * hash list lock, and will return the inpcb locked (i.e., requires
> + * INPLOOKUP_LOCKPCB).
> + */
> +static struct inpcb *
> +in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
> +    u_int fport, struct in_addr laddr, u_int lport, int lookupflags,
> +    struct ifnet *ifp)
> +{
> +	struct inpcb *inp;
> +
> +	INP_HASH_RLOCK(pcbinfo);
> +	inp = in_pcblookup_hash_locked(pcbinfo, faddr, fport, laddr, lport,
> +	    (lookupflags & ~(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)), ifp);
> +	if (inp != NULL) {
> +		in_pcbref(inp);
> +		INP_HASH_RUNLOCK(pcbinfo);
> +		if (lookupflags & INPLOOKUP_WLOCKPCB) {
> +			INP_WLOCK(inp);
> +			if (in_pcbrele_wlocked(inp))
> +				return (NULL);
> +		} else if (lookupflags & INPLOOKUP_RLOCKPCB) {
> +			INP_RLOCK(inp);
> +			if (in_pcbrele_rlocked(inp))
> +				return (NULL);
> +		} else
> +			panic("%s: locking bug", __func__);
> +	} else
> +		INP_HASH_RUNLOCK(pcbinfo);
> +	return (inp);
> +}
> +
> +/*
> + * Public inpcb lookup routines, accepting a 4-tuple.
> + */
> +struct inpcb *
> +in_pcblookup(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport,
> +    struct in_addr laddr, u_int lport, int lookupflags, struct ifnet *ifp)
> +{
> +
> +	KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
> +	    ("%s: invalid lookup flags %d", __func__, lookupflags));
> +	KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
> +	    ("%s: LOCKPCB not set", __func__));
> +
> +	return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
> +	    lookupflags, ifp));
> +}
>  #endif /* INET */
>  
>  /*
> @@ -1588,8 +1655,9 @@ in_pcbinshash(struct inpcb *inp)
>  	struct inpcbport *phd;
>  	u_int32_t hashkey_faddr;
>  
> -	INP_INFO_WLOCK_ASSERT(pcbinfo);
>  	INP_WLOCK_ASSERT(inp);
> +	INP_HASH_WLOCK_ASSERT(pcbinfo);
> +
>  	KASSERT((inp->inp_flags & INP_INHASHLIST) == 0,
>  	    ("in_pcbinshash: INP_INHASHLIST"));
>  
> @@ -1645,8 +1713,9 @@ in_pcbrehash(struct inpcb *inp)
>  	struct inpcbhead *head;
>  	u_int32_t hashkey_faddr;
>  
> -	INP_INFO_WLOCK_ASSERT(pcbinfo);
>  	INP_WLOCK_ASSERT(inp);
> +	INP_HASH_WLOCK_ASSERT(pcbinfo);
> +
>  	KASSERT(inp->inp_flags & INP_INHASHLIST,
>  	    ("in_pcbrehash: !INP_INHASHLIST"));
>  
> @@ -1679,12 +1748,14 @@ in_pcbremlists(struct inpcb *inp)
>  	if (inp->inp_flags & INP_INHASHLIST) {
>  		struct inpcbport *phd = inp->inp_phd;
>  
> +		INP_HASH_WLOCK(pcbinfo);
>  		LIST_REMOVE(inp, inp_hash);
>  		LIST_REMOVE(inp, inp_portlist);
>  		if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
>  			LIST_REMOVE(phd, phd_hash);
>  			free(phd, M_PCB);
>  		}
> +		INP_HASH_WUNLOCK(pcbinfo);
>  		inp->inp_flags &= ~INP_INHASHLIST;
>  	}
>  	LIST_REMOVE(inp, inp_list);
> 
> Modified: head/sys/netinet/in_pcb.h
> ==============================================================================
> --- head/sys/netinet/in_pcb.h	Mon May 30 09:41:38 2011	(r222487)
> +++ head/sys/netinet/in_pcb.h	Mon May 30 09:43:55 2011	(r222488)
> @@ -268,22 +268,22 @@ struct inpcbport {
>   * Global data structure for each high-level protocol (UDP, TCP, ...) in both
>   * IPv4 and IPv6.  Holds inpcb lists and information for managing them.
>   *
> - * Each pcbinfo is protected by ipi_lock, covering mutable global fields (such
> - * as the global pcb list) and hashed lookup tables.  The lock order is:
> + * Each pcbinfo is protected by two locks: ipi_lock and ipi_hash_lock,
> + * the former covering mutable global fields (such as the global pcb list),
> + * and the latter covering the hashed lookup tables.  The lock order is:
>   *
> - *    ipi_lock (before) inpcb locks
> + *    ipi_lock (before) inpcb locks (before) ipi_hash_lock
>   *
>   * Locking key:
>   *
>   * (c) Constant or nearly constant after initialisation
>   * (g) Locked by ipi_lock
> - * (h) Read using either ipi_lock or inpcb lock; write requires both.
> + * (h) Read using either ipi_hash_lock or inpcb lock; write requires both.
>   * (x) Synchronisation properties poorly defined
>   */
>  struct inpcbinfo {
>  	/*
> -	 * Global lock protecting global inpcb list, inpcb count, hash tables,
> -	 * etc.
> +	 * Global lock protecting global inpcb list, inpcb count, etc.
>  	 */
>  	struct rwlock		 ipi_lock;
>  
> @@ -312,17 +312,22 @@ struct inpcbinfo {
>  	struct	uma_zone	*ipi_zone;		/* (c) */
>  
>  	/*
> +	 * Global lock protecting hash lookup tables.
> +	 */
> +	struct rwlock		 ipi_hash_lock;
> +
> +	/*
>  	 * Global hash of inpcbs, hashed by local and foreign addresses and
>  	 * port numbers.
>  	 */
> -	struct inpcbhead	*ipi_hashbase;		/* (g) */
> -	u_long			 ipi_hashmask;		/* (g) */
> +	struct inpcbhead	*ipi_hashbase;		/* (h) */
> +	u_long			 ipi_hashmask;		/* (h) */
>  
>  	/*
>  	 * Global hash of inpcbs, hashed by only local port number.
>  	 */
> -	struct inpcbporthead	*ipi_porthashbase;	/* (g) */
> -	u_long			 ipi_porthashmask;	/* (g) */
> +	struct inpcbporthead	*ipi_porthashbase;	/* (h) */
> +	u_long			 ipi_porthashmask;	/* (h) */
>  
>  	/*
>  	 * Pointer to network stack instance
> @@ -406,6 +411,18 @@ void 	inp_4tuple_get(struct inpcb *inp, 
>  #define INP_INFO_WLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_lock, RA_WLOCKED)
>  #define INP_INFO_UNLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_lock, RA_UNLOCKED)
>  
> +#define	INP_HASH_LOCK_INIT(ipi, d) \
> +	rw_init_flags(&(ipi)->ipi_hash_lock, (d), 0)
> +#define	INP_HASH_LOCK_DESTROY(ipi)	rw_destroy(&(ipi)->ipi_hash_lock)
> +#define	INP_HASH_RLOCK(ipi)		rw_rlock(&(ipi)->ipi_hash_lock)
> +#define	INP_HASH_WLOCK(ipi)		rw_wlock(&(ipi)->ipi_hash_lock)
> +#define	INP_HASH_RUNLOCK(ipi)		rw_runlock(&(ipi)->ipi_hash_lock)
> +#define	INP_HASH_WUNLOCK(ipi)		rw_wunlock(&(ipi)->ipi_hash_lock)
> +#define	INP_HASH_LOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_hash_lock, \
> +					    RA_LOCKED)
> +#define	INP_HASH_WLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_hash_lock, \
> +					    RA_WLOCKED)
> +
>  #define INP_PCBHASH(faddr, lport, fport, mask) \
>  	(((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
>  #define INP_PCBPORTHASH(lport, mask) \
> @@ -466,7 +483,16 @@ void 	inp_4tuple_get(struct inpcb *inp, 
>  #define	INP_LLE_VALID		0x00000001 /* cached lle is valid */	
>  #define	INP_RT_VALID		0x00000002 /* cached rtentry is valid */
>  
> -#define	INPLOOKUP_WILDCARD	1
> +/*
> + * Flags passed to in_pcblookup*() functions.
> + */
> +#define	INPLOOKUP_WILDCARD	0x00000001	/* Allow wildcard sockets. */
> +#define	INPLOOKUP_RLOCKPCB	0x00000002	/* Return inpcb read-locked. */
> +#define	INPLOOKUP_WLOCKPCB	0x00000004	/* Return inpcb write-locked. */
> +
> +#define	INPLOOKUP_MASK	(INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB | \
> +			    INPLOOKUP_WLOCKPCB)
> +
>  #define	sotoinpcb(so)	((struct inpcb *)(so)->so_pcb)
>  #define	sotoin6pcb(so)	sotoinpcb(so) /* for KAME src sync over BSD*'s */
>  
> @@ -527,7 +553,7 @@ struct inpcb *
>  	in_pcblookup_local(struct inpcbinfo *,
>  	    struct in_addr, u_short, int, struct ucred *);
>  struct inpcb *
> -	in_pcblookup_hash(struct inpcbinfo *, struct in_addr, u_int,
> +	in_pcblookup(struct inpcbinfo *, struct in_addr, u_int,
>  	    struct in_addr, u_int, int, struct ifnet *);
>  void	in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr,
>  	    int, struct inpcb *(*)(struct inpcb *, int));
> 
> Modified: head/sys/netinet/ip_divert.c
> ==============================================================================
> --- head/sys/netinet/ip_divert.c	Mon May 30 09:41:38 2011	(r222487)
> +++ head/sys/netinet/ip_divert.c	Mon May 30 09:43:55 2011	(r222488)
> @@ -659,9 +659,9 @@ div_pcblist(SYSCTL_HANDLER_ARGS)
>  	INP_INFO_WLOCK(&V_divcbinfo);
>  	for (i = 0; i < n; i++) {
>  		inp = inp_list[i];
> -		INP_WLOCK(inp);
> -		if (!in_pcbrele(inp))
> -			INP_WUNLOCK(inp);
> +		INP_RLOCK(inp);
> +		if (!in_pcbrele_rlocked(inp))
> +			INP_RUNLOCK(inp);
>  	}
>  	INP_INFO_WUNLOCK(&V_divcbinfo);
>  
> 
> Modified: head/sys/netinet/ipfw/ip_fw2.c
> ==============================================================================
> --- head/sys/netinet/ipfw/ip_fw2.c	Mon May 30 09:41:38 2011	(r222487)
> +++ head/sys/netinet/ipfw/ip_fw2.c	Mon May 30 09:43:55 2011	(r222488)
> @@ -657,7 +657,7 @@ check_uidgid(ipfw_insn_u32 *insn, int pr
>  	    (struct bsd_ucred *)uc, ugid_lookupp, ((struct mbuf *)inp)->m_skb);
>  #else  /* FreeBSD */
>  	struct inpcbinfo *pi;
> -	int wildcard;
> +	int lookupflags;
>  	struct inpcb *pcb;
>  	int match;
>  
> @@ -682,30 +682,31 @@ check_uidgid(ipfw_insn_u32 *insn, int pr
>  	if (*ugid_lookupp == -1)
>  		return (0);
>  	if (proto == IPPROTO_TCP) {
> -		wildcard = 0;
> +		lookupflags = 0;
>  		pi = &V_tcbinfo;
>  	} else if (proto == IPPROTO_UDP) {
> -		wildcard = INPLOOKUP_WILDCARD;
> +		lookupflags = INPLOOKUP_WILDCARD;
>  		pi = &V_udbinfo;
>  	} else
>  		return 0;
> +	lookupflags |= INPLOOKUP_RLOCKPCB;
>  	match = 0;
>  	if (*ugid_lookupp == 0) {
> -		INP_INFO_RLOCK(pi);
>  		pcb =  (oif) ?
> -			in_pcblookup_hash(pi,
> +			in_pcblookup(pi,
>  				dst_ip, htons(dst_port),
>  				src_ip, htons(src_port),
> -				wildcard, oif) :
> -			in_pcblookup_hash(pi,
> +				lookupflags, oif) :
> +			in_pcblookup(pi,
>  				src_ip, htons(src_port),
>  				dst_ip, htons(dst_port),
> -				wildcard, NULL);
> +				lookupflags, NULL);
>  		if (pcb != NULL) {
> +			INP_RLOCK_ASSERT(pcb);
>  			*uc = crhold(pcb->inp_cred);
>  			*ugid_lookupp = 1;
> +			INP_RUNLOCK(pcb);
>  		}
> -		INP_INFO_RUNLOCK(pi);
>  		if (*ugid_lookupp == 0) {
>  			/*
>  			 * We tried and failed, set the variable to -1
> @@ -1827,21 +1828,32 @@ do {								\
>  				else
>  					break;
>  
> +				/*
> +				 * XXXRW: so_user_cookie should almost
> +				 * certainly be inp_user_cookie?
> +				 */
> +
>  				/* For incomming packet, lookup up the 
>  				inpcb using the src/dest ip/port tuple */
>  				if (inp == NULL) {
> -					INP_INFO_RLOCK(pi);
> -					inp = in_pcblookup_hash(pi, 
> +					inp = in_pcblookup(pi, 
>  						src_ip, htons(src_port),
>  						dst_ip, htons(dst_port),
> -						0, NULL);
> -					INP_INFO_RUNLOCK(pi);
> -				}
> -				
> -				if (inp && inp->inp_socket) {
> -					tablearg = inp->inp_socket->so_user_cookie;
> -					if (tablearg)
> -						match = 1;
> +						INPLOOKUP_RLOCKPCB, NULL);
> +					if (inp != NULL) {
> +						tablearg =
> +						    inp->inp_socket->so_user_cookie;
> +						if (tablearg)
> +							match = 1;
> +						INP_RUNLOCK(inp);
> +					}
> +				} else {
> +					if (inp->inp_socket) {
> +						tablearg =
> +						    inp->inp_socket->so_user_cookie;
> +						if (tablearg)
> +							match = 1;
> +					}
>  				}
>  				break;
>  			}
> 
> Modified: head/sys/netinet/raw_ip.c
> ==============================================================================
> --- head/sys/netinet/raw_ip.c	Mon May 30 09:41:38 2011	(r222487)
> +++ head/sys/netinet/raw_ip.c	Mon May 30 09:43:55 2011	(r222488)
> @@ -226,7 +226,7 @@ rip_append(struct inpcb *last, struct ip
>  {
>  	int policyfail = 0;
>  
> -	INP_RLOCK_ASSERT(last);
> +	INP_LOCK_ASSERT(last);
>  
>  #ifdef IPSEC
>  	/* check AH/ESP integrity. */
> @@ -834,16 +834,19 @@ rip_detach(struct socket *so)
>  static void
>  rip_dodisconnect(struct socket *so, struct inpcb *inp)
>  {
> +	struct inpcbinfo *pcbinfo;
>  
> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
> -	INP_WLOCK_ASSERT(inp);
> -
> +	pcbinfo = inp->inp_pcbinfo;
> +	INP_INFO_WLOCK(pcbinfo);
> +	INP_WLOCK(inp);
>  	rip_delhash(inp);
>  	inp->inp_faddr.s_addr = INADDR_ANY;
>  	rip_inshash(inp);
>  	SOCK_LOCK(so);
>  	so->so_state &= ~SS_ISCONNECTED;
>  	SOCK_UNLOCK(so);
> +	INP_WUNLOCK(inp);
> +	INP_INFO_WUNLOCK(pcbinfo);
>  }
>  
>  static void
> @@ -854,11 +857,7 @@ rip_abort(struct socket *so)
>  	inp = sotoinpcb(so);
>  	KASSERT(inp != NULL, ("rip_abort: inp == NULL"));
>  
> -	INP_INFO_WLOCK(&V_ripcbinfo);
> -	INP_WLOCK(inp);
>  	rip_dodisconnect(so, inp);
> -	INP_WUNLOCK(inp);
> -	INP_INFO_WUNLOCK(&V_ripcbinfo);
>  }
>  
>  static void
> @@ -869,11 +868,7 @@ rip_close(struct socket *so)
>  	inp = sotoinpcb(so);
>  	KASSERT(inp != NULL, ("rip_close: inp == NULL"));
>  
> -	INP_INFO_WLOCK(&V_ripcbinfo);
> -	INP_WLOCK(inp);
>  	rip_dodisconnect(so, inp);
> -	INP_WUNLOCK(inp);
> -	INP_INFO_WUNLOCK(&V_ripcbinfo);
>  }
>  
>  static int
> @@ -887,11 +882,7 @@ rip_disconnect(struct socket *so)
>  	inp = sotoinpcb(so);
>  	KASSERT(inp != NULL, ("rip_disconnect: inp == NULL"));
>  
> -	INP_INFO_WLOCK(&V_ripcbinfo);
> -	INP_WLOCK(inp);
>  	rip_dodisconnect(so, inp);
> -	INP_WUNLOCK(inp);
> -	INP_INFO_WUNLOCK(&V_ripcbinfo);
>  	return (0);
>  }
>  
> @@ -1077,9 +1068,9 @@ rip_pcblist(SYSCTL_HANDLER_ARGS)
>  	INP_INFO_WLOCK(&V_ripcbinfo);
>  	for (i = 0; i < n; i++) {
>  		inp = inp_list[i];
> -		INP_WLOCK(inp);
> -		if (!in_pcbrele(inp))
> -			INP_WUNLOCK(inp);
> +		INP_RLOCK(inp);
> +		if (!in_pcbrele_rlocked(inp))
> +			INP_RUNLOCK(inp);
>  	}
>  	INP_INFO_WUNLOCK(&V_ripcbinfo);
>  
> 
> Modified: head/sys/netinet/siftr.c
> ==============================================================================
> --- head/sys/netinet/siftr.c	Mon May 30 09:41:38 2011	(r222487)
> +++ head/sys/netinet/siftr.c	Mon May 30 09:43:55 2011	(r222488)
> @@ -696,17 +696,16 @@ siftr_findinpcb(int ipver, struct ip *ip
>  
>  	/* We need the tcbinfo lock. */
>  	INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
> -	INP_INFO_RLOCK(&V_tcbinfo);
>  
>  	if (dir == PFIL_IN)
>  		inp = (ipver == INP_IPV4 ?
> -		    in_pcblookup_hash(&V_tcbinfo, ip->ip_src, sport, ip->ip_dst,
> -		    dport, 0, m->m_pkthdr.rcvif)
> +		    in_pcblookup(&V_tcbinfo, ip->ip_src, sport, ip->ip_dst,
> +		    dport, INPLOOKUP_RLOCKPCB, m->m_pkthdr.rcvif)
>  		    :
>  #ifdef SIFTR_IPV6
> -		    in6_pcblookup_hash(&V_tcbinfo,
> +		    in6_pcblookup(&V_tcbinfo,
>  		    &((struct ip6_hdr *)ip)->ip6_src, sport,
> -		    &((struct ip6_hdr *)ip)->ip6_dst, dport, 0,
> +		    &((struct ip6_hdr *)ip)->ip6_dst, dport, INPLOOKUP_RLOCKPCB,
>  		    m->m_pkthdr.rcvif)
>  #else
>  		    NULL
> @@ -715,13 +714,13 @@ siftr_findinpcb(int ipver, struct ip *ip
>  
>  	else
>  		inp = (ipver == INP_IPV4 ?
> -		    in_pcblookup_hash(&V_tcbinfo, ip->ip_dst, dport, ip->ip_src,
> -		    sport, 0, m->m_pkthdr.rcvif)
> +		    in_pcblookup(&V_tcbinfo, ip->ip_dst, dport, ip->ip_src,
> +		    sport, INPLOOKUP_RLOCKPCB, m->m_pkthdr.rcvif)
>  		    :
>  #ifdef SIFTR_IPV6
> -		    in6_pcblookup_hash(&V_tcbinfo,
> +		    in6_pcblookup(&V_tcbinfo,
>  		    &((struct ip6_hdr *)ip)->ip6_dst, dport,
> -		    &((struct ip6_hdr *)ip)->ip6_src, sport, 0,
> +		    &((struct ip6_hdr *)ip)->ip6_src, sport, INPLOOKUP_RLOCKPCB,
>  		    m->m_pkthdr.rcvif)
>  #else
>  		    NULL
> @@ -734,12 +733,7 @@ siftr_findinpcb(int ipver, struct ip *ip
>  			ss->nskip_in_inpcb++;
>  		else
>  			ss->nskip_out_inpcb++;
> -	} else {
> -		/* Acquire the inpcb lock. */
> -		INP_UNLOCK_ASSERT(inp);
> -		INP_RLOCK(inp);
>  	}
> -	INP_INFO_RUNLOCK(&V_tcbinfo);
>  
>  	return (inp);
>  }
> 
> Modified: head/sys/netinet/tcp_input.c
> ==============================================================================
> --- head/sys/netinet/tcp_input.c	Mon May 30 09:41:38 2011	(r222487)
> +++ head/sys/netinet/tcp_input.c	Mon May 30 09:43:55 2011	(r222488)
> @@ -5,6 +5,7 @@
>   *	Swinburne University of Technology, Melbourne, Australia.
>   * Copyright (c) 2009-2010 Lawrence Stewart <lstewart@freebsd.org>
>   * Copyright (c) 2010 The FreeBSD Foundation
> + * Copyright (c) 2010-2011 Juniper Networks, Inc.
>   * All rights reserved.
>   *
>   * Portions of this software were developed at the Centre for Advanced Internet
> @@ -16,6 +17,9 @@
>   * Internet Architectures, Swinburne University of Technology, Melbourne,
>   * Australia by David Hayes under sponsorship from the FreeBSD Foundation.
>   *
> + * Portions of this software were developed by Robert N. M. Watson under
> + * contract to Juniper Networks, Inc.
> + *
>   * Redistribution and use in source and binary forms, with or without
>   * modification, are permitted provided that the following conditions
>   * are met:
> @@ -197,10 +201,6 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
>      &VNET_NAME(tcp_autorcvbuf_max), 0,
>      "Max size of automatic receive buffer");
>  
> -int	tcp_read_locking = 1;
> -SYSCTL_INT(_net_inet_tcp, OID_AUTO, read_locking, CTLFLAG_RW,
> -    &tcp_read_locking, 0, "Enable read locking strategy");
> -
>  VNET_DEFINE(struct inpcbhead, tcb);
>  #define	tcb6	tcb  /* for KAME src sync over BSD*'s */
>  VNET_DEFINE(struct inpcbinfo, tcbinfo);
> @@ -591,8 +591,7 @@ tcp_input(struct mbuf *m, int off0)
>  	char *s = NULL;			/* address and port logging */
>  	int ti_locked;
>  #define	TI_UNLOCKED	1
> -#define	TI_RLOCKED	2
> -#define	TI_WLOCKED	3
> +#define	TI_WLOCKED	2
>  
>  #ifdef TCPDEBUG
>  	/*
> @@ -756,30 +755,25 @@ tcp_input(struct mbuf *m, int off0)
>  	drop_hdrlen = off0 + off;
>  
>  	/*
> -	 * Locate pcb for segment, which requires a lock on tcbinfo.
> -	 * Optimisticaly acquire a global read lock rather than a write lock
> -	 * unless header flags necessarily imply a state change.  There are
> -	 * two cases where we might discover later we need a write lock
> -	 * despite the flags: ACKs moving a connection out of the syncache,
> -	 * and ACKs for a connection in TIMEWAIT.
> +	 * Locate pcb for segment; if we're likely to add or remove a
> +	 * connection then first acquire pcbinfo lock.  There are two cases
> +	 * where we might discover later we need a write lock despite the
> +	 * flags: ACKs moving a connection out of the syncache, and ACKs for
> +	 * a connection in TIMEWAIT.
>  	 */
> -	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0 ||
> -	    tcp_read_locking == 0) {
> +	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0) {
>  		INP_INFO_WLOCK(&V_tcbinfo);
>  		ti_locked = TI_WLOCKED;
> -	} else {
> -		INP_INFO_RLOCK(&V_tcbinfo);
> -		ti_locked = TI_RLOCKED;
> -	}
> +	} else
> +		ti_locked = TI_UNLOCKED;
>  
>  findpcb:
>  #ifdef INVARIANTS
> -	if (ti_locked == TI_RLOCKED)
> -		INP_INFO_RLOCK_ASSERT(&V_tcbinfo);
> -	else if (ti_locked == TI_WLOCKED)
> +	if (ti_locked == TI_WLOCKED) {
>  		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
> -	else
> -		panic("%s: findpcb ti_locked %d\n", __func__, ti_locked);
> +	} else {
> +		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
> +	}
>  #endif
>  
>  #ifdef INET
> @@ -797,20 +791,18 @@ findpcb:
>  		 * Transparently forwarded. Pretend to be the destination.
>  		 * already got one like this?
>  		 */
> -		inp = in_pcblookup_hash(&V_tcbinfo,
> -					ip->ip_src, th->th_sport,
> -					ip->ip_dst, th->th_dport,
> -					0, m->m_pkthdr.rcvif);
> +		inp = in_pcblookup(&V_tcbinfo, ip->ip_src, th->th_sport,
> +		    ip->ip_dst, th->th_dport, INPLOOKUP_WLOCKPCB,
> +		    m->m_pkthdr.rcvif);
>  		if (!inp) {
> -			/* It's new.  Try to find the ambushing socket. */
> -			inp = in_pcblookup_hash(&V_tcbinfo,
> -						ip->ip_src, th->th_sport,
> -						next_hop->sin_addr,
> -						next_hop->sin_port ?
> -						    ntohs(next_hop->sin_port) :
> -						    th->th_dport,
> -						INPLOOKUP_WILDCARD,
> -						m->m_pkthdr.rcvif);
> +			/*
> +			 * It's new.  Try to find the ambushing socket.
> +			 */
> +			inp = in_pcblookup(&V_tcbinfo, ip->ip_src,
> +			    th->th_sport, next_hop->sin_addr,
> +			    next_hop->sin_port ? ntohs(next_hop->sin_port) :
> +			    th->th_dport, INPLOOKUP_WILDCARD |
> +			    INPLOOKUP_WLOCKPCB, m->m_pkthdr.rcvif);
>  		}
>  		/* Remove the tag from the packet.  We don't need it anymore. */
>  		m_tag_delete(m, fwd_tag);
> @@ -820,21 +812,19 @@ findpcb:
>  	{
>  #ifdef INET6
>  		if (isipv6)
> -			inp = in6_pcblookup_hash(&V_tcbinfo,
> -						 &ip6->ip6_src, th->th_sport,
> -						 &ip6->ip6_dst, th->th_dport,
> -						 INPLOOKUP_WILDCARD,
> -						 m->m_pkthdr.rcvif);
> +			inp = in6_pcblookup(&V_tcbinfo, &ip6->ip6_src,
> +			    th->th_sport, &ip6->ip6_dst, th->th_dport,
> +			    INPLOOKUP_WILDCARD | INPLOOKUP_WLOCKPCB,
> +			    m->m_pkthdr.rcvif);
>  #endif
>  #if defined(INET) && defined(INET6)
>  		else
>  #endif
>  #ifdef INET
> -			inp = in_pcblookup_hash(&V_tcbinfo,
> -						ip->ip_src, th->th_sport,
> -						ip->ip_dst, th->th_dport,
> -						INPLOOKUP_WILDCARD,
> -						m->m_pkthdr.rcvif);
> +			inp = in_pcblookup(&V_tcbinfo, ip->ip_src,
> +			    th->th_sport, ip->ip_dst, th->th_dport,
> +			    INPLOOKUP_WILDCARD | INPLOOKUP_WLOCKPCB,
> +			    m->m_pkthdr.rcvif);
>  #endif
>  	}
>  
> @@ -865,7 +855,7 @@ findpcb:
>  		rstreason = BANDLIM_RST_CLOSEDPORT;
>  		goto dropwithreset;
>  	}
> -	INP_WLOCK(inp);
> +	INP_WLOCK_ASSERT(inp);
>  	if (!(inp->inp_flags & INP_HW_FLOWID)
>  	    && (m->m_flags & M_FLOWID)
>  	    && ((inp->inp_socket == NULL)
> @@ -906,28 +896,26 @@ findpcb:
>  	 * legitimate new connection attempt the old INPCB gets removed and
>  	 * we can try again to find a listening socket.
>  	 *
> -	 * At this point, due to earlier optimism, we may hold a read lock on
> -	 * the inpcbinfo, rather than a write lock.  If so, we need to
> -	 * upgrade, or if that fails, acquire a reference on the inpcb, drop
> -	 * all locks, acquire a global write lock, and then re-acquire the
> -	 * inpcb lock.  We may at that point discover that another thread has
> -	 * tried to free the inpcb, in which case we need to loop back and
> -	 * try to find a new inpcb to deliver to.
> +	 * At this point, due to earlier optimism, we may hold only an inpcb
> +	 * lock, and not the inpcbinfo write lock.  If so, we need to try to
> +	 * acquire it, or if that fails, acquire a reference on the inpcb,
> +	 * drop all locks, acquire a global write lock, and then re-acquire
> +	 * the inpcb lock.  We may at that point discover that another thread
> +	 * has tried to free the inpcb, in which case we need to loop back
> +	 * and try to find a new inpcb to deliver to.
> +	 *
> +	 * XXXRW: It may be time to rethink timewait locking.
>  	 */
>  relocked:
>  	if (inp->inp_flags & INP_TIMEWAIT) {
> -		KASSERT(ti_locked == TI_RLOCKED || ti_locked == TI_WLOCKED,
> -		    ("%s: INP_TIMEWAIT ti_locked %d", __func__, ti_locked));
> -
> -		if (ti_locked == TI_RLOCKED) {
> -			if (INP_INFO_TRY_UPGRADE(&V_tcbinfo) == 0) {
> +		if (ti_locked == TI_UNLOCKED) {
> +			if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) {
>  				in_pcbref(inp);
>  				INP_WUNLOCK(inp);
> -				INP_INFO_RUNLOCK(&V_tcbinfo);
>  				INP_INFO_WLOCK(&V_tcbinfo);
>  				ti_locked = TI_WLOCKED;
>  				INP_WLOCK(inp);
> -				if (in_pcbrele(inp)) {
> +				if (in_pcbrele_wlocked(inp)) {
>  					inp = NULL;
>  					goto findpcb;
>  				}
> @@ -975,26 +963,24 @@ relocked:
>  
>  	/*
>  	 * We've identified a valid inpcb, but it could be that we need an
> -	 * inpcbinfo write lock and have only a read lock.  In this case,
> -	 * attempt to upgrade/relock using the same strategy as the TIMEWAIT
> -	 * case above.  If we relock, we have to jump back to 'relocked' as
> -	 * the connection might now be in TIMEWAIT.
> -	 */
> -	if (tp->t_state != TCPS_ESTABLISHED ||
> -	    (thflags & (TH_SYN | TH_FIN | TH_RST)) != 0 ||
> -	    tcp_read_locking == 0) {
> -		KASSERT(ti_locked == TI_RLOCKED || ti_locked == TI_WLOCKED,
> -		    ("%s: upgrade check ti_locked %d", __func__, ti_locked));
> -
> -		if (ti_locked == TI_RLOCKED) {
> -			if (INP_INFO_TRY_UPGRADE(&V_tcbinfo) == 0) {
> +	 * inpcbinfo write lock but don't hold it.  In this case, attempt to
> +	 * acquire using the same strategy as the TIMEWAIT case above.  If we
> +	 * relock, we have to jump back to 'relocked' as the connection might
> +	 * now be in TIMEWAIT.
> +	 */
> +#ifdef INVARIANTS
> +	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0)
> +		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
> +#endif
> +	if (tp->t_state != TCPS_ESTABLISHED) {
> +		if (ti_locked == TI_UNLOCKED) {
> +			if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) {
>  				in_pcbref(inp);
>  				INP_WUNLOCK(inp);
> -				INP_INFO_RUNLOCK(&V_tcbinfo);
>  				INP_INFO_WLOCK(&V_tcbinfo);
>  				ti_locked = TI_WLOCKED;
>  				INP_WLOCK(inp);
> -				if (in_pcbrele(inp)) {
> +				if (in_pcbrele_wlocked(inp)) {
>  					inp = NULL;
>  					goto findpcb;
>  				}
> @@ -1027,13 +1013,16 @@ relocked:
>  	/*
>  	 * When the socket is accepting connections (the INPCB is in LISTEN
>  	 * state) we look into the SYN cache if this is a new connection
> -	 * attempt or the completion of a previous one.
> +	 * attempt or the completion of a previous one.  Because listen
> +	 * sockets are never in TCPS_ESTABLISHED, the V_tcbinfo lock will be
> +	 * held in this case.
>  	 */
>  	if (so->so_options & SO_ACCEPTCONN) {
>  		struct in_conninfo inc;
>  
>  		KASSERT(tp->t_state == TCPS_LISTEN, ("%s: so accepting but "
>  		    "tp not listening", __func__));
> +		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
>  
>  		bzero(&inc, sizeof(inc));
>  #ifdef INET6
> @@ -1371,13 +1360,17 @@ relocked:
>  	return;
>  
>  dropwithreset:
> -	if (ti_locked == TI_RLOCKED)
> -		INP_INFO_RUNLOCK(&V_tcbinfo);
> -	else if (ti_locked == TI_WLOCKED)
> +	if (ti_locked == TI_WLOCKED) {
>  		INP_INFO_WUNLOCK(&V_tcbinfo);
> -	else
> -		panic("%s: dropwithreset ti_locked %d", __func__, ti_locked);
> -	ti_locked = TI_UNLOCKED;
> +		ti_locked = TI_UNLOCKED;
> +	}
> +#ifdef INVARIANTS
> +	else {
> +		KASSERT(ti_locked == TI_UNLOCKED, ("%s: dropwithreset "
> +		    "ti_locked: %d", __func__, ti_locked));
> +		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
> +	}
> +#endif
>  
>  	if (inp != NULL) {
>  		tcp_dropwithreset(m, th, tp, tlen, rstreason);
> @@ -1388,13 +1381,17 @@ dropwithreset:
>  	goto drop;
>  
>  dropunlock:
> -	if (ti_locked == TI_RLOCKED)
> -		INP_INFO_RUNLOCK(&V_tcbinfo);
> -	else if (ti_locked == TI_WLOCKED)
> +	if (ti_locked == TI_WLOCKED) {
>  		INP_INFO_WUNLOCK(&V_tcbinfo);
> -	else
> -		panic("%s: dropunlock ti_locked %d", __func__, ti_locked);
> -	ti_locked = TI_UNLOCKED;
> +		ti_locked = TI_UNLOCKED;
> +	}
> +#ifdef INVARIANTS
> +	else {
> +		KASSERT(ti_locked == TI_UNLOCKED, ("%s: dropunlock "
> +		    "ti_locked: %d", __func__, ti_locked));
> +		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
> +	}
> +#endif
>  
>  	if (inp != NULL)
>  		INP_WUNLOCK(inp);
> @@ -1449,13 +1446,13 @@ tcp_do_segment(struct mbuf *m, struct tc
>  		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
>  	} else {
>  #ifdef INVARIANTS
> -		if (ti_locked == TI_RLOCKED)
> -			INP_INFO_RLOCK_ASSERT(&V_tcbinfo);
> 
> *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
> _______________________________________________
> svn-src-head@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/svn-src-head
> To unsubscribe, send any mail to "svn-src-head-unsubscribe@freebsd.org"
> 



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20110604143043.GE17764>