Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 4 Jun 2011 17:27:56 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Kristof Provost <kristof@sigsegv.be>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r222488 - in head/sys: contrib/pf/net netinet netinet/ipfw netinet6
Message-ID:  <alpine.BSF.2.00.1106041726570.74164@fledge.watson.org>
In-Reply-To: <20110604143043.GE17764@nereid>
References:  <201105300943.p4U9htjI070096@svn.freebsd.org> <20110604143043.GE17764@nereid>

next in thread | previous in thread | raw e-mail | index | archive | help

On Sat, 4 Jun 2011, Kristof Provost wrote:

> I'm seeing a panic when I start natd, which I suspect is related to this 
> commit:

And, I believe now fixed in the just-committed r222690.  My suggestion is that 
people running -CURRENT should generally keep INVARIANTS and WITNESS on -- 
they introduce some overhead, but are invaluable in tracking down problems.

Thanks!

Robert

>
> panic: Lock pcbinfohash not exclusively locked @
> /usr/src/sys/netinet/in_pcb.c:323
>
> Backtrace:
> panic() ...
> _rw_assert() at _rw_assert+0x18d
> in_pcbbind() at in_pcbbind+0xf5
> div_bind() at div_bind+0xb9
> sobind() at sobind()+0x79
> kern_bind() at kern_bind+0xde
> bind() at bind()+0x41
> syscallenter() at syscallenter+0x1cb
> syscall() at syscall+0x4c
> Xfast_syscall() at Xfast_syscall+0xdd
>
> div_bind probably also needs to surround the call to in_pcbbind with
> INP_HASHW(UN)LOCK(...)
>
> I'm currently running 222680. I've only now seen the issue, but I've also
> just now activated INVARIANTS.
>
> Regards,
> Kristof
>
> On 2011-05-30 09:43:55 (+0000), Robert Watson <rwatson@FreeBSD.org> wrote:
>> Author: rwatson
>> Date: Mon May 30 09:43:55 2011
>> New Revision: 222488
>> URL: http://svn.freebsd.org/changeset/base/222488
>>
>> Log:
>>   Decompose the current single inpcbinfo lock into two locks:
>>
>>   - The existing ipi_lock continues to protect the global inpcb list and
>>     inpcb counter.  This lock is now relegated to a small number of
>>     allocation and free operations, and occasional operations that walk
>>     all connections (including, awkwardly, certain UDP multicast receive
>>     operations -- something to revisit).
>>
>>   - A new ipi_hash_lock protects the two inpcbinfo hash tables for
>>     looking up connections and bound sockets, manipulated using new
>>     INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
>>     the 4-tuple address space.
>>
>>   Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
>>   connection locks, so may be acquired while manipulating a connection on
>>   which a lock is already held, avoiding the need to acquire the inpcbinfo
>>   lock preemptively when a binding change might later be required.  As a
>>   result, however, lookup operations necessarily go through a reference
>>   acquire while holding the lookup lock, later acquiring an inpcb lock --
>>   if required.
>>
>>   A new function in_pcblookup() looks up connections, and accepts flags
>>   indicating how to return the inpcb.  Due to lock order changes, callers
>>   no longer need acquire locks before performing a lookup: the lookup
>>   routine will acquire the ipi_hash_lock as needed.  In the future, it will
>>   also be able to use alternative lookup and locking strategies
>>   transparently to callers, such as pcbgroup lookup.  New lookup flags are,
>>   supplementing the existing INPLOOKUP_WILDCARD flag:
>>
>>     INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
>>     INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
>>
>>   Callers must pass exactly one of these flags (for the time being).
>>
>>   Some notes:
>>
>>   - All protocols are updated to work within the new regime; especially,
>>     TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
>>     eliminated, and global hash lock hold times are dramatically reduced
>>     compared to previous locking.
>>   - The TCP syncache still relies on the pcbinfo lock, something that we
>>     may want to revisit.
>>   - Support for reverting to the FreeBSD 7.x locking strategy in TCP input
>>     is no longer available -- hash lookup locks are now held only very
>>     briefly during inpcb lookup, rather than for potentially extended
>>     periods.  However, the pcbinfo ipi_lock will still be acquired if a
>>     connection state might change such that a connection is added or
>>     removed.
>>   - Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
>>     due to maintaining their own hash tables.
>>   - The interface in6_pcblookup_hash_locked() is maintained, which allows
>>     callers to acquire hash locks and perform one or more lookups atomically
>>     with 4-tuple allocation: this is required only for TCPv6, as there is no
>>     in6_pcbconnect_setup(), which there should be.
>>   - UDPv6 locking remains significantly more conservative than UDPv4
>>     locking, which relates to source address selection.  This needs
>>     attention, as it likely significantly reduces parallelism in this code
>>     for multithreaded socket use (such as in BIND).
>>   - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
>>     somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
>>     is no longer sufficient.  A second check once the inpcb lock is held
>>     should do the trick, keeping the general case from requiring the inpcb
>>     lock for every inpcb visited.
>>   - This work reminds us that we need to revisit locking of the v4/v6 flags,
>>     which may be accessed lock-free both before and after this change.
>>   - Right now, a single lock name is used for the pcbhash lock -- this is
>>     undesirable, and probably another argument is required to take care of
>>     this (or a char array name field in the pcbinfo?).
>>
>>   This is not an MFC candidate for 8.x due to its impact on lookup and
>>   locking semantics.  It's possible some of these issues could be worked
>>   around with compatibility wrappers, if necessary.
>>
>>   Reviewed by:    bz
>>   Sponsored by:   Juniper Networks, Inc.
>>
>> Modified:
>>   head/sys/contrib/pf/net/pf.c
>>   head/sys/netinet/in_pcb.c
>>   head/sys/netinet/in_pcb.h
>>   head/sys/netinet/ip_divert.c
>>   head/sys/netinet/ipfw/ip_fw2.c
>>   head/sys/netinet/raw_ip.c
>>   head/sys/netinet/siftr.c
>>   head/sys/netinet/tcp_input.c
>>   head/sys/netinet/tcp_subr.c
>>   head/sys/netinet/tcp_syncache.c
>>   head/sys/netinet/tcp_timer.c
>>   head/sys/netinet/tcp_usrreq.c
>>   head/sys/netinet/udp_usrreq.c
>>   head/sys/netinet6/in6_pcb.c
>>   head/sys/netinet6/in6_pcb.h
>>   head/sys/netinet6/in6_src.c
>>   head/sys/netinet6/udp6_usrreq.c
>>
>> Modified: head/sys/contrib/pf/net/pf.c
>> ==============================================================================
>> --- head/sys/contrib/pf/net/pf.c	Mon May 30 09:41:38 2011	(r222487)
>> +++ head/sys/contrib/pf/net/pf.c	Mon May 30 09:43:55 2011	(r222488)
>> @@ -3034,16 +3034,14 @@ pf_socket_lookup(int direction, struct p
>>  #ifdef INET
>>  	case AF_INET:
>>  #ifdef __FreeBSD__
>> -		INP_INFO_RLOCK(pi);	/* XXX LOR */
>> -		inp = in_pcblookup_hash(pi, saddr->v4, sport, daddr->v4,
>> -			dport, 0, NULL);
>> +		inp = in_pcblookup(pi, saddr->v4, sport, daddr->v4,
>> +			dport, INPLOOKUP_RLOCKPCB, NULL);
>>  		if (inp == NULL) {
>> -			inp = in_pcblookup_hash(pi, saddr->v4, sport,
>> -			   daddr->v4, dport, INPLOOKUP_WILDCARD, NULL);
>> -			if(inp == NULL) {
>> -				INP_INFO_RUNLOCK(pi);
>> +			inp = in_pcblookup(pi, saddr->v4, sport,
>> +			   daddr->v4, dport, INPLOOKUP_WILDCARD |
>> +			   INPLOOKUP_RLOCKPCB, NULL);
>> +			if (inp == NULL)
>>  				return (-1);
>> -			}
>>  		}
>>  #else
>>  		inp = in_pcbhashlookup(tb, saddr->v4, sport, daddr->v4, dport);
>> @@ -3058,16 +3056,14 @@ pf_socket_lookup(int direction, struct p
>>  #ifdef INET6
>>  	case AF_INET6:
>>  #ifdef __FreeBSD__
>> -		INP_INFO_RLOCK(pi);
>> -		inp = in6_pcblookup_hash(pi, &saddr->v6, sport,
>> -			&daddr->v6, dport, 0, NULL);
>> +		inp = in6_pcblookup(pi, &saddr->v6, sport,
>> +			&daddr->v6, dport, INPLOOKUP_RLOCKPCB, NULL);
>>  		if (inp == NULL) {
>> -			inp = in6_pcblookup_hash(pi, &saddr->v6, sport,
>> -			&daddr->v6, dport, INPLOOKUP_WILDCARD, NULL);
>> -			if (inp == NULL) {
>> -				INP_INFO_RUNLOCK(pi);
>> +			inp = in6_pcblookup(pi, &saddr->v6, sport,
>> +			    &daddr->v6, dport, INPLOOKUP_WILDCARD |
>> +			    INPLOOKUP_RLOCKPCB, NULL);
>> +			if (inp == NULL)
>>  				return (-1);
>> -			}
>>  		}
>>  #else
>>  		inp = in6_pcbhashlookup(tb, &saddr->v6, sport, &daddr->v6,
>> @@ -3085,9 +3081,10 @@ pf_socket_lookup(int direction, struct p
>>  		return (-1);
>>  	}
>>  #ifdef __FreeBSD__
>> +	INP_RLOCK_ASSERT(inp);
>>  	pd->lookup.uid = inp->inp_cred->cr_uid;
>>  	pd->lookup.gid = inp->inp_cred->cr_groups[0];
>> -	INP_INFO_RUNLOCK(pi);
>> +	INP_RUNLOCK(inp);
>>  #else
>>  	pd->lookup.uid = inp->inp_socket->so_euid;
>>  	pd->lookup.gid = inp->inp_socket->so_egid;
>>
>> Modified: head/sys/netinet/in_pcb.c
>> ==============================================================================
>> --- head/sys/netinet/in_pcb.c	Mon May 30 09:41:38 2011	(r222487)
>> +++ head/sys/netinet/in_pcb.c	Mon May 30 09:43:55 2011	(r222488)
>> @@ -127,6 +127,10 @@ static VNET_DEFINE(int, ipport_tcplastco
>>
>>  #define	V_ipport_tcplastcount		VNET(ipport_tcplastcount)
>>
>> +static struct inpcb	*in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo,
>> +			    struct in_addr faddr, u_int fport_arg,
>> +			    struct in_addr laddr, u_int lport_arg,
>> +			    int lookupflags, struct ifnet *ifp);
>>  static void	in_pcbremlists(struct inpcb *inp);
>>
>>  #ifdef INET
>> @@ -212,11 +216,13 @@ in_pcbinfo_init(struct inpcbinfo *pcbinf
>>  {
>>
>>  	INP_INFO_LOCK_INIT(pcbinfo, name);
>> +	INP_HASH_LOCK_INIT(pcbinfo, "pcbinfohash");	/* XXXRW: argument? */
>>  #ifdef VIMAGE
>>  	pcbinfo->ipi_vnet = curvnet;
>>  #endif
>>  	pcbinfo->ipi_listhead = listhead;
>>  	LIST_INIT(pcbinfo->ipi_listhead);
>> +	pcbinfo->ipi_count = 0;
>>  	pcbinfo->ipi_hashbase = hashinit(hash_nelements, M_PCB,
>>  	    &pcbinfo->ipi_hashmask);
>>  	pcbinfo->ipi_porthashbase = hashinit(porthash_nelements, M_PCB,
>> @@ -234,10 +240,14 @@ void
>>  in_pcbinfo_destroy(struct inpcbinfo *pcbinfo)
>>  {
>>
>> +	KASSERT(pcbinfo->ipi_count == 0,
>> +	    ("%s: ipi_count = %u", __func__, pcbinfo->ipi_count));
>> +
>>  	hashdestroy(pcbinfo->ipi_hashbase, M_PCB, pcbinfo->ipi_hashmask);
>>  	hashdestroy(pcbinfo->ipi_porthashbase, M_PCB,
>>  	    pcbinfo->ipi_porthashmask);
>>  	uma_zdestroy(pcbinfo->ipi_zone);
>> +	INP_HASH_LOCK_DESTROY(pcbinfo);
>>  	INP_INFO_LOCK_DESTROY(pcbinfo);
>>  }
>>
>> @@ -309,8 +319,8 @@ in_pcbbind(struct inpcb *inp, struct soc
>>  {
>>  	int anonport, error;
>>
>> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>>  	INP_WLOCK_ASSERT(inp);
>> +	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
>>
>>  	if (inp->inp_lport != 0 || inp->inp_laddr.s_addr != INADDR_ANY)
>>  		return (EINVAL);
>> @@ -351,8 +361,8 @@ in_pcb_lport(struct inpcb *inp, struct i
>>  	 * Because no actual state changes occur here, a global write lock on
>>  	 * the pcbinfo isn't required.
>>  	 */
>> -	INP_INFO_LOCK_ASSERT(pcbinfo);
>>  	INP_LOCK_ASSERT(inp);
>> +	INP_HASH_LOCK_ASSERT(pcbinfo);
>>
>>  	if (inp->inp_flags & INP_HIGHPORT) {
>>  		first = V_ipport_hifirstauto;	/* sysctl */
>> @@ -473,11 +483,10 @@ in_pcbbind_setup(struct inpcb *inp, stru
>>  	int error;
>>
>>  	/*
>> -	 * Because no actual state changes occur here, a global write lock on
>> -	 * the pcbinfo isn't required.
>> +	 * No state changes, so read locks are sufficient here.
>>  	 */
>> -	INP_INFO_LOCK_ASSERT(pcbinfo);
>>  	INP_LOCK_ASSERT(inp);
>> +	INP_HASH_LOCK_ASSERT(pcbinfo);
>>
>>  	if (TAILQ_EMPTY(&V_in_ifaddrhead)) /* XXX broken! */
>>  		return (EADDRNOTAVAIL);
>> @@ -618,8 +627,8 @@ in_pcbconnect(struct inpcb *inp, struct
>>  	in_addr_t laddr, faddr;
>>  	int anonport, error;
>>
>> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>>  	INP_WLOCK_ASSERT(inp);
>> +	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
>>
>>  	lport = inp->inp_lport;
>>  	laddr = inp->inp_laddr.s_addr;
>> @@ -907,8 +916,8 @@ in_pcbconnect_setup(struct inpcb *inp, s
>>  	 * Because a global state change doesn't actually occur here, a read
>>  	 * lock is sufficient.
>>  	 */
>> -	INP_INFO_LOCK_ASSERT(inp->inp_pcbinfo);
>>  	INP_LOCK_ASSERT(inp);
>> +	INP_HASH_LOCK_ASSERT(inp->inp_pcbinfo);
>>
>>  	if (oinpp != NULL)
>>  		*oinpp = NULL;
>> @@ -983,8 +992,8 @@ in_pcbconnect_setup(struct inpcb *inp, s
>>  		if (error)
>>  			return (error);
>>  	}
>> -	oinp = in_pcblookup_hash(inp->inp_pcbinfo, faddr, fport, laddr, lport,
>> -	    0, NULL);
>> +	oinp = in_pcblookup_hash_locked(inp->inp_pcbinfo, faddr, fport,
>> +	    laddr, lport, 0, NULL);
>>  	if (oinp != NULL) {
>>  		if (oinpp != NULL)
>>  			*oinpp = oinp;
>> @@ -1007,8 +1016,8 @@ void
>>  in_pcbdisconnect(struct inpcb *inp)
>>  {
>>
>> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>>  	INP_WLOCK_ASSERT(inp);
>> +	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
>>
>>  	inp->inp_faddr.s_addr = INADDR_ANY;
>>  	inp->inp_fport = 0;
>> @@ -1187,19 +1196,24 @@ void
>>  in_pcbdrop(struct inpcb *inp)
>>  {
>>
>> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>>  	INP_WLOCK_ASSERT(inp);
>>
>> +	/*
>> +	 * XXXRW: Possibly we should protect the setting of INP_DROPPED with
>> +	 * the hash lock...?
>> +	 */
>>  	inp->inp_flags |= INP_DROPPED;
>>  	if (inp->inp_flags & INP_INHASHLIST) {
>>  		struct inpcbport *phd = inp->inp_phd;
>>
>> +		INP_HASH_WLOCK(inp->inp_pcbinfo);
>>  		LIST_REMOVE(inp, inp_hash);
>>  		LIST_REMOVE(inp, inp_portlist);
>>  		if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
>>  			LIST_REMOVE(phd, phd_hash);
>>  			free(phd, M_PCB);
>>  		}
>> +		INP_HASH_WUNLOCK(inp->inp_pcbinfo);
>>  		inp->inp_flags &= ~INP_INHASHLIST;
>>  	}
>>  }
>> @@ -1328,7 +1342,8 @@ in_pcbpurgeif0(struct inpcbinfo *pcbinfo
>>  }
>>
>>  /*
>> - * Lookup a PCB based on the local address and port.
>> + * Lookup a PCB based on the local address and port.  Caller must hold the
>> + * hash lock.  No inpcb locks or references are acquired.
>>   */
>>  #define INP_LOOKUP_MAPPED_PCB_COST	3
>>  struct inpcb *
>> @@ -1346,7 +1361,7 @@ in_pcblookup_local(struct inpcbinfo *pcb
>>  	KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
>>  	    ("%s: invalid lookup flags %d", __func__, lookupflags));
>>
>> -	INP_INFO_LOCK_ASSERT(pcbinfo);
>> +	INP_HASH_LOCK_ASSERT(pcbinfo);
>>
>>  	if ((lookupflags & INPLOOKUP_WILDCARD) == 0) {
>>  		struct inpcbhead *head;
>> @@ -1450,10 +1465,12 @@ in_pcblookup_local(struct inpcbinfo *pcb
>>  #undef INP_LOOKUP_MAPPED_PCB_COST
>>
>>  /*
>> - * Lookup PCB in hash list.
>> + * Lookup PCB in hash list, using pcbinfo tables.  This variation assumes
>> + * that the caller has locked the hash list, and will not perform any further
>> + * locking or reference operations on either the hash list or the connection.
>>   */
>> -struct inpcb *
>> -in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
>> +static struct inpcb *
>> +in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr,
>>      u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags,
>>      struct ifnet *ifp)
>>  {
>> @@ -1464,7 +1481,7 @@ in_pcblookup_hash(struct inpcbinfo *pcbi
>>  	KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
>>  	    ("%s: invalid lookup flags %d", __func__, lookupflags));
>>
>> -	INP_INFO_LOCK_ASSERT(pcbinfo);
>> +	INP_HASH_LOCK_ASSERT(pcbinfo);
>>
>>  	/*
>>  	 * First look for an exact match.
>> @@ -1574,6 +1591,56 @@ in_pcblookup_hash(struct inpcbinfo *pcbi
>>
>>  	return (NULL);
>>  }
>> +
>> +/*
>> + * Lookup PCB in hash list, using pcbinfo tables.  This variation locks the
>> + * hash list lock, and will return the inpcb locked (i.e., requires
>> + * INPLOOKUP_LOCKPCB).
>> + */
>> +static struct inpcb *
>> +in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
>> +    u_int fport, struct in_addr laddr, u_int lport, int lookupflags,
>> +    struct ifnet *ifp)
>> +{
>> +	struct inpcb *inp;
>> +
>> +	INP_HASH_RLOCK(pcbinfo);
>> +	inp = in_pcblookup_hash_locked(pcbinfo, faddr, fport, laddr, lport,
>> +	    (lookupflags & ~(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)), ifp);
>> +	if (inp != NULL) {
>> +		in_pcbref(inp);
>> +		INP_HASH_RUNLOCK(pcbinfo);
>> +		if (lookupflags & INPLOOKUP_WLOCKPCB) {
>> +			INP_WLOCK(inp);
>> +			if (in_pcbrele_wlocked(inp))
>> +				return (NULL);
>> +		} else if (lookupflags & INPLOOKUP_RLOCKPCB) {
>> +			INP_RLOCK(inp);
>> +			if (in_pcbrele_rlocked(inp))
>> +				return (NULL);
>> +		} else
>> +			panic("%s: locking bug", __func__);
>> +	} else
>> +		INP_HASH_RUNLOCK(pcbinfo);
>> +	return (inp);
>> +}
>> +
>> +/*
>> + * Public inpcb lookup routines, accepting a 4-tuple.
>> + */
>> +struct inpcb *
>> +in_pcblookup(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport,
>> +    struct in_addr laddr, u_int lport, int lookupflags, struct ifnet *ifp)
>> +{
>> +
>> +	KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
>> +	    ("%s: invalid lookup flags %d", __func__, lookupflags));
>> +	KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
>> +	    ("%s: LOCKPCB not set", __func__));
>> +
>> +	return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
>> +	    lookupflags, ifp));
>> +}
>>  #endif /* INET */
>>
>>  /*
>> @@ -1588,8 +1655,9 @@ in_pcbinshash(struct inpcb *inp)
>>  	struct inpcbport *phd;
>>  	u_int32_t hashkey_faddr;
>>
>> -	INP_INFO_WLOCK_ASSERT(pcbinfo);
>>  	INP_WLOCK_ASSERT(inp);
>> +	INP_HASH_WLOCK_ASSERT(pcbinfo);
>> +
>>  	KASSERT((inp->inp_flags & INP_INHASHLIST) == 0,
>>  	    ("in_pcbinshash: INP_INHASHLIST"));
>>
>> @@ -1645,8 +1713,9 @@ in_pcbrehash(struct inpcb *inp)
>>  	struct inpcbhead *head;
>>  	u_int32_t hashkey_faddr;
>>
>> -	INP_INFO_WLOCK_ASSERT(pcbinfo);
>>  	INP_WLOCK_ASSERT(inp);
>> +	INP_HASH_WLOCK_ASSERT(pcbinfo);
>> +
>>  	KASSERT(inp->inp_flags & INP_INHASHLIST,
>>  	    ("in_pcbrehash: !INP_INHASHLIST"));
>>
>> @@ -1679,12 +1748,14 @@ in_pcbremlists(struct inpcb *inp)
>>  	if (inp->inp_flags & INP_INHASHLIST) {
>>  		struct inpcbport *phd = inp->inp_phd;
>>
>> +		INP_HASH_WLOCK(pcbinfo);
>>  		LIST_REMOVE(inp, inp_hash);
>>  		LIST_REMOVE(inp, inp_portlist);
>>  		if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
>>  			LIST_REMOVE(phd, phd_hash);
>>  			free(phd, M_PCB);
>>  		}
>> +		INP_HASH_WUNLOCK(pcbinfo);
>>  		inp->inp_flags &= ~INP_INHASHLIST;
>>  	}
>>  	LIST_REMOVE(inp, inp_list);
>>
>> Modified: head/sys/netinet/in_pcb.h
>> ==============================================================================
>> --- head/sys/netinet/in_pcb.h	Mon May 30 09:41:38 2011	(r222487)
>> +++ head/sys/netinet/in_pcb.h	Mon May 30 09:43:55 2011	(r222488)
>> @@ -268,22 +268,22 @@ struct inpcbport {
>>   * Global data structure for each high-level protocol (UDP, TCP, ...) in both
>>   * IPv4 and IPv6.  Holds inpcb lists and information for managing them.
>>   *
>> - * Each pcbinfo is protected by ipi_lock, covering mutable global fields (such
>> - * as the global pcb list) and hashed lookup tables.  The lock order is:
>> + * Each pcbinfo is protected by two locks: ipi_lock and ipi_hash_lock,
>> + * the former covering mutable global fields (such as the global pcb list),
>> + * and the latter covering the hashed lookup tables.  The lock order is:
>>   *
>> - *    ipi_lock (before) inpcb locks
>> + *    ipi_lock (before) inpcb locks (before) ipi_hash_lock
>>   *
>>   * Locking key:
>>   *
>>   * (c) Constant or nearly constant after initialisation
>>   * (g) Locked by ipi_lock
>> - * (h) Read using either ipi_lock or inpcb lock; write requires both.
>> + * (h) Read using either ipi_hash_lock or inpcb lock; write requires both.
>>   * (x) Synchronisation properties poorly defined
>>   */
>>  struct inpcbinfo {
>>  	/*
>> -	 * Global lock protecting global inpcb list, inpcb count, hash tables,
>> -	 * etc.
>> +	 * Global lock protecting global inpcb list, inpcb count, etc.
>>  	 */
>>  	struct rwlock		 ipi_lock;
>>
>> @@ -312,17 +312,22 @@ struct inpcbinfo {
>>  	struct	uma_zone	*ipi_zone;		/* (c) */
>>
>>  	/*
>> +	 * Global lock protecting hash lookup tables.
>> +	 */
>> +	struct rwlock		 ipi_hash_lock;
>> +
>> +	/*
>>  	 * Global hash of inpcbs, hashed by local and foreign addresses and
>>  	 * port numbers.
>>  	 */
>> -	struct inpcbhead	*ipi_hashbase;		/* (g) */
>> -	u_long			 ipi_hashmask;		/* (g) */
>> +	struct inpcbhead	*ipi_hashbase;		/* (h) */
>> +	u_long			 ipi_hashmask;		/* (h) */
>>
>>  	/*
>>  	 * Global hash of inpcbs, hashed by only local port number.
>>  	 */
>> -	struct inpcbporthead	*ipi_porthashbase;	/* (g) */
>> -	u_long			 ipi_porthashmask;	/* (g) */
>> +	struct inpcbporthead	*ipi_porthashbase;	/* (h) */
>> +	u_long			 ipi_porthashmask;	/* (h) */
>>
>>  	/*
>>  	 * Pointer to network stack instance
>> @@ -406,6 +411,18 @@ void 	inp_4tuple_get(struct inpcb *inp,
>>  #define INP_INFO_WLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_lock, RA_WLOCKED)
>>  #define INP_INFO_UNLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_lock, RA_UNLOCKED)
>>
>> +#define	INP_HASH_LOCK_INIT(ipi, d) \
>> +	rw_init_flags(&(ipi)->ipi_hash_lock, (d), 0)
>> +#define	INP_HASH_LOCK_DESTROY(ipi)	rw_destroy(&(ipi)->ipi_hash_lock)
>> +#define	INP_HASH_RLOCK(ipi)		rw_rlock(&(ipi)->ipi_hash_lock)
>> +#define	INP_HASH_WLOCK(ipi)		rw_wlock(&(ipi)->ipi_hash_lock)
>> +#define	INP_HASH_RUNLOCK(ipi)		rw_runlock(&(ipi)->ipi_hash_lock)
>> +#define	INP_HASH_WUNLOCK(ipi)		rw_wunlock(&(ipi)->ipi_hash_lock)
>> +#define	INP_HASH_LOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_hash_lock, \
>> +					    RA_LOCKED)
>> +#define	INP_HASH_WLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_hash_lock, \
>> +					    RA_WLOCKED)
>> +
>>  #define INP_PCBHASH(faddr, lport, fport, mask) \
>>  	(((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
>>  #define INP_PCBPORTHASH(lport, mask) \
>> @@ -466,7 +483,16 @@ void 	inp_4tuple_get(struct inpcb *inp,
>>  #define	INP_LLE_VALID		0x00000001 /* cached lle is valid */
>>  #define	INP_RT_VALID		0x00000002 /* cached rtentry is valid */
>>
>> -#define	INPLOOKUP_WILDCARD	1
>> +/*
>> + * Flags passed to in_pcblookup*() functions.
>> + */
>> +#define	INPLOOKUP_WILDCARD	0x00000001	/* Allow wildcard sockets. */
>> +#define	INPLOOKUP_RLOCKPCB	0x00000002	/* Return inpcb read-locked. */
>> +#define	INPLOOKUP_WLOCKPCB	0x00000004	/* Return inpcb write-locked. */
>> +
>> +#define	INPLOOKUP_MASK	(INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB | \
>> +			    INPLOOKUP_WLOCKPCB)
>> +
>>  #define	sotoinpcb(so)	((struct inpcb *)(so)->so_pcb)
>>  #define	sotoin6pcb(so)	sotoinpcb(so) /* for KAME src sync over BSD*'s */
>>
>> @@ -527,7 +553,7 @@ struct inpcb *
>>  	in_pcblookup_local(struct inpcbinfo *,
>>  	    struct in_addr, u_short, int, struct ucred *);
>>  struct inpcb *
>> -	in_pcblookup_hash(struct inpcbinfo *, struct in_addr, u_int,
>> +	in_pcblookup(struct inpcbinfo *, struct in_addr, u_int,
>>  	    struct in_addr, u_int, int, struct ifnet *);
>>  void	in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr,
>>  	    int, struct inpcb *(*)(struct inpcb *, int));
>>
>> Modified: head/sys/netinet/ip_divert.c
>> ==============================================================================
>> --- head/sys/netinet/ip_divert.c	Mon May 30 09:41:38 2011	(r222487)
>> +++ head/sys/netinet/ip_divert.c	Mon May 30 09:43:55 2011	(r222488)
>> @@ -659,9 +659,9 @@ div_pcblist(SYSCTL_HANDLER_ARGS)
>>  	INP_INFO_WLOCK(&V_divcbinfo);
>>  	for (i = 0; i < n; i++) {
>>  		inp = inp_list[i];
>> -		INP_WLOCK(inp);
>> -		if (!in_pcbrele(inp))
>> -			INP_WUNLOCK(inp);
>> +		INP_RLOCK(inp);
>> +		if (!in_pcbrele_rlocked(inp))
>> +			INP_RUNLOCK(inp);
>>  	}
>>  	INP_INFO_WUNLOCK(&V_divcbinfo);
>>
>>
>> Modified: head/sys/netinet/ipfw/ip_fw2.c
>> ==============================================================================
>> --- head/sys/netinet/ipfw/ip_fw2.c	Mon May 30 09:41:38 2011	(r222487)
>> +++ head/sys/netinet/ipfw/ip_fw2.c	Mon May 30 09:43:55 2011	(r222488)
>> @@ -657,7 +657,7 @@ check_uidgid(ipfw_insn_u32 *insn, int pr
>>  	    (struct bsd_ucred *)uc, ugid_lookupp, ((struct mbuf *)inp)->m_skb);
>>  #else  /* FreeBSD */
>>  	struct inpcbinfo *pi;
>> -	int wildcard;
>> +	int lookupflags;
>>  	struct inpcb *pcb;
>>  	int match;
>>
>> @@ -682,30 +682,31 @@ check_uidgid(ipfw_insn_u32 *insn, int pr
>>  	if (*ugid_lookupp == -1)
>>  		return (0);
>>  	if (proto == IPPROTO_TCP) {
>> -		wildcard = 0;
>> +		lookupflags = 0;
>>  		pi = &V_tcbinfo;
>>  	} else if (proto == IPPROTO_UDP) {
>> -		wildcard = INPLOOKUP_WILDCARD;
>> +		lookupflags = INPLOOKUP_WILDCARD;
>>  		pi = &V_udbinfo;
>>  	} else
>>  		return 0;
>> +	lookupflags |= INPLOOKUP_RLOCKPCB;
>>  	match = 0;
>>  	if (*ugid_lookupp == 0) {
>> -		INP_INFO_RLOCK(pi);
>>  		pcb =  (oif) ?
>> -			in_pcblookup_hash(pi,
>> +			in_pcblookup(pi,
>>  				dst_ip, htons(dst_port),
>>  				src_ip, htons(src_port),
>> -				wildcard, oif) :
>> -			in_pcblookup_hash(pi,
>> +				lookupflags, oif) :
>> +			in_pcblookup(pi,
>>  				src_ip, htons(src_port),
>>  				dst_ip, htons(dst_port),
>> -				wildcard, NULL);
>> +				lookupflags, NULL);
>>  		if (pcb != NULL) {
>> +			INP_RLOCK_ASSERT(pcb);
>>  			*uc = crhold(pcb->inp_cred);
>>  			*ugid_lookupp = 1;
>> +			INP_RUNLOCK(pcb);
>>  		}
>> -		INP_INFO_RUNLOCK(pi);
>>  		if (*ugid_lookupp == 0) {
>>  			/*
>>  			 * We tried and failed, set the variable to -1
>> @@ -1827,21 +1828,32 @@ do {								\
>>  				else
>>  					break;
>>
>> +				/*
>> +				 * XXXRW: so_user_cookie should almost
>> +				 * certainly be inp_user_cookie?
>> +				 */
>> +
>>  				/* For incomming packet, lookup up the
>>  				inpcb using the src/dest ip/port tuple */
>>  				if (inp == NULL) {
>> -					INP_INFO_RLOCK(pi);
>> -					inp = in_pcblookup_hash(pi,
>> +					inp = in_pcblookup(pi,
>>  						src_ip, htons(src_port),
>>  						dst_ip, htons(dst_port),
>> -						0, NULL);
>> -					INP_INFO_RUNLOCK(pi);
>> -				}
>> -
>> -				if (inp && inp->inp_socket) {
>> -					tablearg = inp->inp_socket->so_user_cookie;
>> -					if (tablearg)
>> -						match = 1;
>> +						INPLOOKUP_RLOCKPCB, NULL);
>> +					if (inp != NULL) {
>> +						tablearg =
>> +						    inp->inp_socket->so_user_cookie;
>> +						if (tablearg)
>> +							match = 1;
>> +						INP_RUNLOCK(inp);
>> +					}
>> +				} else {
>> +					if (inp->inp_socket) {
>> +						tablearg =
>> +						    inp->inp_socket->so_user_cookie;
>> +						if (tablearg)
>> +							match = 1;
>> +					}
>>  				}
>>  				break;
>>  			}
>>
>> Modified: head/sys/netinet/raw_ip.c
>> ==============================================================================
>> --- head/sys/netinet/raw_ip.c	Mon May 30 09:41:38 2011	(r222487)
>> +++ head/sys/netinet/raw_ip.c	Mon May 30 09:43:55 2011	(r222488)
>> @@ -226,7 +226,7 @@ rip_append(struct inpcb *last, struct ip
>>  {
>>  	int policyfail = 0;
>>
>> -	INP_RLOCK_ASSERT(last);
>> +	INP_LOCK_ASSERT(last);
>>
>>  #ifdef IPSEC
>>  	/* check AH/ESP integrity. */
>> @@ -834,16 +834,19 @@ rip_detach(struct socket *so)
>>  static void
>>  rip_dodisconnect(struct socket *so, struct inpcb *inp)
>>  {
>> +	struct inpcbinfo *pcbinfo;
>>
>> -	INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo);
>> -	INP_WLOCK_ASSERT(inp);
>> -
>> +	pcbinfo = inp->inp_pcbinfo;
>> +	INP_INFO_WLOCK(pcbinfo);
>> +	INP_WLOCK(inp);
>>  	rip_delhash(inp);
>>  	inp->inp_faddr.s_addr = INADDR_ANY;
>>  	rip_inshash(inp);
>>  	SOCK_LOCK(so);
>>  	so->so_state &= ~SS_ISCONNECTED;
>>  	SOCK_UNLOCK(so);
>> +	INP_WUNLOCK(inp);
>> +	INP_INFO_WUNLOCK(pcbinfo);
>>  }
>>
>>  static void
>> @@ -854,11 +857,7 @@ rip_abort(struct socket *so)
>>  	inp = sotoinpcb(so);
>>  	KASSERT(inp != NULL, ("rip_abort: inp == NULL"));
>>
>> -	INP_INFO_WLOCK(&V_ripcbinfo);
>> -	INP_WLOCK(inp);
>>  	rip_dodisconnect(so, inp);
>> -	INP_WUNLOCK(inp);
>> -	INP_INFO_WUNLOCK(&V_ripcbinfo);
>>  }
>>
>>  static void
>> @@ -869,11 +868,7 @@ rip_close(struct socket *so)
>>  	inp = sotoinpcb(so);
>>  	KASSERT(inp != NULL, ("rip_close: inp == NULL"));
>>
>> -	INP_INFO_WLOCK(&V_ripcbinfo);
>> -	INP_WLOCK(inp);
>>  	rip_dodisconnect(so, inp);
>> -	INP_WUNLOCK(inp);
>> -	INP_INFO_WUNLOCK(&V_ripcbinfo);
>>  }
>>
>>  static int
>> @@ -887,11 +882,7 @@ rip_disconnect(struct socket *so)
>>  	inp = sotoinpcb(so);
>>  	KASSERT(inp != NULL, ("rip_disconnect: inp == NULL"));
>>
>> -	INP_INFO_WLOCK(&V_ripcbinfo);
>> -	INP_WLOCK(inp);
>>  	rip_dodisconnect(so, inp);
>> -	INP_WUNLOCK(inp);
>> -	INP_INFO_WUNLOCK(&V_ripcbinfo);
>>  	return (0);
>>  }
>>
>> @@ -1077,9 +1068,9 @@ rip_pcblist(SYSCTL_HANDLER_ARGS)
>>  	INP_INFO_WLOCK(&V_ripcbinfo);
>>  	for (i = 0; i < n; i++) {
>>  		inp = inp_list[i];
>> -		INP_WLOCK(inp);
>> -		if (!in_pcbrele(inp))
>> -			INP_WUNLOCK(inp);
>> +		INP_RLOCK(inp);
>> +		if (!in_pcbrele_rlocked(inp))
>> +			INP_RUNLOCK(inp);
>>  	}
>>  	INP_INFO_WUNLOCK(&V_ripcbinfo);
>>
>>
>> Modified: head/sys/netinet/siftr.c
>> ==============================================================================
>> --- head/sys/netinet/siftr.c	Mon May 30 09:41:38 2011	(r222487)
>> +++ head/sys/netinet/siftr.c	Mon May 30 09:43:55 2011	(r222488)
>> @@ -696,17 +696,16 @@ siftr_findinpcb(int ipver, struct ip *ip
>>
>>  	/* We need the tcbinfo lock. */
>>  	INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
>> -	INP_INFO_RLOCK(&V_tcbinfo);
>>
>>  	if (dir == PFIL_IN)
>>  		inp = (ipver == INP_IPV4 ?
>> -		    in_pcblookup_hash(&V_tcbinfo, ip->ip_src, sport, ip->ip_dst,
>> -		    dport, 0, m->m_pkthdr.rcvif)
>> +		    in_pcblookup(&V_tcbinfo, ip->ip_src, sport, ip->ip_dst,
>> +		    dport, INPLOOKUP_RLOCKPCB, m->m_pkthdr.rcvif)
>>  		    :
>>  #ifdef SIFTR_IPV6
>> -		    in6_pcblookup_hash(&V_tcbinfo,
>> +		    in6_pcblookup(&V_tcbinfo,
>>  		    &((struct ip6_hdr *)ip)->ip6_src, sport,
>> -		    &((struct ip6_hdr *)ip)->ip6_dst, dport, 0,
>> +		    &((struct ip6_hdr *)ip)->ip6_dst, dport, INPLOOKUP_RLOCKPCB,
>>  		    m->m_pkthdr.rcvif)
>>  #else
>>  		    NULL
>> @@ -715,13 +714,13 @@ siftr_findinpcb(int ipver, struct ip *ip
>>
>>  	else
>>  		inp = (ipver == INP_IPV4 ?
>> -		    in_pcblookup_hash(&V_tcbinfo, ip->ip_dst, dport, ip->ip_src,
>> -		    sport, 0, m->m_pkthdr.rcvif)
>> +		    in_pcblookup(&V_tcbinfo, ip->ip_dst, dport, ip->ip_src,
>> +		    sport, INPLOOKUP_RLOCKPCB, m->m_pkthdr.rcvif)
>>  		    :
>>  #ifdef SIFTR_IPV6
>> -		    in6_pcblookup_hash(&V_tcbinfo,
>> +		    in6_pcblookup(&V_tcbinfo,
>>  		    &((struct ip6_hdr *)ip)->ip6_dst, dport,
>> -		    &((struct ip6_hdr *)ip)->ip6_src, sport, 0,
>> +		    &((struct ip6_hdr *)ip)->ip6_src, sport, INPLOOKUP_RLOCKPCB,
>>  		    m->m_pkthdr.rcvif)
>>  #else
>>  		    NULL
>> @@ -734,12 +733,7 @@ siftr_findinpcb(int ipver, struct ip *ip
>>  			ss->nskip_in_inpcb++;
>>  		else
>>  			ss->nskip_out_inpcb++;
>> -	} else {
>> -		/* Acquire the inpcb lock. */
>> -		INP_UNLOCK_ASSERT(inp);
>> -		INP_RLOCK(inp);
>>  	}
>> -	INP_INFO_RUNLOCK(&V_tcbinfo);
>>
>>  	return (inp);
>>  }
>>
>> Modified: head/sys/netinet/tcp_input.c
>> ==============================================================================
>> --- head/sys/netinet/tcp_input.c	Mon May 30 09:41:38 2011	(r222487)
>> +++ head/sys/netinet/tcp_input.c	Mon May 30 09:43:55 2011	(r222488)
>> @@ -5,6 +5,7 @@
>>   *	Swinburne University of Technology, Melbourne, Australia.
>>   * Copyright (c) 2009-2010 Lawrence Stewart <lstewart@freebsd.org>
>>   * Copyright (c) 2010 The FreeBSD Foundation
>> + * Copyright (c) 2010-2011 Juniper Networks, Inc.
>>   * All rights reserved.
>>   *
>>   * Portions of this software were developed at the Centre for Advanced Internet
>> @@ -16,6 +17,9 @@
>>   * Internet Architectures, Swinburne University of Technology, Melbourne,
>>   * Australia by David Hayes under sponsorship from the FreeBSD Foundation.
>>   *
>> + * Portions of this software were developed by Robert N. M. Watson under
>> + * contract to Juniper Networks, Inc.
>> + *
>>   * Redistribution and use in source and binary forms, with or without
>>   * modification, are permitted provided that the following conditions
>>   * are met:
>> @@ -197,10 +201,6 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
>>      &VNET_NAME(tcp_autorcvbuf_max), 0,
>>      "Max size of automatic receive buffer");
>>
>> -int	tcp_read_locking = 1;
>> -SYSCTL_INT(_net_inet_tcp, OID_AUTO, read_locking, CTLFLAG_RW,
>> -    &tcp_read_locking, 0, "Enable read locking strategy");
>> -
>>  VNET_DEFINE(struct inpcbhead, tcb);
>>  #define	tcb6	tcb  /* for KAME src sync over BSD*'s */
>>  VNET_DEFINE(struct inpcbinfo, tcbinfo);
>> @@ -591,8 +591,7 @@ tcp_input(struct mbuf *m, int off0)
>>  	char *s = NULL;			/* address and port logging */
>>  	int ti_locked;
>>  #define	TI_UNLOCKED	1
>> -#define	TI_RLOCKED	2
>> -#define	TI_WLOCKED	3
>> +#define	TI_WLOCKED	2
>>
>>  #ifdef TCPDEBUG
>>  	/*
>> @@ -756,30 +755,25 @@ tcp_input(struct mbuf *m, int off0)
>>  	drop_hdrlen = off0 + off;
>>
>>  	/*
>> -	 * Locate pcb for segment, which requires a lock on tcbinfo.
>> -	 * Optimisticaly acquire a global read lock rather than a write lock
>> -	 * unless header flags necessarily imply a state change.  There are
>> -	 * two cases where we might discover later we need a write lock
>> -	 * despite the flags: ACKs moving a connection out of the syncache,
>> -	 * and ACKs for a connection in TIMEWAIT.
>> +	 * Locate pcb for segment; if we're likely to add or remove a
>> +	 * connection then first acquire pcbinfo lock.  There are two cases
>> +	 * where we might discover later we need a write lock despite the
>> +	 * flags: ACKs moving a connection out of the syncache, and ACKs for
>> +	 * a connection in TIMEWAIT.
>>  	 */
>> -	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0 ||
>> -	    tcp_read_locking == 0) {
>> +	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0) {
>>  		INP_INFO_WLOCK(&V_tcbinfo);
>>  		ti_locked = TI_WLOCKED;
>> -	} else {
>> -		INP_INFO_RLOCK(&V_tcbinfo);
>> -		ti_locked = TI_RLOCKED;
>> -	}
>> +	} else
>> +		ti_locked = TI_UNLOCKED;
>>
>>  findpcb:
>>  #ifdef INVARIANTS
>> -	if (ti_locked == TI_RLOCKED)
>> -		INP_INFO_RLOCK_ASSERT(&V_tcbinfo);
>> -	else if (ti_locked == TI_WLOCKED)
>> +	if (ti_locked == TI_WLOCKED) {
>>  		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
>> -	else
>> -		panic("%s: findpcb ti_locked %d\n", __func__, ti_locked);
>> +	} else {
>> +		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
>> +	}
>>  #endif
>>
>>  #ifdef INET
>> @@ -797,20 +791,18 @@ findpcb:
>>  		 * Transparently forwarded. Pretend to be the destination.
>>  		 * already got one like this?
>>  		 */
>> -		inp = in_pcblookup_hash(&V_tcbinfo,
>> -					ip->ip_src, th->th_sport,
>> -					ip->ip_dst, th->th_dport,
>> -					0, m->m_pkthdr.rcvif);
>> +		inp = in_pcblookup(&V_tcbinfo, ip->ip_src, th->th_sport,
>> +		    ip->ip_dst, th->th_dport, INPLOOKUP_WLOCKPCB,
>> +		    m->m_pkthdr.rcvif);
>>  		if (!inp) {
>> -			/* It's new.  Try to find the ambushing socket. */
>> -			inp = in_pcblookup_hash(&V_tcbinfo,
>> -						ip->ip_src, th->th_sport,
>> -						next_hop->sin_addr,
>> -						next_hop->sin_port ?
>> -						    ntohs(next_hop->sin_port) :
>> -						    th->th_dport,
>> -						INPLOOKUP_WILDCARD,
>> -						m->m_pkthdr.rcvif);
>> +			/*
>> +			 * It's new.  Try to find the ambushing socket.
>> +			 */
>> +			inp = in_pcblookup(&V_tcbinfo, ip->ip_src,
>> +			    th->th_sport, next_hop->sin_addr,
>> +			    next_hop->sin_port ? ntohs(next_hop->sin_port) :
>> +			    th->th_dport, INPLOOKUP_WILDCARD |
>> +			    INPLOOKUP_WLOCKPCB, m->m_pkthdr.rcvif);
>>  		}
>>  		/* Remove the tag from the packet.  We don't need it anymore. */
>>  		m_tag_delete(m, fwd_tag);
>> @@ -820,21 +812,19 @@ findpcb:
>>  	{
>>  #ifdef INET6
>>  		if (isipv6)
>> -			inp = in6_pcblookup_hash(&V_tcbinfo,
>> -						 &ip6->ip6_src, th->th_sport,
>> -						 &ip6->ip6_dst, th->th_dport,
>> -						 INPLOOKUP_WILDCARD,
>> -						 m->m_pkthdr.rcvif);
>> +			inp = in6_pcblookup(&V_tcbinfo, &ip6->ip6_src,
>> +			    th->th_sport, &ip6->ip6_dst, th->th_dport,
>> +			    INPLOOKUP_WILDCARD | INPLOOKUP_WLOCKPCB,
>> +			    m->m_pkthdr.rcvif);
>>  #endif
>>  #if defined(INET) && defined(INET6)
>>  		else
>>  #endif
>>  #ifdef INET
>> -			inp = in_pcblookup_hash(&V_tcbinfo,
>> -						ip->ip_src, th->th_sport,
>> -						ip->ip_dst, th->th_dport,
>> -						INPLOOKUP_WILDCARD,
>> -						m->m_pkthdr.rcvif);
>> +			inp = in_pcblookup(&V_tcbinfo, ip->ip_src,
>> +			    th->th_sport, ip->ip_dst, th->th_dport,
>> +			    INPLOOKUP_WILDCARD | INPLOOKUP_WLOCKPCB,
>> +			    m->m_pkthdr.rcvif);
>>  #endif
>>  	}
>>
>> @@ -865,7 +855,7 @@ findpcb:
>>  		rstreason = BANDLIM_RST_CLOSEDPORT;
>>  		goto dropwithreset;
>>  	}
>> -	INP_WLOCK(inp);
>> +	INP_WLOCK_ASSERT(inp);
>>  	if (!(inp->inp_flags & INP_HW_FLOWID)
>>  	    && (m->m_flags & M_FLOWID)
>>  	    && ((inp->inp_socket == NULL)
>> @@ -906,28 +896,26 @@ findpcb:
>>  	 * legitimate new connection attempt the old INPCB gets removed and
>>  	 * we can try again to find a listening socket.
>>  	 *
>> -	 * At this point, due to earlier optimism, we may hold a read lock on
>> -	 * the inpcbinfo, rather than a write lock.  If so, we need to
>> -	 * upgrade, or if that fails, acquire a reference on the inpcb, drop
>> -	 * all locks, acquire a global write lock, and then re-acquire the
>> -	 * inpcb lock.  We may at that point discover that another thread has
>> -	 * tried to free the inpcb, in which case we need to loop back and
>> -	 * try to find a new inpcb to deliver to.
>> +	 * At this point, due to earlier optimism, we may hold only an inpcb
>> +	 * lock, and not the inpcbinfo write lock.  If so, we need to try to
>> +	 * acquire it, or if that fails, acquire a reference on the inpcb,
>> +	 * drop all locks, acquire a global write lock, and then re-acquire
>> +	 * the inpcb lock.  We may at that point discover that another thread
>> +	 * has tried to free the inpcb, in which case we need to loop back
>> +	 * and try to find a new inpcb to deliver to.
>> +	 *
>> +	 * XXXRW: It may be time to rethink timewait locking.
>>  	 */
>>  relocked:
>>  	if (inp->inp_flags & INP_TIMEWAIT) {
>> -		KASSERT(ti_locked == TI_RLOCKED || ti_locked == TI_WLOCKED,
>> -		    ("%s: INP_TIMEWAIT ti_locked %d", __func__, ti_locked));
>> -
>> -		if (ti_locked == TI_RLOCKED) {
>> -			if (INP_INFO_TRY_UPGRADE(&V_tcbinfo) == 0) {
>> +		if (ti_locked == TI_UNLOCKED) {
>> +			if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) {
>>  				in_pcbref(inp);
>>  				INP_WUNLOCK(inp);
>> -				INP_INFO_RUNLOCK(&V_tcbinfo);
>>  				INP_INFO_WLOCK(&V_tcbinfo);
>>  				ti_locked = TI_WLOCKED;
>>  				INP_WLOCK(inp);
>> -				if (in_pcbrele(inp)) {
>> +				if (in_pcbrele_wlocked(inp)) {
>>  					inp = NULL;
>>  					goto findpcb;
>>  				}
>> @@ -975,26 +963,24 @@ relocked:
>>
>>  	/*
>>  	 * We've identified a valid inpcb, but it could be that we need an
>> -	 * inpcbinfo write lock and have only a read lock.  In this case,
>> -	 * attempt to upgrade/relock using the same strategy as the TIMEWAIT
>> -	 * case above.  If we relock, we have to jump back to 'relocked' as
>> -	 * the connection might now be in TIMEWAIT.
>> -	 */
>> -	if (tp->t_state != TCPS_ESTABLISHED ||
>> -	    (thflags & (TH_SYN | TH_FIN | TH_RST)) != 0 ||
>> -	    tcp_read_locking == 0) {
>> -		KASSERT(ti_locked == TI_RLOCKED || ti_locked == TI_WLOCKED,
>> -		    ("%s: upgrade check ti_locked %d", __func__, ti_locked));
>> -
>> -		if (ti_locked == TI_RLOCKED) {
>> -			if (INP_INFO_TRY_UPGRADE(&V_tcbinfo) == 0) {
>> +	 * inpcbinfo write lock but don't hold it.  In this case, attempt to
>> +	 * acquire using the same strategy as the TIMEWAIT case above.  If we
>> +	 * relock, we have to jump back to 'relocked' as the connection might
>> +	 * now be in TIMEWAIT.
>> +	 */
>> +#ifdef INVARIANTS
>> +	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0)
>> +		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
>> +#endif
>> +	if (tp->t_state != TCPS_ESTABLISHED) {
>> +		if (ti_locked == TI_UNLOCKED) {
>> +			if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) {
>>  				in_pcbref(inp);
>>  				INP_WUNLOCK(inp);
>> -				INP_INFO_RUNLOCK(&V_tcbinfo);
>>  				INP_INFO_WLOCK(&V_tcbinfo);
>>  				ti_locked = TI_WLOCKED;
>>  				INP_WLOCK(inp);
>> -				if (in_pcbrele(inp)) {
>> +				if (in_pcbrele_wlocked(inp)) {
>>  					inp = NULL;
>>  					goto findpcb;
>>  				}
>> @@ -1027,13 +1013,16 @@ relocked:
>>  	/*
>>  	 * When the socket is accepting connections (the INPCB is in LISTEN
>>  	 * state) we look into the SYN cache if this is a new connection
>> -	 * attempt or the completion of a previous one.
>> +	 * attempt or the completion of a previous one.  Because listen
>> +	 * sockets are never in TCPS_ESTABLISHED, the V_tcbinfo lock will be
>> +	 * held in this case.
>>  	 */
>>  	if (so->so_options & SO_ACCEPTCONN) {
>>  		struct in_conninfo inc;
>>
>>  		KASSERT(tp->t_state == TCPS_LISTEN, ("%s: so accepting but "
>>  		    "tp not listening", __func__));
>> +		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
>>
>>  		bzero(&inc, sizeof(inc));
>>  #ifdef INET6
>> @@ -1371,13 +1360,17 @@ relocked:
>>  	return;
>>
>>  dropwithreset:
>> -	if (ti_locked == TI_RLOCKED)
>> -		INP_INFO_RUNLOCK(&V_tcbinfo);
>> -	else if (ti_locked == TI_WLOCKED)
>> +	if (ti_locked == TI_WLOCKED) {
>>  		INP_INFO_WUNLOCK(&V_tcbinfo);
>> -	else
>> -		panic("%s: dropwithreset ti_locked %d", __func__, ti_locked);
>> -	ti_locked = TI_UNLOCKED;
>> +		ti_locked = TI_UNLOCKED;
>> +	}
>> +#ifdef INVARIANTS
>> +	else {
>> +		KASSERT(ti_locked == TI_UNLOCKED, ("%s: dropwithreset "
>> +		    "ti_locked: %d", __func__, ti_locked));
>> +		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
>> +	}
>> +#endif
>>
>>  	if (inp != NULL) {
>>  		tcp_dropwithreset(m, th, tp, tlen, rstreason);
>> @@ -1388,13 +1381,17 @@ dropwithreset:
>>  	goto drop;
>>
>>  dropunlock:
>> -	if (ti_locked == TI_RLOCKED)
>> -		INP_INFO_RUNLOCK(&V_tcbinfo);
>> -	else if (ti_locked == TI_WLOCKED)
>> +	if (ti_locked == TI_WLOCKED) {
>>  		INP_INFO_WUNLOCK(&V_tcbinfo);
>> -	else
>> -		panic("%s: dropunlock ti_locked %d", __func__, ti_locked);
>> -	ti_locked = TI_UNLOCKED;
>> +		ti_locked = TI_UNLOCKED;
>> +	}
>> +#ifdef INVARIANTS
>> +	else {
>> +		KASSERT(ti_locked == TI_UNLOCKED, ("%s: dropunlock "
>> +		    "ti_locked: %d", __func__, ti_locked));
>> +		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
>> +	}
>> +#endif
>>
>>  	if (inp != NULL)
>>  		INP_WUNLOCK(inp);
>> @@ -1449,13 +1446,13 @@ tcp_do_segment(struct mbuf *m, struct tc
>>  		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
>>  	} else {
>>  #ifdef INVARIANTS
>> -		if (ti_locked == TI_RLOCKED)
>> -			INP_INFO_RLOCK_ASSERT(&V_tcbinfo);
>>
>> *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
>> _______________________________________________
>> svn-src-head@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/svn-src-head
>> To unsubscribe, send any mail to "svn-src-head-unsubscribe@freebsd.org"
>>
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?alpine.BSF.2.00.1106041726570.74164>