From owner-freebsd-bugs@FreeBSD.ORG  Wed Mar 18 12:30:10 2009
Return-Path: <owner-freebsd-bugs@FreeBSD.ORG>
Delivered-To: freebsd-bugs@hub.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7DAC21065726
	for <freebsd-bugs@hub.freebsd.org>;
	Wed, 18 Mar 2009 12:30:10 +0000 (UTC)
	(envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 1B30D8FC28
	for <freebsd-bugs@hub.freebsd.org>;
	Wed, 18 Mar 2009 12:30:09 +0000 (UTC)
	(envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.3/8.14.3) with ESMTP id n2ICU92J046177
	for <freebsd-bugs@freefall.freebsd.org>; Wed, 18 Mar 2009 12:30:09 GMT
	(envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.14.3/8.14.3/Submit) id n2ICU9FF046176;
	Wed, 18 Mar 2009 12:30:09 GMT (envelope-from gnats)
Resent-Date: Wed, 18 Mar 2009 12:30:09 GMT
Resent-Message-Id: <200903181230.n2ICU9FF046176@freefall.freebsd.org>
Resent-From: FreeBSD-gnats-submit@FreeBSD.org (GNATS Filer)
Resent-To: freebsd-bugs@FreeBSD.org
Resent-Reply-To: FreeBSD-gnats-submit@FreeBSD.org,
	Nick Withers <nick@nickwithers.com>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 183D41065675
	for <freebsd-gnats-submit@FreeBSD.org>;
	Wed, 18 Mar 2009 12:24:41 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id 04EDA8FC1A
	for <freebsd-gnats-submit@FreeBSD.org>;
	Wed, 18 Mar 2009 12:24:41 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id n2ICOebZ023892
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 18 Mar 2009 12:24:40 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id n2ICOeJp023891;
	Wed, 18 Mar 2009 12:24:40 GMT (envelope-from nobody)
Message-Id: <200903181224.n2ICOeJp023891@www.freebsd.org>
Date: Wed, 18 Mar 2009 12:24:40 GMT
From: Nick Withers <nick@nickwithers.com>
To: freebsd-gnats-submit@FreeBSD.org
X-Send-Pr-Version: www-3.1
Cc: 
Subject: kern/132774: IPFW with uid/gid/jail rules may lead to lockup
X-BeenThere: freebsd-bugs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Bug reports <freebsd-bugs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-bugs>,
	<mailto:freebsd-bugs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-bugs>
List-Post: <mailto:freebsd-bugs@freebsd.org>
List-Help: <mailto:freebsd-bugs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-bugs>,
	<mailto:freebsd-bugs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 18 Mar 2009 12:30:12 -0000


>Number:         132774
>Category:       kern
>Synopsis:       IPFW with uid/gid/jail rules may lead to lockup
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Mar 18 12:30:09 UTC 2009
>Closed-Date:
>Last-Modified:
>Originator:     Nick Withers
>Release:        7.1-STABLE, circa 2009-03-11
>Organization:
>Environment:
FreeBSD internal.shmon.net 7.1-STABLE FreeBSD 7.1-STABLE #1: Mon Mar 16 20:11:00 EST 2009     root@internal.shmon.net:/usr/obj/usr/src/sys/INTERNAL  amd64
>Description:
(Initially reported on freebsd-stable@ - see http://lists.freebsd.org/pipermail/freebsd-stable/2009-March/048803.html)

Experiencing network lockups...

Breaking into the debugger manually and ps-ing showed the network card
(e.g., "[irq20:  fxp0+]") in state "LL" in "*tcp_sc_h". It seemed user
process(es) trying to access the card at the time were in state "L"
in "*tcp".

This happened on both fxp and rl NICs, both under and without lagg.

DDB's "show alllocks" output (I may have made a bit of a mess of the line-wrappings here, I'm copying-and-pasting from the mailing list archives, after my mailer wrapped-em automatically):
____

Process 31 (irp20: fxp0+) thread 0xffffff00012016e0 (100030)
exclusive rw tcpinp r = 0 (0xffffff000392d570) locked @ /usr/src/sys/netinet/tcp_input.c:480
exclusive rw tcp r = 0 (0xffffffff806dcbe8) locked @ /usr/src/sys/netinet/tcp_input.c:400
Process 17 (swi6: Giant taskq) thread 0xffffff0001173000 (100016)
exclusive sleep mutex Giant r = 0 (0xffffffff80652520) locked @ /usr/src/sys/kern/kern_intr.c:1087
Process 12 (swi4: clock) thread 0xffffff00010c6370 (100003)
shared rw IPFW static rules r = 0 (0xffffffff806db2b8) locked @ /usr/src/sys/netinet/ip_fw2.c:2460
shared rw PFil hoow read/write mutex r = 0 (0xffffffff806dba28) locked @ /usr/src/sys/net/pfil.c:73
exclusive sleep mutex tcp_sc_head r = 0 (0xfffffffe8036c8f8) locked @ /usr/src/sys/kern/kern_timeout.c:241
____

..and "show allchains":
____

chain 1:
 thread 100031 (pid 32, irp22: rl0) blocked on lock 0xffffffff806dcbe8 (rw) "tcp"
 thread 100030 (pid 31, irq20: fxp0+) blocked on lock 0xfffffffe8036c8f8 (sleep mutex) "tcp_sc_head"
 thread 100003 (pid 12, swi4: clock) blocked on lock 0xffffffff806dcbe8 (rw) "tcp"
 thread 100030 (pid 31, irq20: fxp0+) blocked on lock 0xfffffffe8036c8f8 (sleep mutex) "tcp_sc_head"
 thread 100003 (pid 12, swi4: clock) blocked on lock 0xffffffff806dcbe8 (rw) "tcp"
 thread 100030 (pid 31, irq20: fxp0+) blocked on lock 0xfffffffe8036c8f8 (sleep mutex) "tcp_sc_head"
 thread 100003 (pid 12, swi4: clock) blocked on lock 0xffffffff806dcbe8 (rw) "tcp"
(...and so on, these last two seeming to go on forever.)
____

This is with fxp0 and rl0 lagg(4)-ed.

For completeness...
  - "ps" in DDB showed that:
    - PID 32 ("[irp22: rl0]") was in state "LL" on "*tcp"
    - PID 31 ("[irq20: fxp0+]") was in state "LL" on "*tcp_sc_h"
    - PID 12 ("[swi4: clock]") was in state "LL" on "*tcp"
  - "where 32" gave:
____

sched_switch() at sched_switch+0xdf
mi_switch() at mi_switch+0x18b
turnstile_wait() at turnstile_wait+0x1c4
_rw_lock_hard() at _rw_lock_hard+0xa3
_rw_wlock() at _rw_wlock+0x54
tcp_input() at tcp_input+0x318
ip_input() at ip_input+0xaf
ether_demux() at ether_demux+0x1b9
ether_input() at ether_input+0x1bb
rl_rxeof() at rl_rxeof+0x1c8
rl_intr() at rl_intr+0x138
ithread_loop() at ithread_loop+0xe9
fork_exit() at fork_exit+0x112
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe91e0cd30, rbp = 0 ---
____

  - "where 31" gave:
____

Tracing PID 31 tid 100030 td 0xffffff00012016e0
sched_switch() at sched_switch+0xdf
mi_switch() at mi_switch+0x18b
turnstile_wait() at turnstile_wait+0x1c4
_mtx_lock_sleep() at _mtx_lock_sleep+0x76
_mtx_lock_flags() at _mtx_lock_flags+0x95
syncache_lookup() at syncache_lookup+0xee
syncache_expand() at syncache_expand+0x38
tcp_input() at tcp_input+0x99b
ip_input() at ip_input+0xaf
ether_demux() at ether_demux+0x1b9
ether_input() at ether_input+0x1bb
fxp_intr() at fxp_intr+0x224
ithread_loop() at ithread_loop+0xe9
fork_exit() at fork_exit+0x112
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80174d30, rbp = 0 ---
____

  - "where 12" gave:
____

sched_switch() at sched_switch+0xdf
mi_switch() at mi_switch+0x18b
turnstile_wait() at turnstile_wait+0x1c4
_rw_rlock() at _rw_rlock+0x9c
ipfw_chk() at ipfw_chk+0x3ac1
ipfw_check_out() at ipfw_check_out+0xb1
pfil_run_hooks() at pfil_run_hooks+0xac
ip_output() at ip_output+0x357
syncache_respond() at syncache_respond+0x2fd
syncache_timer() at syncache_timer+0x15a
softclock() at softclock+0x270
ithread_loop() at ithread_loop+0xe9
fork_exit() at fork_exit+0x112
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80017d30, rbp = 0 ---
____

  - Before having entered the debugger, the following were logged:
____

lock order reversal:
 1st 0xffffff00032947b0 tcpinp (tcpinp) @ /usr/src/sys/netinet/tcp_timer.c:169
 2nd 0xffffffff806dba28 PFil hook read/write mutex (PFil hook read/write mutex) @ /usr/src/sys/net/pfil.c:73
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
witness_checkorder() at witness_checkorder+0x565
_rw_rlock() at _rw_rlock+0x25
pfil_run_hooks() at pfil_run_hooks+0x44
ip_output() at ip_output+0x357
tcp_output() at tcp_output+0xa1d
tcp_timer_delack() at tcp_timer_delack+0xc7
softclock() at softclock+0x270
ithread_loop() at ithread_loop+0xe9
fork_exit() at fork_exit+0x112
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80017d30, rbp = 0 ---
____

..and:
____

lock order reversal:
 1st 0xfffffffe80365df0 tcp_sc_head (tcp_sc_head) @ /usr/src/sys/kern/kern_timeout.c:241
 2nd 0xffffffff806dba28 PFil hook read/write mutex (PFil hook read/write mutex) @ /usr/src/sys/net/pfil.c:73
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
witness_checkorder() at witness_checkorder+0x565
_rw_rlock() at _rw_rlock+0x25
pfil_run_hooks() at pfil_run_hooks+0x44
ip_output() at ip_output+0x357
syncache_respond() at syncache_respond+0x2fd
syncache_timer() at syncache_timer+0x15a
softclock() at softclock+0x270
ithread_loop() at ithread_loop+0xe9
fork_exit() at fork_exit+0x112
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80017d30, rbp = 0 ---
____

On reboot:
____

lock order reversal:
 1st 0xffffffff806db2b8 IPFW static rules (IPFW static rules) @ /usr/src/sys/netinet/ip_fw2.c:2460
 2nd 0xffffffff806dcbe8 tcp (tcp) @ /usr/src/sys/netinet/ip_fw2.c:2009
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
witness_checkorder() at witness_checkorder+0x565
_rw_rlock() at _rw_rlock+0x25
ipfw_chk() at ipfw_chk+0x3ac1
ipfw_check_out() at ipfw_check_out+0xb1
pfil_run_hooks() at pfil_run_hooks+0xac
ip_output() at ip_output+0x357
tcp_respond() at tcp_respond+0x33a
tcp_dropwithreset() at tcp_dropwithreset+0x131
tcp_do_segment() at tcp_do_segment+0xd93
tcp_input() at tcp_input+0x8dd
ip_input() at ip_input+0xaf
ether_demux() at ether_demux+0x1b9
ether_input() at ether_input+0x1bb
fxp_intr() at fxp_intr+0x224
ithread_loop() at ithread_loop+0xe9
fork_exit() at fork_exit+0x112
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80174d30, rbp = 0 ---
____

I haven't actually checked whether any of these LORs are known innocuous.


>How-To-Repeat:
Truth be told, I have other machines (all x86, but had one ppc recently, too) using IPFW uid rules and have done for years and never had dramas there... But I can generally reproduce these lockups fairly easily on this amd64 machine and am certainly willing to try in the name of squishing this :-)
>Fix:
>From one of rwatson@'s replies on the mailing list (see http://lists.freebsd.org/pipermail/freebsd-stable/2009-March/048821.html):
____

Turns out that the problem is a lock cycle triggered by the syncache calling, 
indirectly, the firewall during output, and the firewall trying to look up the 
connection for the packet.  Thread one:

> Tracing PID 31 tid 100030 td 0xffffff00012016e0
> sched_switch() at sched_switch+0xdf
> mi_switch() at mi_switch+0x18b
> turnstile_wait() at turnstile_wait+0x1c4
> _mtx_lock_sleep() at _mtx_lock_sleep+0x76
> _mtx_lock_flags() at _mtx_lock_flags+0x95
> syncache_lookup() at syncache_lookup+0xee
> syncache_expand() at syncache_expand+0x38
> tcp_input() at tcp_input+0x99b
> ip_input() at ip_input+0xaf
> ether_demux() at ether_demux+0x1b9
> ether_input() at ether_input+0x1bb
> fxp_intr() at fxp_intr+0x224
> ithread_loop() at ithread_loop+0xe9
> fork_exit() at fork_exit+0x112
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xfffffffe80174d30, rbp = 0 ---

This thread holds TCP locks and is trying to acquire the syncache lock. 
Thread two:

> sched_switch() at sched_switch+0xdf
> mi_switch() at mi_switch+0x18b
> turnstile_wait() at turnstile_wait+0x1c4
> _rw_rlock() at _rw_rlock+0x9c
> ipfw_chk() at ipfw_chk+0x3ac1
> ipfw_check_out() at ipfw_check_out+0xb1
> pfil_run_hooks() at pfil_run_hooks+0xac
> ip_output() at ip_output+0x357
> syncache_respond() at syncache_respond+0x2fd
> syncache_timer() at syncache_timer+0x15a
> softclock() at softclock+0x270
> ithread_loop() at ithread_loop+0xe9
> fork_exit() at fork_exit+0x112
> fork_trampoline() at fork_trampoline+0xe

This is the syncache timer holding syncache locks, calling IP output, and IPFW 
trying to acquire TCP locks.

Am I right in thinking that you are using uid/gid/jail firewall rules?  They 
suffer from a fundamental architectural problem in that they require reaching 
"up" to a higher level of the stack at times when it's not always a good idea 
to do so.  In general we solve the problem by passing "down" the inpcb for a 
connection in the output path so that TCP doesn't have to look it up -- 
however, in the case of the syncache we actually don't have the inpcb easily 
in hand (or at least, we have it, but we can't just lock it because syncache 
locks are after TCP locks in the lock order...).  It transpires that what the 
firewall really wants is not the inpcb, but the credential, but those are 
interfaces we can't change right now.

I'll need to think a bit about a proper fix for this, but you'll find the 
problem likely goes away if you eliminate all uid/gid/jail rules from your 
firewall.  You could also tweak the syncache logic not to use a retransmit 
timer, which might slightly extend the time it takes for systems to connect to 
your host in the presence of packet loss, but would eliminate this 
transmission path entirely.  We'll need a real and more general fix, however, 
to commit, and I'll look and see what I can come up with.
____

I'm using alternate IPFW rules which don't employ uid / gid / jail (I was only ever using uid of these) and have been lock-up free for 3 to four days (three days being about the average time I was seeing this lock-up in previously).


>Release-Note:
>Audit-Trail:
>Unformatted: