From owner-freebsd-current@FreeBSD.ORG  Wed Apr 18 07:54:01 2007
Date: Wed, 18 Apr 2007 08:54:00 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Tillman Hodgson <tillman@seekingfire.com>
In-Reply-To: <20070417220339.E2913@fledge.watson.org>
Message-ID: <20070418084345.H2913@fledge.watson.org>
References: <20070417153357.GA1335@seekingfire.com>
	<20070417173005.O42234@fledge.watson.org>
	<20070417181627.GA1225@seekingfire.com>
	<20070417220339.E2913@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: current@freebsd.org
Subject: Re: Panic on boot with April 16 src (lengthy info attached)


On Tue, 17 Apr 2007, Robert Watson wrote:

>> I originally put it in there to work around a LOR that I was experiencing 
>> (based on you mentioning it in an email to current@ Sun 18 Mar 2007 15:50). 
>> http://sources.zabbadoz.net/freebsd/lor/191.html doesn't show any changes 
>> to that particular LOR, do you happen to know if there's any ongoing work 
>> on this? I'm very willing to act as a test system.
>
> I chatted with Andre about the panic earlier this afternoon, and it sounds 
> like the fix is straight forward.  I would anticipate seeing it committed in 
> the near future.
>
> I'll send out an e-mail explaining the above lock order reversal tomorrow 
> morning.  I understand that several people have been looking at this, so 
> perhaps one of those people will reply talking about it before then. :-)

The essential problem of this lock order reversal has to do with the fact that 
higher network stack layers hold locks over lower network stack layers.  For 
example, the lock for a TCP connection is held over the operation to enqueue 
the TCP packet for transmission at a lower layer.  This is necessary in order 
to maintain TCP transmission order into the transmission queue between 
multiple threads operating on the same TCP connection: if the "transmit and
enqueue" operation were not atomic with respect to the same TCP connection in
another thread, quite damaging reordering could take place.  We directly
dispatch the entire outbound network stack from that enqueue point, meaning 
that the per-TCP connection lock is held over that processing path, including 
the firewall.  As a result, PCB locks (TCP connection locks) precede the
firewall locks in the lock order.
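
As a rough illustration of why the connection lock has to cover both steps,
here is a minimal user-space sketch in Python.  It is not kernel code: the
Connection class, send() and the list standing in for the transmission queue
are hypothetical names invented for this example.

```python
import threading


class Connection:
    """Hypothetical stand-in for a TCP connection (PCB) with its own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.next_seq = 0


def send(conn, txqueue, nbytes):
    # The connection lock is held across BOTH steps: picking the sequence
    # number and handing the segment to the lower layer.  If the lock were
    # dropped between the two, another thread could enqueue a later
    # sequence number first.
    with conn.lock:
        seq = conn.next_seq
        conn.next_seq += nbytes
        txqueue.append(seq)  # "enqueue for transmission" at the lower layer


def run(nthreads=4, sends_per_thread=500, nbytes=100):
    conn = Connection()
    txqueue = []

    def worker():
        for _ in range(sends_per_thread):
            send(conn, txqueue, nbytes)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return txqueue
```

Because the sequence number is assigned and enqueued under the same lock, the
queue returned by run() is always in increasing sequence order, no matter how
the threads interleave.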

Firewall locks are about protecting the rule state of the firewall from 
corruption when firewall rules are updated, allowing readers to interpret the 
rules using a static snapshot, and writers to avoid mangling the rules via 
simultaneous non-atomic update.  As such, when the firewall code is entered, 
the firewall lock is acquired, and held until the packet has been completely 
processed.  Things get sticky deep in the firewall code because our firewalls 
include credential-aware rules, which essentially "peek up the stack" in order 
to decide what user is associated with a packet before delivery to the 
connection is done.  The firewall rule lock is held over this lookup and 
inspection of TCP-layer state.  In the out-bound path, we pass down the TCP 
state reference (PCB pointer) and guarantee the lock is already held. 
However, in the in-bound direction, the firewall has to do the full lookup and
lock acquisition itself.  That reverses the lock order (firewall lock first,
PCB lock second), and can lead to deadlocks.
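
The two orders can be made concrete with a toy lock order checker.  This is
only a sketch of the idea of recording "A was held while B was acquired"
pairs and flagging the opposite pair; OrderChecker and the "pcb" and
"firewall" labels are names made up for the example, and it only detects
direct two-lock reversals, not longer cycles.

```python
class OrderChecker:
    """Toy checker: remembers (held, acquired) pairs and flags reversals."""

    def __init__(self):
        self.edges = set()     # (held, acquired) pairs seen so far
        self.reversals = []    # pairs seen in the opposite order earlier

    def acquire(self, held, name):
        # `held` is the list of lock names the current path already holds.
        for h in held:
            if (name, h) in self.edges:
                self.reversals.append((h, name))
            self.edges.add((h, name))
        held.append(name)


def outbound(checker):
    # Out-bound path: TCP holds the PCB lock, then enters the firewall.
    held = []
    checker.acquire(held, "pcb")
    checker.acquire(held, "firewall")


def inbound(checker):
    # In-bound path: the firewall lock is held, then a credential-aware
    # rule looks up and locks the PCB.
    held = []
    checker.acquire(held, "firewall")
    checker.acquire(held, "pcb")


def demo():
    checker = OrderChecker()
    outbound(checker)
    inbound(checker)
    return checker.reversals
```

Running demo() reports one reversal, for the in-bound path acquiring the PCB
lock while the firewall lock is held.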

debug.mpsafenet=0 places the Giant lock in front of all network stack lock 
acquisition, which effectively serializes all of the above.  It doesn't remove 
the lock order reversal, but it does eliminate simultaneous lock acquisition, 
removing one of the necessary causes of deadlock.  This trick of using a
serializing "global" lock to keep lock order reversals between "leaf" locks
from deadlocking is not an uncommon technique, but in this case it has a
significant overhead (it forces network processing to be non-parallel), and
needs to be fixed.
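
A minimal sketch of the serializing-lock trick, again in user-space Python
rather than kernel code.  The three lock objects are ordinary
threading.Lock instances named after the locks discussed above; the names
and the counter are assumptions made for the illustration.

```python
import threading

giant = threading.Lock()          # serializing "global" lock
pcb_lock = threading.Lock()       # "leaf" lock
firewall_lock = threading.Lock()  # "leaf" lock


def outbound(counter):
    with giant:                   # taken before any leaf lock
        with pcb_lock:
            with firewall_lock:
                counter[0] += 1


def inbound(counter):
    with giant:                   # same global lock, so paths never overlap
        with firewall_lock:       # reversed leaf order is still present
            with pcb_lock:
                counter[0] += 1


def run(iterations=2000):
    counter = [0]

    def loop(fn):
        for _ in range(iterations):
            fn(counter)

    threads = [
        threading.Thread(target=loop, args=(outbound,)),
        threading.Thread(target=loop, args=(inbound,)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

The leaf locks are still taken in opposite orders, but only one thread at a
time can be inside either path, so the two can never each hold one leaf lock
while waiting for the other.  The cost is visible in the same code: the two
paths never run in parallel.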

The key is to guarantee that the acquisition of the firewall reference will 
never be blocked waiting on a PCB lock -- i.e., that the firewall "lock" isn't 
a lock so much as a reference count that will never have to wait, removing the 
waiting requirement from the deadlock equation.  I know that Julian Elischer 
has been looking at doing this, and others may have also.  The model is 
essentially that you either starve writers to the firewall data, or you create 
a read-only snapshot to be used by readers in the event a writer arrives, 
allowing readers to pick up the new rules if available, or the old rules if 
not, and never wait indefinitely either way.
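
The snapshot variant of that model might look roughly like the following
Python sketch.  It is not the design anyone has proposed for the firewall
code; RuleSet, check_packet() and the (action, port) rule format are
hypothetical, and the sketch leans on two assumptions that do not hold in
the kernel: replacing an object reference is atomic, and the garbage
collector frees the old snapshot once the last reader drops it (a kernel
version would need an explicit reference count for that).

```python
import threading


class RuleSet:
    """Readers use an immutable snapshot; only writers ever take a lock."""

    def __init__(self, rules):
        self._snapshot = tuple(rules)
        self._writer_lock = threading.Lock()  # serializes writers only

    def snapshot(self):
        # No lock here: a reader gets either the old tuple or the new one,
        # never a half-updated rule list, and never waits for a writer.
        return self._snapshot

    def update(self, transform):
        with self._writer_lock:
            # Build the new rules off to the side, then publish them by
            # replacing the reference in a single assignment.
            new_rules = tuple(transform(list(self._snapshot)))
            self._snapshot = new_rules


def check_packet(ruleset, port):
    rules = ruleset.snapshot()    # stays stable for this whole packet
    for action, rule_port in rules:
        if rule_port == port:
            return action
    return "deny"
```

A reader that took its snapshot before an update keeps evaluating the old
rules until it finishes its packet; the next packet picks up the new rules.
Since readers never block, a path that already holds a PCB lock, or one that
is about to acquire one, cannot end up waiting on the firewall side.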

Robert N M Watson
Computer Laboratory
University of Cambridge