From: Robert Watson
Date: Wed, 18 Apr 2007 08:54:00 +0100 (BST)
To: Tillman Hodgson
Cc: current@freebsd.org
Subject: Re: Panic on boot with April 16 src (lengthy info attached)
Message-ID: <20070418084345.H2913@fledge.watson.org>
In-Reply-To: <20070417220339.E2913@fledge.watson.org>
List-Id: Discussions about the use of FreeBSD-current

On Tue, 17 Apr 2007, Robert Watson wrote:

>> I originally put it in there to work around a LOR that I was experiencing
>> (based on you mentioning it in an email to current@ Sun 18 Mar 2007 15:50).
>> http://sources.zabbadoz.net/freebsd/lor/191.html doesn't show any changes
>> to that particular LOR, do you happen to know if there's any ongoing work
>> on this? I'm very willing to act as a test system.
>
> I chatted with Andre about the panic earlier this afternoon, and it sounds
> like the fix is straightforward. I would anticipate seeing it committed in
> the near future.
>
> I'll send out an e-mail explaining the above lock order reversal tomorrow
> morning. I understand that several people have been looking at this, so
> perhaps one of those people will reply talking about it before then. :-)

The essential problem of this lock order reversal has to do with the fact
that higher network stack layers hold locks over lower network stack layers.
For example, the lock for a TCP connection is held over the operation to
enqueue the TCP packet for transmission at a lower layer. This is necessary
in order to maintain TCP transmission order into the transmission queue
between multiple threads operating on the same TCP connection: if the
"transmit and enqueue" operation were not atomic with respect to the same
TCP connection in another thread, quite damaging reordering could take
place. We directly dispatch the entire outbound network stack from that
enqueue point, meaning that the per-TCP connection lock is held over that
processing path, including the firewall. As a result, PCB locks (TCP
connection locks) precede the firewall in the lock order.

Firewall locks are about protecting the rule state of the firewall from
corruption when firewall rules are updated, allowing readers to interpret
the rules using a static snapshot, and writers to avoid mangling the rules
via simultaneous non-atomic update. As such, when the firewall code is
entered, the firewall lock is acquired, and held until the packet has been
completely processed.

Things get sticky deep in the firewall code because our firewalls include
credential-aware rules, which essentially "peek up the stack" in order to
decide what user is associated with a packet before delivery to the
connection is done. The firewall rule lock is held over this lookup and
inspection of TCP-layer state.
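
To make the outbound ordering above concrete, here is a minimal userland
sketch in Python. It is not kernel code, and every name in it (Connection,
Firewall, transmit, run_demo) is hypothetical, invented for illustration.
It assumes one lock per connection standing in for the PCB lock and one
lock standing in for the firewall rule lock, and shows the PCB lock being
held across both the firewall check and the enqueue so that packets from
several threads reach the queue in sequence order.

```python
import threading
from collections import deque


class Firewall:
    """Hypothetical stand-in for a packet filter with a single rule lock."""

    def __init__(self, blocked_ports=()):
        self.lock = threading.Lock()
        self.blocked_ports = set(blocked_ports)

    def check_outbound(self, port):
        # The rule lock is held until the packet has been fully evaluated.
        with self.lock:
            return port not in self.blocked_ports


class Connection:
    """Hypothetical stand-in for a TCP PCB and its per-connection lock."""

    def __init__(self, port):
        self.lock = threading.Lock()
        self.port = port
        self.next_seq = 0


def transmit(conn, firewall, txq):
    # Outbound lock order: PCB lock first, firewall lock second.
    # Holding conn.lock over the whole path makes "assign sequence
    # number, filter, enqueue" atomic with respect to other threads
    # using the same connection.
    with conn.lock:
        seq = conn.next_seq
        conn.next_seq += 1
        if firewall.check_outbound(conn.port):
            txq.append((conn.port, seq))


def run_demo(threads=4, sends_per_thread=500):
    """Send from several threads and return the sequence numbers as queued."""
    conn, firewall, txq = Connection(80), Firewall(), deque()

    def worker():
        for _ in range(sends_per_thread):
            transmit(conn, firewall, txq)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return [seq for _, seq in txq]
```

If the sequence number were assigned under the connection lock but the
enqueue happened after releasing it, two threads could enqueue in the
opposite order from the one in which they took their numbers; that is the
reordering the held lock is there to prevent.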

In the out-bound path, we pass down the TCP state reference (PCB pointer)
and guarantee the lock is already held. However, in the in-bound direction,
the firewall has to do the full lookup and lock acquisition itself while
already holding the firewall lock, which reverses the lock order and can
lead to deadlocks.

debug.mpsafenet=0 places the Giant lock in front of all network stack lock
acquisition, which effectively serializes all of the above. It doesn't
remove the lock order reversal, but it does eliminate simultaneous lock
acquisition, removing one of the necessary conditions for deadlock. This
trick of using a serializing "global" lock to prevent lock order problems
between "leaf" locks is not an uncommon technique, but in this case it has a
significant overhead (it forces network processing to be non-parallel), and
needs to be fixed.

The key is to guarantee that the acquisition of the firewall reference will
never be blocked waiting on a PCB lock -- i.e., that the firewall "lock"
isn't a lock so much as a reference count that will never have to wait,
removing the waiting requirement from the deadlock equation. I know that
Julian Elischer has been looking at doing this, and others may have also.
The model is essentially that you either starve writers to the firewall
data, or you create a read-only snapshot to be used by readers in the event
a writer arrives, allowing readers to pick up the new rules if available, or
the old rules if not, and never wait indefinitely either way.

Robert N M Watson
Computer Laboratory
University of Cambridge
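
The read-only snapshot model described above can be sketched in userland
Python as follows. This is only an illustration under stated assumptions,
not the FreeBSD implementation: RuleSet, Pcb and check_inbound are
hypothetical names, rules are modelled as simple ("deny_uid", uid) tuples,
and the sketch relies on CPython publishing an object reference atomically.
A kernel version would also need reference counting or some other way to
free old snapshots safely, which is left out here.

```python
import threading


class RuleSet:
    """Hypothetical firewall rule table published as immutable snapshots."""

    def __init__(self, rules=()):
        self._snapshot = tuple(rules)         # never mutated once published
        self._writer_lock = threading.Lock()  # serializes writers only

    def snapshot(self):
        # Readers take no lock: they pick up whichever tuple is current,
        # the new rules if a writer has published them, the old ones if not.
        return self._snapshot

    def add_rule(self, rule):
        # Writers build a new tuple and swap the reference; readers that
        # already hold the old tuple keep using it undisturbed.
        with self._writer_lock:
            self._snapshot = self._snapshot + (rule,)


class Pcb:
    """Hypothetical stand-in for a TCP PCB with its lock and owning uid."""

    def __init__(self, uid):
        self.lock = threading.Lock()
        self.uid = uid


def check_inbound(ruleset, pcb):
    """Return True if the inbound packet for this PCB should be accepted."""
    rules = ruleset.snapshot()  # no firewall lock is held from here on
    with pcb.lock:              # may wait, but holds nothing while waiting
        uid = pcb.uid
    return ("deny_uid", uid) not in rules
```

Because check_inbound holds no firewall lock while it waits for the PCB
lock, the inbound path no longer contributes a "firewall before PCB" edge
to the lock order, and the outbound "PCB before firewall" order has nothing
left to conflict with.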