From owner-freebsd-current@FreeBSD.ORG Thu Jul 8 03:47:31 2004
Return-Path:
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id 18DA716A4CE
    for ; Thu, 8 Jul 2004 03:47:31 +0000 (GMT)
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
    by mx1.FreeBSD.org (Postfix) with ESMTP id AE3D543D41
    for ; Thu, 8 Jul 2004 03:47:30 +0000 (GMT)
    (envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (localhost [127.0.0.1])
    by fledge.watson.org (8.12.11/8.12.11) with ESMTP id i683lVqM057999;
    Wed, 7 Jul 2004 23:47:31 -0400 (EDT)
    (envelope-from robert@fledge.watson.org)
Received: from localhost (robert@localhost)i683lUke057996;
    Wed, 7 Jul 2004 23:47:31 -0400 (EDT)
    (envelope-from robert@fledge.watson.org)
Date: Wed, 7 Jul 2004 23:47:30 -0400 (EDT)
From: Robert Watson
X-Sender: robert@fledge.watson.org
To: Wiktor Niesiobedzki
In-Reply-To: <20040707214417.GF26768@mail.evip.pl>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: "Bjoern A. Zeeb"
cc: current@freebsd.org
Subject: Re: LORs with ipfw
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 08 Jul 2004 03:47:31 -0000

On Wed, 7 Jul 2004, Wiktor Niesiobedzki wrote:

> lock order reversal
> 1st 0xc07287c8 IPFW static rules (IPFW static rules) @ /usr/src/sys/netinet/ip_fw2.c:1828
> 2nd 0xc065cfcc tcp (tcp) @ /usr/src/sys/netinet/ip_fw2.c:1574
> Stack backtrace:
> backtrace(c05ec5a7,c065cfcc,c05ec12e,c05ec12e,c0726a3c) at backtrace+0x17
> witness_checkorder(c065cfcc,9,c0726a3c,626,806) at witness_checkorder+0x678
> _mtx_lock_flags(c065cfcc,0,c0726a3c,626,0) at _mtx_lock_flags+0x80
> check_uidgid(c15610a4,6,0,e08d1f53,1bd) at check_uidgid+0xd3
> ipfw_chk(cb9b6bf4,cb9b6c48,c1189014,1,0) at ipfw_chk+0x9e2
> ip_input(c1395c00,0,c071c576,1d0,0) at ip_input+0x375
> transmit_event(c1510c00,0,c071c576,300,2) at transmit_event+0x14b
> dummynet(0,0,c05ea27a,f6,1) at dummynet+0x1a9
> softclock(0,0,c05e6b67,263,c0631d40) at softclock+0x1aa
> ithread_loop(c10dd500,cb9b6d48,c05e695e,327,c10dd500) at ithread_loop+0x172
> fork_exit(c04a5b80,c10dd500,cb9b6d48) at fork_exit+0xbc
> fork_trampoline() at fork_trampoline+0x8
>
> This is from yesterdays CURRENT. I have compiled kernel with
> CPUTYPE=athlon-xp and CFLAGS=-O2. Currently I'm not able to reproduce
> this messages with CPUTYPE=i686 and empty CFLAGS.
>
> Does anyone has an clue, where the problem may lie here (or is it just
> harmless?)

This is a warning about a potentially harmful, but somewhat harder to
fix issue. Basically, we currently have what amounts to a subsystem or
giant lock over the ipfw rule set and its evaluation. Normally, the ipfw
lock will fall "after" most other locks, including protocol control block
(pcb) locks, as it will be called from other protocol code during
processing.
However, when using a uid/gid rule, the protocol control block for
the packet is looked up by the ipfw code, which acquires pcb locks after
the ipfw lock. There are a few things to think about here:

(1) This lock order reversal is really a result of a layering violation
    -- the ipfw code is acting on packets at the IP layer, and looking up
    the connection from the IP layer results in cross-layer transitions
    that don't fit the general model.

(2) The lock order reversal occurs in a situation where a race condition
    also occurs -- the pcb may actually be looked up twice for inbound
    packets, once in ipfw, and then again for delivery. While it's
    somewhat unlikely, the pcb could change in that window. The window
    is stretched out through the use of functionality like dummynet.

(3) One way to think about fixing this is to avoid the need to hold the
    ipfw lock across the entire execution of ipfw. I've been thinking
    about reference-counting the rule set, such that each instance of a
    thread entering the ipfw code sees the rule set as read-only and can
    access it lock-free once it has acquired a reference, releasing the
    reference on exit. For long rule sets, this would help reduce
    contention. You can imagine various variations on the model, such
    as per-cpu rule set instances, etc. There are some interesting
    challenges in dynamic state management, however.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Principal Research Scientist, McAfee Research
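
As a rough illustration of the reference-counting idea in point (3), here
is a minimal user-space sketch. It is not ipfw code: it uses POSIX threads
rather than the kernel mutex API, and every name in it (struct ruleset,
ruleset_acquire, ruleset_release, ruleset_replace, check_packet) is
hypothetical. It assumes the rule set is an immutable snapshot that is
swapped as a whole, and it ignores dynamic state entirely. A short mutex
protects only the current pointer and the reference counts; it is never
held while rules are evaluated, so a lookup that takes another lock during
evaluation cannot reverse order against it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_RULES 8

/* Hypothetical immutable snapshot of a rule set. */
struct ruleset {
    int refs;               /* protected by rs_mtx */
    int nrules;
    int rules[MAX_RULES];   /* stand-in for real rule structures */
};

static pthread_mutex_t rs_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct ruleset *rs_current;  /* protected by rs_mtx */

/* Build a new snapshot; returns NULL if it does not fit or malloc fails. */
static struct ruleset *ruleset_new(const int *rules, int n)
{
    if (n < 0 || n > MAX_RULES)
        return NULL;
    struct ruleset *rs = malloc(sizeof(*rs));
    if (rs == NULL)
        return NULL;
    rs->refs = 0;
    rs->nrules = n;
    for (int i = 0; i < n; i++)
        rs->rules[i] = rules[i];
    return rs;
}

/* Take a reference on the current snapshot; the mutex is held only briefly. */
static struct ruleset *ruleset_acquire(void)
{
    pthread_mutex_lock(&rs_mtx);
    struct ruleset *rs = rs_current;
    assert(rs != NULL);     /* a rule set must have been installed */
    rs->refs++;
    pthread_mutex_unlock(&rs_mtx);
    return rs;
}

/* Drop a reference; the last holder frees the snapshot. */
static void ruleset_release(struct ruleset *rs)
{
    pthread_mutex_lock(&rs_mtx);
    assert(rs->refs > 0);
    int last = (--rs->refs == 0);
    pthread_mutex_unlock(&rs_mtx);
    if (last)
        free(rs);
}

/* Install a new snapshot; readers of the old one keep it alive. */
static void ruleset_replace(struct ruleset *new_rs)
{
    pthread_mutex_lock(&rs_mtx);
    struct ruleset *old = rs_current;
    new_rs->refs++;         /* reference held by rs_current */
    rs_current = new_rs;
    pthread_mutex_unlock(&rs_mtx);
    if (old != NULL)
        ruleset_release(old);
}

/* Evaluate a "packet" against the rules without holding rs_mtx. */
static int check_packet(int value)
{
    struct ruleset *rs = ruleset_acquire();
    int match = 0;
    for (int i = 0; i < rs->nrules; i++) {
        /* Other locks (for example a pcb lock) could be taken here. */
        if (rs->rules[i] == value) {
            match = 1;
            break;
        }
    }
    ruleset_release(rs);
    return match;
}
```

A caller would install a first rule set with ruleset_replace(ruleset_new(...))
and then call check_packet() from any number of threads. In this sketch,
acquiring the reference still takes a short mutex; only the evaluation
itself runs without a lock, and a per-CPU or atomic variant would be
needed to remove that remaining contention point.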