From: Max Laier
Organization: FreeBSD
To: Alex Popa
Cc: Attilio Rao, freebsd-stable@freebsd.org, Robert Watson, John Baldwin
Date: Tue, 25 Mar 2008 21:13:46 +0100
Subject: Re: Lock Order Reversal on 7.0-STABLE with pf and ipfw / dummynet (traces)
Message-Id: <200803252113.47259.max@love2party.net>
In-Reply-To: <20080325192113.GA61579@dataxnet.ro>
References: <20080314192359.GA4677@dataxnet.ro> <200803221655.28975.max@love2party.net> <20080325192113.GA61579@dataxnet.ro>

Hi Alex,

so it's
basically back to square one. We only have LORs between the pfil
R/W lock (read instance) and mutexes that don't have any lock order with
the pfil R/W lock (write instance) at all. This means the deadlock can't
be explained by the LORs that are reported (unless there is something I'm
missing). Unless somebody who is seeing this kind of deadlock can
actually break into a debugger to identify the locks at play, everything
else is just speculation.

I will fix the fastroute LOR with the patch you have been testing,
even though it didn't fix your problem. For the remaining issue, we need
more IPFW or lock primitives knowledge (extending CC-list).

Note that the first LOR features a recursive pickup of the pfil R/W lock.
I remember that Attilio committed a patch to forbid this for CURRENT.
Could this be the cause of a deadlock? Would it make sense to MFC
rm_locks and see whether they hold up under this scenario?

On Tuesday 25 March 2008 20:21:15 Alex Popa wrote:
[...]
> Hello.
>
> I have tested the patch, booted with a WITNESS kernel including that
> patch, and it has locked up (solid again, no numlock or console
> changing, no control-alt-esc to debugger) after about 41 minutes
> (timestamps in /var/log/all.log go from 19:12:57 to 19:53:40).
>
> I did get two LOR reports in dmesg, they are attached.
> lock order reversal:
>  1st 0xffffffff8096ebc8 PFil hook read/write mutex (PFil hook read/write
>  mutex) @ /usr/src/sys/net/pfil.c:73
>  2nd 0xffffffff8096f8e8 udp (udp) @ /usr/src/sys/netinet/udp_usrreq.c:385

This one could be avoided if dummynet_send were to use a queue instead of
direct dispatch - with all associated problems.

> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> witness_checkorder() at witness_checkorder+0x539
> _mtx_lock_flags() at _mtx_lock_flags+0x1f
> udp_input() at udp_input+0x1f7
> ip_input() at ip_input+0xa7
>>> QUEUE here <<<
> dummynet_send() at dummynet_send+0xde
> dummynet_io() at dummynet_io+0x587
> ipfw_check_in() at ipfw_check_in+0x241
> pfil_run_hooks() at pfil_run_hooks+0xac
> ip_input() at ip_input+0x292
> ether_demux() at ether_demux+0x1ac
> ether_input() at ether_input+0x1bf
> em_handle_rxtx() at em_handle_rxtx+0x1d2
> taskqueue_run() at taskqueue_run+0x95
> taskqueue_thread_loop() at taskqueue_thread_loop+0x53
> fork_exit() at fork_exit+0x112
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffffffa057ad30, rbp = 0 ---
>
> lock order reversal:
>  1st 0xffffff00018ff690 inp (rawinp) @ /usr/src/sys/netinet/raw_ip.c:281
>  2nd 0xffffffff8096ebc8 PFil hook read/write mutex (PFil hook read/write
>  mutex) @ /usr/src/sys/net/pfil.c:73

I'm still a bit suspicious of this one due to the interaction of exclusive
IPFW locks with raw sockets, but I still can't find the code path for the
original order, which makes it a bit hard to follow. Alex, can you get
a "show witness" after this LOR has been displayed?

> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> witness_checkorder() at witness_checkorder+0x539
> _rw_rlock() at _rw_rlock+0x25
> pfil_run_hooks() at pfil_run_hooks+0x44
> ip_output() at ip_output+0x35a
> rip_output() at rip_output+0x1eb
> sosend_generic() at sosend_generic+0x289
> kern_sendit() at kern_sendit+0x122
> sendit() at sendit+0xc6
> sendto() at sendto+0x4d
> syscall() at syscall+0x1b5
> Xfast_syscall() at Xfast_syscall+0xab
> --- syscall (133, FreeBSD ELF64, sendto), rip = 0x80091132c, rsp =
> 0x7ffffffee6e8, rbp = 0x40 ---

Any input greatly appreciated!

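To illustrate the queue-versus-direct-dispatch remark on the first LOR
above, here is a minimal userspace sketch in Python. It is not kernel
code and not the dummynet implementation: HookLock, protocol_input,
send_direct, send_queued, drain and run_hooks are hypothetical names, and
a plain mutex stands in for the pfil R/W lock. It only shows that with
direct dispatch the protocol input path runs while the hook lock is still
held, whereas a queue defers it until the lock has been dropped.

```python
import threading
from collections import deque


class HookLock:
    """Toy stand-in for the hook lock; remembers whether it is held."""

    def __init__(self):
        self._lock = threading.Lock()
        self.held = False

    def __enter__(self):
        self._lock.acquire()
        self.held = True
        return self

    def __exit__(self, *exc):
        self.held = False
        self._lock.release()


hook_lock = HookLock()
pending = deque()   # packets deferred until the hook lock is released
log = []            # (packet, was the hook lock held during input?)


def protocol_input(pkt):
    # Stands in for the input path that goes on to take further locks.
    log.append((pkt, hook_lock.held))


def send_direct(pkt):
    # Direct dispatch: re-enter the input path immediately.
    protocol_input(pkt)


def send_queued(pkt):
    # Queued dispatch: only record the packet for later processing.
    pending.append(pkt)


def drain():
    # Runs outside the hook lock, e.g. from a separate worker context.
    while pending:
        protocol_input(pending.popleft())


def run_hooks(pkt, send):
    # The hook (and whatever it dispatches) runs under the hook lock.
    with hook_lock:
        send(pkt)


run_hooks("a", send_direct)
run_hooks("b", send_queued)
drain()
print(log)
```

The cost hinted at by "all associated problems" is visible even in the
toy: something has to call drain(), and packets now sit in a queue that
needs its own limits and ordering guarantees.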
-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News
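
As background for the reports quoted in this message: a lock order
reversal is flagged when a thread acquires two locks in the opposite
order to one that was seen earlier. The toy checker below sketches that
idea in Python. It is an assumption-laden simplification, not how WITNESS
is implemented: OrderChecker and the lock names "A" and "B" are
hypothetical, only direct pairs are tracked (no transitive ordering), and
there is a single list of held locks rather than one per thread.

```python
class OrderChecker:
    """Records observed lock orders and reports direct reversals."""

    def __init__(self):
        self.before = set()   # (first, second) pairs seen so far
        self.held = []        # currently held locks, in acquisition order
        self.reports = []     # (1st, 2nd) pairs that reversed an earlier order

    def acquire(self, name):
        for h in self.held:
            if h == name:
                continue  # recursion on the same lock is not modelled here
            if (name, h) in self.before:
                # Earlier we saw name -> h; now we are doing h -> name.
                self.reports.append((h, name))
            self.before.add((h, name))
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)


w = OrderChecker()

# First code path establishes the order A -> B.
w.acquire("A")
w.acquire("B")
w.release("B")
w.release("A")

# Second code path takes them the other way round.
w.acquire("B")
w.acquire("A")

print(w.reports)
```

A report of this kind says that two code paths disagree about ordering;
as the message above points out, it does not by itself show that those
two paths are the ones involved in an observed deadlock.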