Date: Tue, 25 Mar 2008 21:13:46 +0100
From: Max Laier <max@love2party.net>
To: Alex Popa <razor@dataxnet.ro>
Cc: Attilio Rao <attilio@freebsd.org>, freebsd-stable@freebsd.org, Robert Watson <rwatson@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject: Re: Lock Order Reversal on 7.0-STABLE with pf and ipfw / dummynet (traces)
Message-ID: <200803252113.47259.max@love2party.net>
In-Reply-To: <20080325192113.GA61579@dataxnet.ro>
References: <20080314192359.GA4677@dataxnet.ro> <200803221655.28975.max@love2party.net> <20080325192113.GA61579@dataxnet.ro>
Hi Alex,

so it's basically back to square one. We only have LORs between the pfil
R/W lock (read instance) and mutexes that don't have any lock order with
the pfil R/W lock (write instance) at all. This means the deadlock can't
be explained by the LORs that are reported (unless there is something I'm
missing). Unless somebody who is seeing this kind of deadlock can
actually break into a debugger to identify the locks at play, everything
else is just speculation.

I will fix the fastroute LOR with the patch you have been testing,
even though it didn't fix your problem. For the remaining issue, we need
more IPFW or lock primitives knowledge (extending CC-list).

Note that the first LOR features a recursive pickup of the pfil R/W lock.
I remember that Attilio committed a patch to forbid this for CURRENT.
Could this be the cause of a deadlock? Would it make sense to MFC
rm_locks and see whether they hold up under this scenario?

On Tuesday 25 March 2008 20:21:15 Alex Popa wrote:
[...]
> Hello.
>
> I have tested the patch, booted with a WITNESS kernel including that
> patch, and it has locked up (solid again, no numlock or console
> changing, no control-alt-esc to debugger) after about 41 minutes
> (timestamps in /var/log/all.log go from 19:12:57 to 19:53:40).
>
> I did get two LOR reports in dmesg, they are attached.
> lock order reversal:
>  1st 0xffffffff8096ebc8 PFil hook read/write mutex (PFil hook read/write
>  mutex) @ /usr/src/sys/net/pfil.c:73
>  2nd 0xffffffff8096f8e8 udp (udp) @ /usr/src/sys/netinet/udp_usrreq.c:385

This one could be avoided if dummynet_send were to use a queue instead of
direct dispatch - with all associated problems.
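To make the queue idea concrete, here is a minimal userland sketch, not
dummynet code. Every name in it (struct pkt, run_hooks, send_deferred,
drain_queue, proto_input) is hypothetical, and the hook lock is modelled
by a plain counter. It only shows the shape of the change: while the hook
lock is held the packet is appended to a queue, and the protocol input
path runs later from a context that holds no hook lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are FreeBSD kernel interfaces. */
struct pkt {
    int id;
    struct pkt *next;
};

static int hook_lock_depth;     /* models read holds on the hook lock */
static int max_depth_in_input;  /* deepest hold seen inside proto_input() */
static int delivered;           /* packets handed to proto_input() */

static struct pkt *q_head, **q_tail = &q_head;

/* Models the protocol input path, which would take further locks. */
void proto_input(struct pkt *p)
{
    assert(p != NULL);
    if (hook_lock_depth > max_depth_in_input)
        max_depth_in_input = hook_lock_depth;
    delivered++;
}

/* Deferred variant: only append to the queue while the lock is held. */
void send_deferred(struct pkt *p)
{
    assert(p != NULL);
    p->next = NULL;
    *q_tail = p;
    q_tail = &p->next;
}

/* Runs later, from a context that holds no hook lock. */
void drain_queue(void)
{
    struct pkt *p;

    while ((p = q_head) != NULL) {
        q_head = p->next;
        if (q_head == NULL)
            q_tail = &q_head;
        proto_input(p);
    }
}

/* Models running a hook with the lock read-held around the callback. */
void run_hooks(struct pkt *p, void (*send)(struct pkt *))
{
    hook_lock_depth++;
    send(p);
    hook_lock_depth--;
}
```

Calling run_hooks(&p, proto_input) models direct dispatch and reaches
proto_input() with the lock held. Calling run_hooks(&p, send_deferred)
followed by drain_queue() reaches it with no hold. The price is that
something has to drain the queue later.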
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> witness_checkorder() at witness_checkorder+0x539
> _mtx_lock_flags() at _mtx_lock_flags+0x1f
> udp_input() at udp_input+0x1f7
> ip_input() at ip_input+0xa7

>>> QUEUE here <<<

> dummynet_send() at dummynet_send+0xde
> dummynet_io() at dummynet_io+0x587
> ipfw_check_in() at ipfw_check_in+0x241
> pfil_run_hooks() at pfil_run_hooks+0xac
> ip_input() at ip_input+0x292
> ether_demux() at ether_demux+0x1ac
> ether_input() at ether_input+0x1bf
> em_handle_rxtx() at em_handle_rxtx+0x1d2
> taskqueue_run() at taskqueue_run+0x95
> taskqueue_thread_loop() at taskqueue_thread_loop+0x53
> fork_exit() at fork_exit+0x112
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffffffa057ad30, rbp = 0 ---
>
> lock order reversal:
>  1st 0xffffff00018ff690 inp (rawinp) @ /usr/src/sys/netinet/raw_ip.c:281
>  2nd 0xffffffff8096ebc8 PFil hook read/write mutex (PFil hook read/write
>  mutex) @ /usr/src/sys/net/pfil.c:73

I'm still a bit suspicious of this one due to the interaction of exclusive
IPFW locks with raw sockets, but I still can't find the code path for the
original order, which makes it a bit hard to follow. Alex, can you get a
"show witness" after this LOR has been displayed?

> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> witness_checkorder() at witness_checkorder+0x539
> _rw_rlock() at _rw_rlock+0x25
> pfil_run_hooks() at pfil_run_hooks+0x44
> ip_output() at ip_output+0x35a
> rip_output() at rip_output+0x1eb
> sosend_generic() at sosend_generic+0x289
> kern_sendit() at kern_sendit+0x122
> sendit() at sendit+0xc6
> sendto() at sendto+0x4d
> syscall() at syscall+0x1b5
> Xfast_syscall() at Xfast_syscall+0xab
> --- syscall (133, FreeBSD ELF64, sendto), rip = 0x80091132c, rsp =
> 0x7ffffffee6e8, rbp = 0x40 ---

Any input greatly appreciated!
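For anyone following along, the recursive read-lock question raised
earlier in this message can be illustrated with a toy model. This is a
sketch under a stated assumption: the lock prefers writers, so a read
request blocks once a writer is queued. Whether the pfil R/W lock on
7.0-STABLE behaves this way in the path above is exactly what is still
unknown. All names are hypothetical, and blocking is modelled by a
"try" call returning false, so nothing here really hangs.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy, single-threaded model of a writer-preferring read/write lock.
 * All names are hypothetical; this is not the FreeBSD rwlock code.
 */
struct toy_rwlock {
    int  readers;         /* number of read holds currently granted */
    bool writer_waiting;  /* a writer has queued behind the readers */
};

/* A read request is granted only while no writer is queued. */
bool toy_try_rlock(struct toy_rwlock *l)
{
    if (l->writer_waiting)
        return false;     /* would block behind the writer */
    l->readers++;
    return true;
}

/* A write request is granted only when no read holds remain. */
bool toy_try_wlock(struct toy_rwlock *l)
{
    if (l->readers > 0) {
        l->writer_waiting = true;  /* queue up; new readers now block */
        return false;
    }
    l->writer_waiting = false;
    return true;
}

/* Drop one read hold. */
void toy_runlock(struct toy_rwlock *l)
{
    assert(l->readers > 0);
    l->readers--;
}

/*
 * Replays the suspected interleaving and returns true if both sides
 * end up waiting on each other:
 *   1. thread A takes the read lock,
 *   2. thread B asks for the write lock and has to wait,
 *   3. thread A asks for the read lock again while still holding it.
 */
bool toy_recursion_deadlocks(void)
{
    struct toy_rwlock l = { 0, false };

    bool outer  = toy_try_rlock(&l);
    bool writer = toy_try_wlock(&l);
    bool inner  = toy_try_rlock(&l);

    /* A cannot drop its outer hold until the inner request returns. */
    return outer && !writer && !inner && l.readers == 1;
}
```

In this model the second read request is refused because a writer is
queued, while the writer in turn waits for the first read hold to go
away. Without a queued writer, the recursive read request succeeds.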
-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News