From: Doug Ambrisko
Message-Id: <200709281659.l8SGxbBv072053@ambrisko.com>
To: Ivan Voras
Date: Fri, 28 Sep 2007 09:59:37 -0700 (PDT)
Cc: freebsd-net@freebsd.org
Subject: Re: Panic in rt_check

Ivan Voras writes:
| Hi,
|
| I have a machine that panics almost daily in route.c, in rt_check().
| This panic has been reported by several users, including Marcel
| Moolenaar for a machine in freebsd.org.
|
| The problem is present in both 6-STABLE and 7-CURRENT, and apparently it
| manifests on SMP machines, both i386 and AMD64.
|
| The panic backtrace looks like this:
|
| panic: mtx_lock() of destroyed mutex @ /usr/src/sys/net/route.c:1305
| cpuid = 1
| KDB: stack backtrace:
| db_trace_self_wrapper(c091bcf0,e38b690c,c0659fc1,c093f3cf,1,...) at
| db_trace_self_wrapper+0x26
| kdb_backtrace(c093f3cf,1,c0917de2,e38b6918,1,...) at kdb_backtrace+0x29
| panic(c0917de2,c0925d40,519,0,0,...) at panic+0x111
| _mtx_lock_flags(c5d333a8,0,c0925d40,519,0,...) at _mtx_lock_flags+0x59
| rt_check(e38b6970,e38b698c,c55b7d10,0,0,...) at rt_check+0x11e
| arpresolve(c4e27000,c5d33d98,c50dbe00,c55b7d10,e38b69a6,...) at
| arpresolve+0xaf
| ether_output(c4e27000,c50dbe00,c55b7d10,c5d33d98,ccf8b348,...) at
| ether_output+0x7e
| ip_output(c50dbe00,0,e38b6a1c,0,0,...) at ip_output+0xa09
| tcp_output(ccefbac8,0,c0929785,91d,0,...) at tcp_output+0x1463
| tcp_do_segment(ccefbac8,28,0,1dd,901f,...) at tcp_do_segment+0x1c97
| tcp_input(c6095100,14,c4ea3c00,1,0,...) at tcp_input+0xd5e
| ip_input(c6095100,0,c09258bd,8c,c09efc38,...) at ip_input+0x662
| netisr_processqueue(e38b6cc4,c064df85,c09eb940,1,c4d03480,...) at
| netisr_processqueue+0x98
| swi_net(0,0,c0915aee,471,c4d0bd64,...) at swi_net+0xdb
| ithread_loop(c4d0c270,e38b6d38,c0915862,315,c4d56558,...) at
| ithread_loop+0x1c5
| fork_exit(c063e2d0,c4d0c270,e38b6d38) at fork_exit+0xc5
| fork_trampoline() at fork_trampoline+0x8
|
| ...
|
| #0 doadump () at pcpu.h:195
| 195 pcpu.h: No such file or directory.
| in pcpu.h
| (kgdb) bt
| #0 doadump () at pcpu.h:195
| #1 0xc0659d2c in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
| #2 0xc0659ff0 in panic (fmt=Variable "fmt" is not available.
| ) at /usr/src/sys/kern/kern_shutdown.c:563
| #3 0xc064e699 in _mtx_lock_flags (m=0x0, opts=0, file=0xc0925d40
| "/usr/src/sys/net/route.c", line=1305)
| at /usr/src/sys/kern/kern_mutex.c:178
| #4 0xc06fe28e in rt_check (lrt=0xe38b6970, lrt0=0xe38b698c,
| dst=0xc55b7d10) at /usr/src/sys/net/route.c:1305
| #5 0xc070282f in arpresolve (ifp=0xc4e27000, rt0=0xc5d33d98,
| m=0xc50dbe00, dst=0xc55b7d10, desten=0xe38b69a6 "")
| at /usr/src/sys/netinet/if_ether.c:373
| #6 0xc06f019e in ether_output (ifp=0xc4e27000, m=0xc50dbe00,
| dst=0xc55b7d10, rt0=0xc5d33d98) at /usr/src/sys/net/if_ethersubr.c:175
| #7 0xc07127a9 in ip_output (m=0xc50dbe00, opt=0x0, ro=0xe38b6a1c,
| flags=Variable "flags" is not available.
| ) at /usr/src/sys/netinet/ip_output.c:547
| #8 0xc076d6e3 in tcp_output (tp=0xccefbac8) at
| /usr/src/sys/netinet/tcp_output.c:1125
| #9 0xc076ab87 in tcp_do_segment (m=0xc6095100, th=0xc6095158,
| so=0xccdb67bc, tp=0xccefbac8, drop_hdrlen=40, tlen=0)
| at /usr/src/sys/netinet/tcp_input.c:2345
| #10 0xc076bb0e in tcp_input (m=0xc6095100, off0=20) at
| /usr/src/sys/netinet/tcp_input.c:843
| #11 0xc0710c42 in ip_input (m=0xc6095100) at
| /usr/src/sys/netinet/ip_input.c:663
| #12 0xc06f9148 in netisr_processqueue (ni=0xc09efc38) at
| /usr/src/sys/net/netisr.c:143
| #13 0xc06f925b in swi_net (dummy=0x0) at /usr/src/sys/net/netisr.c:256
| #14 0xc063e495 in ithread_loop (arg=0xc4d0c270) at
| /usr/src/sys/kern/kern_intr.c:1036
| #15 0xc063b845 in fork_exit (callout=0xc063e2d0 <ithread_loop>,
| arg=0xc4d0c270, frame=0xe38b6d38) at /usr/src/sys/kern/kern_fork.c:797
| #16 0xc0896f80 in fork_trampoline () at
| /usr/src/sys/i386/i386/exception.s:205
|
| I've been trying to solve this with Craig Rodrigues, and I've tried
| several patches, without success.
| The backtrace above happens on the
| following code from net/route.c:
|
| 1299     /* XXX BSD/OS checks dst->sa_family != AF_NS */
| 1300     if (rt->rt_flags & RTF_GATEWAY) {
| 1301         struct rtentry *temp_rt_gwroute = rt->rt_gwroute;
| 1302         if (temp_rt_gwroute == NULL)
| 1303             goto lookup;
| 1304         rt = rt->rt_gwroute;
| 1305         RT_LOCK(rt); /* NB: gwroute */
| 1306         if(rt0->rt_flags & 0x80000000U){
| 1307             /*This rt is under process...*/
| 1308             RT_UNLOCK(rt);
| 1309             RT_UNLOCK(rt0);
| 1310             goto try_again;
| 1311         }
| 1312         if ((rt->rt_flags & RTF_UP) == 0) {
| 1313             rt0->rt_flags |= 0x80000000U;
| 1314             RTFREE_LOCKED(rt); /* unlock gwroute */
| 1315             rt = rt0;
| 1316         lookup:
| 1317             RT_UNLOCK(rt0);
| 1318             rt = rtalloc1(rt->rt_gateway, 1, 0UL);
| 1319             if (rt == rt0) {
| 1320                 rt0->rt_gwroute = NULL;
| 1321                 RT_REMREF(rt0);
| 1322                 RT_UNLOCK(rt0);
| 1323                 return (ENETUNREACH);
| 1324             }
| 1325             RT_LOCK(rt0);
| 1326             rt0->rt_gwroute = rt;
| 1327             rt0->rt_flags &= (~0x80000000U);
| 1328             if (rt == NULL) {
| 1329                 RT_UNLOCK(rt0);
| 1330                 return (EHOSTUNREACH);
| 1331             }
| 1332         }
| 1333         RT_UNLOCK(rt0);
| 1334     }
|
| This code contains several patches we tried for workarounds, without any
| success. The panic is always in RT_LOCK(rt) line: sometimes it's NULL
| pointer reference, sometimes it's an operation on destroyed mutex.
|
| This is a critical problem for me, but I believe it's also critical for
| other users.
|
| Does anyone have more ideas about how to solve this problem?

Something along the lines of:

Index: sys/net/route.c
===================================================================
RCS file: /usr/local/cvsroot/freebsd/src/sys/net/route.c,v
retrieving revision 1.109.2.3
diff -u -p -r1.109.2.3 route.c
--- sys/net/route.c	25 Feb 2007 05:36:25 -0000	1.109.2.3
+++ sys/net/route.c	27 Sep 2007 02:03:05 -0000
@@ -615,7 +615,8 @@ rtexpunge(struct rtentry *rt)
 	 * we held its last reference.
 	 */
 	if (rt->rt_gwroute) {
-		RTFREE(rt->rt_gwroute);
+		if (rt->rt_gwroute->rt_refcnt)
+			RTFREE(rt->rt_gwroute);
 		rt->rt_gwroute = NULL;
 	}
 
@@ -701,7 +702,8 @@ rtrequest1(int req, struct rt_addrinfo *
 		 * we held its last reference.
 		 */
 		if (rt->rt_gwroute) {
-			RTFREE(rt->rt_gwroute);
+			if (rt->rt_gwroute->rt_refcnt)
+				RTFREE(rt->rt_gwroute);
 			rt->rt_gwroute = NULL;
 		}
 
@@ -822,9 +824,11 @@ rtrequest1(int req, struct rt_addrinfo *
 		 */
 		if (rn == NULL) {
 			if (rt->rt_gwroute)
-				RTFREE(rt->rt_gwroute);
+				if (rt->rt_gwroute->rt_refcnt)
+					RTFREE(rt->rt_gwroute);
 			if (rt->rt_ifa)
-				IFAFREE(rt->rt_ifa);
+				if (rt->rt_ifa->ifa_refcnt)
+					IFAFREE(rt->rt_ifa);
 			Free(rt_key(rt));
 			RT_LOCK_DESTROY(rt);
 			uma_zfree(rtzone, rt);
@@ -1039,7 +1043,8 @@ rt_setgate(struct rtentry *rt, struct so
 			if (rt->rt_gwroute == gwrt) {
 				RT_REMREF(rt->rt_gwroute);
 			} else
-				RTFREE(rt->rt_gwroute);
+				if (rt->rt_gwroute->rt_refcnt)
+					RTFREE(rt->rt_gwroute);
 		}
 
 		if ((rt->rt_gwroute = gwrt) != NULL)

might help. The problem here was a stale gateway route going away in
flight. You might try to check the refcnt of the route. This is common
to -current and -stable. In -stable you can "fix" it by turning off
mpsafenet (the debug.mpsafenet tunable). Your panic looks different
than this, but it might raise some more questions that could lead to a
solution. I'd be looking at rt_gwroute->rt_refcnt.

Note that I did get a panic like yours before, until I settled on the
above patch for another issue. Then that problem and my others didn't
occur any more (well, in a 6.1 I had to merge in jhb's bpf race fix).
So you might want to revert the other patches and try just this one.

You should be able to poke around the route structure via kgdb. On a
cool note, I was using kgdb over IPMI serial over LAN to the remote host
and I could "flip" between various remote hosts :-)

Doug A.
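
For readers following the patch, here is a minimal user-space sketch of
the guard it adds around RTFREE(rt->rt_gwroute). The names model_rt,
model_rtfree and model_drop_gwroute are hypothetical stand-ins invented
for illustration, not kernel APIs, and the reference counting is
simplified and done without any locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct rtentry (not the kernel type). */
struct model_rt {
    long refcnt;               /* models rt_refcnt */
    int freed;                 /* set instead of really freeing, so it can be inspected */
    struct model_rt *gwroute;  /* models rt_gwroute */
};

/* Simplified model of RTFREE: drop one reference, mark freed on the last one. */
static void model_rtfree(struct model_rt *rt)
{
    assert(rt->refcnt > 0);    /* releasing an entry with no references is a bug */
    if (--rt->refcnt == 0)
        rt->freed = 1;
}

/*
 * Models the patched hunk: release the gateway route only if it still
 * holds references, then always clear the cached pointer.
 * Returns 1 if a reference was released, 0 if the release was skipped.
 */
static int model_drop_gwroute(struct model_rt *rt)
{
    int released = 0;

    if (rt->gwroute != NULL) {
        if (rt->gwroute->refcnt > 0) {
            model_rtfree(rt->gwroute);
            released = 1;
        }
        rt->gwroute = NULL;
    }
    return released;
}
```

Note that the guard still reads the reference count through a pointer
that may already be stale, so this models a workaround of the kind the
patch tries, not a fix for the underlying race.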