Date: Tue, 16 Oct 2007 11:31:23 -0500 From: Dan Nelson <dnelson@allantgroup.com> To: Kris Kennaway <kris@freebsd.org> Cc: freebsd-net@freebsd.org, freebsd-current@freebsd.org, Ivan Voras <ivoras@freebsd.org> Subject: Re: Panic in arpresolve->rt_check? Message-ID: <20071016163123.GA22599@dan.emsphone.com> In-Reply-To: <470F9211.8010604@FreeBSD.org> References: <fc89en$fc2$1@sea.gmane.org> <20070912172752.GA13960@dan.emsphone.com> <fc9ce3$jps$1@sea.gmane.org> <200710101615.48760.jhb@freebsd.org> <20071012143800.GA76105@dan.emsphone.com> <470F9211.8010604@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
--fUYQa+Pmc3FrFX/N
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In the last episode (Oct 12), Kris Kennaway said:
> Dan Nelson wrote:
>> In the last episode (Oct 10), John Baldwin said:
>>> On Wednesday 12 September 2007 02:50:37 pm Ivan Voras wrote:
>>>> Dan Nelson wrote:
>>>>> The same panic was also reported for 6.2 via PR 107865 and PR
>>>>> 112490. 112490 included a workaround patch (I haven't tried it;
>>>>> just found it).
>>>>
>>>> The proposed patch in kern/112490 looks trivial but someone who
>>>> knows more about net locking should check it out. Unfortunately it
>>>> lacks context and I don't know the code in question to apply it
>>>> safely on a production machine :(
>>>
>>> I also get panics with what appears to be a double free of
>>> rt_gwroute in rtexpunge(), so I think while this PR may help some
>>> with figuring out the problem, I'm not sure it solves the root bug.
>>>
>>> Hmm, possibly try this patch:
>>
>> This makes the panics more frequent on my machine, actually :)
>
> Since you can reproduce this frequently the best thing might be to
> instrument all the route handling with KTR so that you can do
> post-mortem and try to figure out where the double-free or missing
> reference happened.
I've added some KTR debugging (why are the macros named CTR* instead of
KTR* ? ) and I think I've got the problem nailed down, but I don't know
anything about networking so I don't know what the solution is.
I've attached a KTR dump and the debugging patches I made (done to
preserve line numbering at the expense of style).
It looks like two threads are entering rt_check at the same time. In
the ktrdump, lrt0 is 0xc674d000 and lrt0->rt_gwroute is 0xc674ae88 for
both threads.
The thread on CPU1 locks lrt0 at line :1287 (ktr index 641), then locks
lrt0->rt_gwroute at :1303 (k642). It frees lrt0->rt_gwroute at :1305
(k643), then unlocks lrt0 at :1308 (k651) before calling rtalloc1().
Meanwhile, the thread on CPU0 has entered the rt_check function and is
spinning on the lrt0 lock at line :1287 (k649). When CPU1 unlocks ltr0
(k651 above), lrt0->rt_gwroute is still pointing to the freed rtentry.
CPU0 then attempts to lock the now-freed lrt0->rt_gwroute and crashes.
So, the problem is that ltr0->rt_gwroute is being left in an
inconsistent state while ltr0 is unlocked. What's the solution? Zero
out rt_gwroute before unlocking lrt0, then do some extra checks after
re-locking to handle the case where another thread has called rtalloc1
before us, or something else? Or is there some other locking problem
higher up that's allowing rt_check to be called in parellel on the same
rtentry in the first place?
--
Dan Nelson
dnelson@allantgroup.com
--fUYQa+Pmc3FrFX/N
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="ktrdump.150"
index cpu timestamp file and line trace
------ --- ---------------- ---------------------------------------- -----
658 1 55794610845896 net/route.c:1316 0xcd38caa0 lock 0xc674d000 in rt_check
657 1 55794610828094 net/route.c:175 0xcd38caa0 lock 0xc7a49168 in rtalloc1
656 1 55794610826764 net/route.c:911 0xcd38caa0 unlock 0xc7a49168 in rtrequest1
655 1 55794610826184 net/route.c:909 0xcd38caa0 addref 0xc7a49168 1->2 in rtrequest1
654 1 55794610824041 netinet/if_ether.c:224 0xcd38caa0 addref 0xc7a49168 0->1 in arp_rtrequest
653 1 55794610811998 net/route.c:788 0xcd38caa0 lock 0xc7a49168 in rtrequest1
652 0 55794610803227 net/route.c:1303 0xcb107000 lock 0xc674ae88 in rt_check
651 1 55794610799304 net/route.c:1308 0xcd38caa0 unlock 0xc674d000 in rt_check
650 1 55794610797886 net/route.c:302 0xcd38caa0<rtfree 0xc674ae88 destroyed
649 0 55794610795429 net/route.c:1287 0xcb107000 lock 0xc674d000 in rt_check
648 1 55794610794813 net/route.c:300 0xcd38caa0 destroy 0xc674ae88 in rtfree
647 0 55794610794075 net/route.c:1285 0xcb107000>rt_check 0xc674d000
646 0 55794610791936 netinet/if_ether.c:365 0xcb107000>arpresolve 0xc674d000
645 1 55794610786793 net/route.c:245 0xcd38caa0 remref 0xc674ae88 1->0 in rtfree
644 1 55794610786128 net/route.c:238 0xcd38caa0>rtfree 0xc674ae88
643 1 55794610783781 net/route.c:1305 0xcd38caa0 free 0xc674ae88
642 1 55794610782649 net/route.c:1303 0xcd38caa0 lock 0xc674ae88 in rt_check
641 1 55794610780584 net/route.c:1287 0xcd38caa0 lock 0xc674d000 in rt_check
640 1 55794610779823 net/route.c:1285 0xcd38caa0>rt_check 0xc674d000
639 1 55794610777390 netinet/if_ether.c:365 0xcd38caa0>arpresolve 0xc674d000
638 0 55794610752772 net/route.c:119 0xcb107000 unlock 0xc674d000 in rtalloc_ign
637 0 55794610751654 net/route.c:198 0xcb107000 addref 0xc674d000 1->2 in rtalloc1
636 0 55794610750193 net/route.c:197 0xcb107000 lock 0xc674d000 in rtalloc1
635 1 55794610736882 net/route.c:119 0xcd38caa0 unlock 0xc674d000 in rtalloc_ign
634 1 55794610735429 net/route.c:198 0xcd38caa0 addref 0xc674d000 0->1 in rtalloc1
633 1 55794610733358 net/route.c:197 0xcd38caa0 lock 0xc674d000 in rtalloc1
--fUYQa+Pmc3FrFX/N
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="ktr.diff"
Index: net/route.c
===================================================================
RCS file: /home/ncvs/src/sys/net/route.c,v
retrieving revision 1.120
diff -u -r1.120 route.c
--- net/route.c 11 Jun 2007 12:19:34 -0000 1.120
+++ net/route.c 16 Oct 2007 03:01:18 -0000
@@ -40,7 +40,7 @@
#include <sys/socket.h>
#include <sys/domain.h>
#include <sys/kernel.h>
-
+#include <sys/ktr.h>
#include <net/if.h>
#include <net/route.h>
@@ -235,7 +235,7 @@
KASSERT(rt != NULL,("%s: NULL rt", __func__));
rnh = rt_tables[rt_key(rt)->sa_family];
KASSERT(rnh != NULL,("%s: NULL rnh", __func__));
-
+ CTR2(KTR_NET, "%p>rtfree %p",curthread, rt);
RT_LOCK_ASSERT(rt);
/*
@@ -299,13 +299,13 @@
*/
RT_LOCK_DESTROY(rt);
uma_zfree(rtzone, rt);
- return;
+ CTR2(KTR_NET, "%p<rtfree %p destroyed",curthread,rt); return;
}
-done:
+ done:
RT_UNLOCK(rt);
+ CTR2(KTR_NET, "%p<rtfree %p",curthread, rt);
}
-
/*
* Force a routing table entry to the specified
* destination to go through the given gateway.
@@ -1282,27 +1282,27 @@
KASSERT(*lrt0 != NULL, ("rt_check"));
rt = rt0 = *lrt0;
-
+ CTR2(KTR_NET, "%p>rt_check %p", curthread, rt);
/* NB: the locking here is tortuous... */
RT_LOCK(rt);
if ((rt->rt_flags & RTF_UP) == 0) {
RT_UNLOCK(rt);
- rt = rtalloc1(dst, 1, 0UL);
+ CTR2(KTR_NET, "%p discard %p in rt_check", curthread, rt); rt = rtalloc1(dst, 1, 0UL);
if (rt != NULL) {
RT_REMREF(rt);
/* XXX what about if change? */
- } else
- return (EHOSTUNREACH);
+ } else {
+ CTR1(KTR_NET, "%p<rt_check EHOSTUNREACH", curthread); return (EHOSTUNREACH);
+ }
rt0 = rt;
- }
- /* XXX BSD/OS checks dst->sa_family != AF_NS */
+ } /* XXX BSD/OS checks dst->sa_family != AF_NS */
if (rt->rt_flags & RTF_GATEWAY) {
if (rt->rt_gwroute == NULL)
goto lookup;
rt = rt->rt_gwroute;
RT_LOCK(rt); /* NB: gwroute */
if ((rt->rt_flags & RTF_UP) == 0) {
- RTFREE_LOCKED(rt); /* unlock gwroute */
+ CTR2(KTR_NET, "%p free %p", curthread, rt); RTFREE_LOCKED(rt); /* unlock gwroute */
rt = rt0;
lookup:
RT_UNLOCK(rt0);
@@ -1311,13 +1311,13 @@
rt0->rt_gwroute = NULL;
RT_REMREF(rt0);
RT_UNLOCK(rt0);
- return (ENETUNREACH);
+ CTR1(KTR_NET, "%p<rt_check ENETUNREACH",curthread); return (ENETUNREACH);
}
RT_LOCK(rt0);
rt0->rt_gwroute = rt;
if (rt == NULL) {
RT_UNLOCK(rt0);
- return (EHOSTUNREACH);
+ CTR1(KTR_NET, "%p<rt_check EHOSTUNREACH",curthread); return (EHOSTUNREACH);
}
}
RT_UNLOCK(rt0);
@@ -1328,12 +1328,12 @@
time_uptime < rt->rt_rmx.rmx_expire);
if (error) {
RT_UNLOCK(rt);
- return (rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
+ CTR1(KTR_NET, "%p<rt_check EHOSTUNREACH/DOWN",curthread); return (rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
}
*lrt = rt;
*lrt0 = rt0;
- return (0);
+ CTR3(KTR_NET, "%p<rt_check %p %p", curthread, rt, rt0); return (0);
}
/* This must be before ip6_init2(), which is now SI_ORDER_MIDDLE */
Index: net/route.h
===================================================================
RCS file: /home/ncvs/src/sys/net/route.h,v
retrieving revision 1.65
diff -u -r1.65 route.h
--- net/route.h 15 Mar 2006 19:39:09 -0000 1.65
+++ net/route.h 15 Oct 2007 18:54:32 -0000
@@ -288,21 +288,27 @@
#define RT_LOCK_INIT(_rt) \
mtx_init(&(_rt)->rt_mtx, "rtentry", NULL, MTX_DEF | MTX_DUPOK)
-#define RT_LOCK(_rt) mtx_lock(&(_rt)->rt_mtx)
-#define RT_UNLOCK(_rt) mtx_unlock(&(_rt)->rt_mtx)
-#define RT_LOCK_DESTROY(_rt) mtx_destroy(&(_rt)->rt_mtx)
+#define RT_LOCK(_rt) do { CTR3(KTR_NET,"%p lock %p in %s", curthread,_rt,__func__); mtx_lock(&(_rt)->rt_mtx); } while(0)
+#define RT_UNLOCK(_rt) do { CTR3(KTR_NET,"%p unlock %p in %s", curthread,_rt,__func__); mtx_unlock(&(_rt)->rt_mtx); } while(0)
+#define RT_LOCK_DESTROY(_rt) do { CTR3(KTR_NET,"%p destroy %p in %s", curthread,_rt,__func__); mtx_destroy(&(_rt)->rt_mtx); } while(0)
#define RT_LOCK_ASSERT(_rt) mtx_assert(&(_rt)->rt_mtx, MA_OWNED)
#define RT_ADDREF(_rt) do { \
RT_LOCK_ASSERT(_rt); \
KASSERT((_rt)->rt_refcnt >= 0, \
("negative refcnt %ld", (_rt)->rt_refcnt)); \
+ CTR5(KTR_NET,"%p addref %p %d->%d in %s", curthread, _rt, \
+ (_rt)->rt_refcnt, (_rt)->rt_refcnt+1, \
+ __func__); \
(_rt)->rt_refcnt++; \
} while (0)
#define RT_REMREF(_rt) do { \
RT_LOCK_ASSERT(_rt); \
KASSERT((_rt)->rt_refcnt > 0, \
("bogus refcnt %ld", (_rt)->rt_refcnt)); \
+ CTR5(KTR_NET,"%p remref %p %d->%d in %s", curthread, _rt, \
+ (_rt)->rt_refcnt, (_rt)->rt_refcnt-1, \
+ __func__); \
(_rt)->rt_refcnt--; \
} while (0)
Index: netinet/if_ether.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/if_ether.c,v
retrieving revision 1.162
diff -u -r1.162 if_ether.c
--- netinet/if_ether.c 7 Oct 2007 20:44:22 -0000 1.162
+++ netinet/if_ether.c 15 Oct 2007 21:02:46 -0000
@@ -362,20 +362,20 @@
struct sockaddr_dl *sdl;
int error;
- if (m->m_flags & M_BCAST) { /* broadcast */
+ CTR2(KTR_NET,"%p>arpresolve %p", curthread,rt0); if (m->m_flags & M_BCAST) { /* broadcast */
(void)memcpy(desten, ifp->if_broadcastaddr, ifp->if_addrlen);
- return (0);
+ CTR1(KTR_NET,"%p<arpresolve broadcast",curthread); return (0);
}
if (m->m_flags & M_MCAST && ifp->if_type != IFT_ARCNET) {/* multicast */
ETHER_MAP_IP_MULTICAST(&SIN(dst)->sin_addr, desten);
- return (0);
+ CTR1(KTR_NET,"%p<arpresolve multicast",curthread); return (0);
}
if (rt0 != NULL) {
error = rt_check(&rt, &rt0, dst);
if (error) {
m_freem(m);
- return error;
+ CTR2(KTR_NET,"%p<arpresolve errno %d", curthread, error); return error;
}
la = (struct llinfo_arp *)rt->rt_llinfo;
if (la == NULL)
@@ -392,7 +392,7 @@
"arpresolve: can't allocate route for %s\n",
inet_ntoa(SIN(dst)->sin_addr));
m_freem(m);
- return (EINVAL); /* XXX */
+ CTR1(KTR_NET,"%p<arpresolve can't allocate route",curthread); return (EINVAL); /* XXX */
}
la = (struct llinfo_arp *)rt->rt_llinfo;
if (la == NULL) {
@@ -401,7 +401,7 @@
"arpresolve: can't allocate llinfo for %s\n",
inet_ntoa(SIN(dst)->sin_addr));
m_freem(m);
- return (EINVAL); /* XXX */
+ CTR1(KTR_NET,"%p<arpresolve can't allocate llinfo",curthread); return (EINVAL); /* XXX */
}
}
sdl = SDL(rt->rt_gateway);
@@ -427,11 +427,11 @@
RT_UNLOCK(rt);
arprequest(ifp, &sin, &SIN(dst)->sin_addr,
IF_LLADDR(ifp));
- return (0);
+ CTR1(KTR_NET,"%p<arpresolve sent refresh arp",curthread); return (0);
}
RT_UNLOCK(rt);
- return (0);
+ CTR1(KTR_NET,"%p<arpresolve hit",curthread); return (0);
}
/*
* If ARP is disabled or static on this interface, stop.
@@ -442,7 +442,7 @@
if (ifp->if_flags & (IFF_NOARP | IFF_STATICARP)) {
RT_UNLOCK(rt);
m_freem(m);
- return (EINVAL);
+ CTR1(KTR_NET,"%p<arpresolve NOARP",curthread); return (EINVAL);
}
/*
* There is an arptab entry, but no ethernet address
@@ -480,7 +480,7 @@
} else
RT_UNLOCK(rt);
- return (error);
+ CTR2(KTR_NET,"%p<arpresolve %d",curthread, error); return (error);
}
/*
--fUYQa+Pmc3FrFX/N--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20071016163123.GA22599>
