Date: Fri, 11 Dec 2015 12:16:17 +0100 From: Hans Petter Selasky <hps@selasky.org> To: "Alexander V. Chernikov" <melifaro@ipfw.ru>, Adrian Chadd <adrian@freebsd.org>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org> Subject: Re: Race between arptimer() and lle removal [WAS: panic in arptimer in r289937] Message-ID: <566AB081.8050100@selasky.org> In-Reply-To: <2850091449828775@web21o.yandex.ru> References: null <CAJ-VmonvVyTNuYv_as41yPCFdfR5T3FE45DP9MKAc-eyzXzPUg@mail.gmail.com> <2739461446298483@web2h.yandex.ru> <566A94A1.60400@selasky.org> <2850091449828775@web21o.yandex.ru>
next in thread | previous in thread | raw e-mail | index | archive | help
On 12/11/15 11:12, Alexander V. Chernikov wrote: > 11.12.2015, 12:15, "Hans Petter Selasky" <hps@selasky.org>: >> Hi, >> >> Pulling the nail out of the haystack hopefully. >> >>>> Any ideas on where next to look? >> >> Adrian: In your dump aswell I see: >> >> la_flags = 1 >> >> That means there was a race calling arptimer() and removing the "lle". > Yes. The interesting part here is why lle is removed. There are quite a few reasons: either interface address deleted or interface going down, or explicit delete request. > That's why I asked Adrian about interface stuff (and haven't got a reply). >> >> Alexander: Can you comment on the following patch: >> >> > Index: netinet/if_ether.c >> > =================================================================== >> > --- netinet/if_ether.c (revision 291256) >> > +++ netinet/if_ether.c (working copy) >> > @@ -185,7 +185,13 @@ >> > LLE_WUNLOCK(lle); >> > return; >> > } >> > - ifp = lle->lle_tbl->llt_ifp; >> > + if (lle->la_flags & LLE_LINKED) { >> > + ifp = lle->lle_tbl->llt_ifp; >> > + } else { >> > + /* XXX RACE entry has been freed */ >> > + llentry_free(lle); >> > + return; >> > + } >> > CURVNET_SET(ifp->if_vnet); >> > >> > if ((lle->la_flags & LLE_DELETED) == 0) { >> >> We need a check in arptimer() that the lle is still linked before > Yes, I had exactly that approach in mind. (And nd6_llinfo_timer() needs the same fix). > So, would you commit it or should I? >> proceeding, in there from what I can see. Because the callback is not >> protected by a mutex, it is not atomically stopped by callout_stop(). Hi, Talking to Randall offlist, I see some more trouble. Let's get everything straight before making a fix. There is one more race I see: The start of the arptimer() callback looks like this: > static void > arptimer(void *arg) > { POINT0 > LLE_WLOCK(lle); > if (callout_pending(&lle->lle_timer)) { POINT1 > LLE_WUNLOCK(lle); > return; > } The code starting the callback looks like this: > LLE_ADDREF(la); > la->la_expire = time_uptime; > canceled = callout_reset(&la->lle_timer, hz * V_arpt_down, > arptimer, la); > if (canceled) > LLE_REMREF(la); Which can be written like this: > la->la_expire = time_uptime; > canceled = callout_reset(&la->lle_timer, hz * V_arpt_down, > arptimer, la); > if (canceled == 0) > LLE_ADDREF(la); In case we are at POINT0 in arptimer, callout_reset() will not be able to cancel the callout and will return 0. We should also drop one ref at POINT1, or rewrite the code a bit, which might need Randall's help in the callout subsystem area. --HPS
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?566AB081.8050100>