From owner-freebsd-stable@FreeBSD.ORG Wed Jun 22 13:03:53 2005
Message-ID: <42B961B9.7A5856B3@freebsd.org>
Date: Wed, 22 Jun 2005 15:03:53 +0200
From: Andre Oppermann
To: Gleb Smirnoff
Cc: qingli@FreeBSD.org, sam@FreeBSD.org, Jeremie Le Hen, freebsd-stable@FreeBSD.org
Subject: Re: panic in RELENG_5 UMA
References: <20050621070427.GA738@obiwan.tataz.chchile.org> <20050621090701.GB34406@cell.sick.ru> <20050621105154.GA36538@cell.sick.ru>

Gleb Smirnoff wrote:
>
> [cc'ing parties involved in this part of the code]
>
> On Tue, Jun 21, 2005 at 01:07:01PM +0400, Gleb Smirnoff wrote:
> T> On Tue, Jun 21, 2005 at 09:04:27AM +0200, Jeremie Le Hen wrote:
> T> J> #25 0xc05a0a0b in m_freem (mb=0x0) at uma.h:304
> T> J>         No locals.
> T> J> #26 0xc05ee0d5 in arpresolve (ifp=0xc1a5b000, rt0=0xc1d44000, m=0xc1be7200,
> T> J>     dst=0xd6d3fa94, desten=0xd6d3fa2c "/??]??????w??")
> T> J>     at ../../../netinet/if_ether.c:442
> T> J>         la = (struct llinfo_arp *) 0xc1a75a00
> T> J>         sdl = (struct sockaddr_dl *) 0xc2128910
> T> J>         error = -1038972656
> T> J>         rt = (struct rtentry *) 0xc1d44000
> T>
> T> IMHO, this looks like a race. The route is not locked when
> T> its llinfo is edited.
> T>
> T> Probably the mbuf was freed when the ARP reply arrived and la_hold was sent.
> T> Look into in_arpinput() near line 736:
> T>
> T>     (*ifp->if_output)(ifp, la->la_hold, rt_key(rt), rt);
> T>     la->la_hold = 0;
> T>
> T> Yeah, I have just triggered another panic running 15 instances of this script
> T> on an SMP box:
> T>
> T> (
> T>     while (true); do
> T>         arp -d 81.19.64.111 >/dev/null 2>&1;
> T>         ping -c 1 -t 1 81.19.64.111 >/dev/null 2>&1;
> T>     done
> T> ) &
> T>
> T> But my duplicate free is in fxp_txeof(). This means that the output thread has
> T> won the race.
>
> I suppose that the attached patch closes your race. However, there is still a
> race between RTM_DELETE and the output path. The above script still panics the
> kernel, but with a different panic.
> The output path works with an already freed llinfo:
>
> #28 0xc0507000 in m_freem (mb=0x0) at mbuf.h:410
> #29 0xc053fde3 in arpresolve (ifp=0xc2012800, rt0=0xc22fcdec, m=0xc25a8000, dst=0xe720bb28,
>     desten=0xe720bacc "uøbÀ+\001") at /usr/src/sys/netinet/if_ether.c:443
> #30 0xc0538078 in ether_output (ifp=0xc2012800, m=0xc25a8000, dst=0xe720bb28, rt0=0xc22fcdec)
>     at /usr/src/sys/net/if_ethersubr.c:173
> #31 0xc054b5b4 in ip_output (m=0xc25a8000, opt=0xc25a80ac, ro=0xe720bb24, flags=0x20, imo=0x0, inp=0xc25eb5a0)
>     at /usr/src/sys/netinet/ip_output.c:772
> #32 0xc054d36b in rip_output (m=0xc25a8000, so=0x0, dst=0x0) at /usr/src/sys/netinet/raw_ip.c:320
> #33 0xc054de7b in rip_send (so=0xc248c914, flags=0x0, m=0xc25a8000, nam=0xc218d410, control=0x0, td=0xc224d7d0)
>     at /usr/src/sys/netinet/raw_ip.c:785
> #34 0xc050a30f in sosend (so=0xc248c914, addr=0xc218d410, uio=0xe720bc3c, top=0xc25a8000, control=0x0, flags=0x0,
>     td=0xc224d7d0) at /usr/src/sys/kern/uipc_socket.c:827
>
> (kgdb) frame 29
> #29 0xc053fde3 in arpresolve (ifp=0xc2012800, rt0=0xc22fcdec, m=0xc25a8000, dst=0xe720bb28,
>     desten=0xe720bacc "uøbÀ+\001") at /usr/src/sys/netinet/if_ether.c:443
> 443             m_freem(la->la_hold);
> (kgdb) p *la
> $3 = {
>   la_le = {
>     le_next = 0xdeadc0de,
>     le_prev = 0xdeadc0de
>   },
>   la_rt = 0xdeadc0de,
>   la_hold = 0xdeadc0de,
>   la_preempt = 0xc0de,
>   la_asked = 0xdead
> }
>
> Fixing this one is harder. We take la from an unlocked rtentry obtained via
> rt_check(), or from arplookup(). The latter drops the lock on the rtentry, too.
> Then we do some work and use this la. It may already have been freed in
> arp_rtrequest(), in the RTM_DELETE case.
>
> I see two approaches here:
>
> 1) Protect llinfo with the route lock. In this case we need rt_check()
>    to return a locked *rt (just a reference won't help). We also need
>    arplookup() to return a locked rt, and must not unlock it for the whole
>    of arpresolve() and a large part of in_arpinput().

I think for 5-stable this is the way to go.

> 2) Add a mutex to llinfo_arp. I'm afraid this will hurt performance.

The new ARP code should fix these issues; however, it is not ready yet. At
the moment it looks like it won't make it into 6.0 right away, but will go
into 7-CURRENT first and then be MFC'd back for 6.1-RELEASE.

-- 
Andre
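Approach 1 (keep the route lock held for as long as the llinfo is used) can be illustrated outside the kernel. The following is a minimal user-space sketch under stated assumptions: `struct route`, `struct llinfo`, `output_park()`, `input_take_hold()` and `delete_llinfo()` are hypothetical names, a pthread mutex stands in for the route lock, and packets are opaque pointers. It is not the patch mentioned in the thread and it is not FreeBSD kernel code. It only shows the rule under discussion: the output path, the input path and the delete path all touch the llinfo under the same lock, and la_hold is detached and cleared in one locked step, so a parked packet cannot be both transmitted and freed.

```c
#include <assert.h>   /* used by the example assertions */
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins, NOT the kernel's struct llinfo_arp / rtentry. */
struct llinfo {
    void *hold;   /* like la_hold: packet parked while ARP is pending */
    int   asked;  /* like la_asked: requests sent so far */
};

struct route {
    pthread_mutex_t lock;    /* stands in for the route lock */
    struct llinfo  *llinfo;  /* protected by lock; NULL once deleted */
};

/* Output path (arpresolve analogue): the lock is held for the whole time
 * llinfo is dereferenced. Returns the packet the caller must free after
 * the lock is dropped: the previously parked one, or pkt itself if the
 * entry has already been deleted. */
static void *output_park(struct route *rt, void *pkt)
{
    void *to_free;

    pthread_mutex_lock(&rt->lock);
    if (rt->llinfo != NULL) {
        to_free = rt->llinfo->hold;   /* detach the old packet... */
        rt->llinfo->hold = pkt;       /* ...and park the new one */
        rt->llinfo->asked++;
    } else {
        to_free = pkt;                /* entry gone: drop the packet */
    }
    pthread_mutex_unlock(&rt->lock);
    return to_free;
}

/* Input path (in_arpinput analogue): take the parked packet and clear the
 * slot in one locked step. The caller transmits it after unlocking. */
static void *input_take_hold(struct route *rt)
{
    void *pkt = NULL;

    pthread_mutex_lock(&rt->lock);
    if (rt->llinfo != NULL) {
        pkt = rt->llinfo->hold;
        rt->llinfo->hold = NULL;
    }
    pthread_mutex_unlock(&rt->lock);
    return pkt;
}

/* Delete path (RTM_DELETE analogue): unlink llinfo under the same lock.
 * Every user re-reads rt->llinfo under the lock, so once it is unlinked
 * nobody else can reach it and freeing it after unlocking is safe.
 * Returns any parked packet for the caller to free. */
static void *delete_llinfo(struct route *rt)
{
    struct llinfo *la;
    void *held = NULL;

    pthread_mutex_lock(&rt->lock);
    la = rt->llinfo;
    rt->llinfo = NULL;
    pthread_mutex_unlock(&rt->lock);

    if (la != NULL) {
        held = la->hold;
        free(la);
    }
    return held;
}
```

In this sketch the returned packets are freed or transmitted only after the lock is released, which keeps the critical sections short. Approach 2 (a mutex inside llinfo_arp itself) is not modelled here, and neither is the performance question raised about it.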