Date:      Wed, 19 Dec 2007 13:08:31 +0100
From:      Maxime Henrion <mux@FreeBSD.org>
To:        net@FreeBSD.org
Cc:        Gleb Smirnoff <glebius@FreeBSD.org>, Julian Elischer <julian@elischer.org>
Subject:   Re: Deadlock in the routing code
Message-ID:  <20071219120831.GN71713@elvis.mu.org>
In-Reply-To: <20071217101009.GL71713@elvis.mu.org>
References:  <20071213133817.GC71713@elvis.mu.org> <47617AF5.7070701@elischer.org> <20071214092539.GB14339@glebius.int.ru> <4762DD82.9070904@elischer.org> <20071217101009.GL71713@elvis.mu.org>


--jCrbxBqMcLqd4mOl
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Maxime Henrion wrote:
> Julian Elischer wrote:
> > Gleb Smirnoff wrote:
> > >On Thu, Dec 13, 2007 at 10:33:25AM -0800, Julian Elischer wrote:
> > >J>  Maxime Henrion wrote:
> > >J> > Replying to myself on this one, sorry about that.
> > >J> > I said in my previous mail that I didn't know yet what process was
> > >J> > holding the lock of the rtentry that the routed process is dealing
> > >J> > with in rt_setgate(), and I just could verify that it is held by
> > >J> > the swi1: net thread.
> > >J> > So, in a nutshell:
> > >J> > - The routed process does its business on the routing socket, which
> > >J> >   ends up calling rt_setgate().  While in rt_setgate() it drops the
> > >J> >   lock on its rtentry in order to call rtalloc1().  At this point,
> > >J> >   the routed process holds the gateway route (rtalloc1() returns it
> > >J> >   locked), and it now tries to re-lock the original rtentry.
> > >J> > - At the same time, the swi net thread calls arpresolve(), which ends
> > >J> >   up calling rt_check().  Then rt_check() locks the rtentry, and tries
> > >J> >   to lock the gateway route.
> > >J> > A classical case of deadlock with mutexes because of different locking
> > >J> > order.  Now, it's not obvious to me how to fix it :-).
> > >J> 
> > >J>  On failure to re-lock, the routed call to rt_setgate should completely
> > >J>  abort and restart from scratch, releasing all locks it has on the way
> > >J>  out.
> > >
> > >Do you suggest mtx_trylock?
> > 
> > I think that would be the cleanest way..
> 
> So, here's what I've got.  I have yet to test it at all, I hope that
> I'll be able to do so today, or tomorrow.  Any input appreciated.

It appears that this patch fixes the problem.  My gateway server
now has nearly two days of uptime, whereas previously it would
probably have crashed already.  I'm attaching the final version of
the patch here, since the last one had build-time errors.  I'm going
to commit this to HEAD soon unless someone has an objection to it.

Cheers,
Maxime

--jCrbxBqMcLqd4mOl
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="rt_setgate.patch"

--- route.h.orig	Tue Apr  4 22:07:23 2006
+++ route.h	Mon Dec 17 13:11:44 2007
@@ -289,6 +289,7 @@
 #define	RT_LOCK_INIT(_rt) \
 	mtx_init(&(_rt)->rt_mtx, "rtentry", NULL, MTX_DEF | MTX_DUPOK)
 #define	RT_LOCK(_rt)		mtx_lock(&(_rt)->rt_mtx)
+#define	RT_TRYLOCK(_rt)		mtx_trylock(&(_rt)->rt_mtx)
 #define	RT_UNLOCK(_rt)		mtx_unlock(&(_rt)->rt_mtx)
 #define	RT_LOCK_DESTROY(_rt)	mtx_destroy(&(_rt)->rt_mtx)
 #define	RT_LOCK_ASSERT(_rt)	mtx_assert(&(_rt)->rt_mtx, MA_OWNED)
--- route.c.orig	Tue Oct 30 19:07:54 2007
+++ route.c	Mon Dec 17 15:13:20 2007
@@ -996,6 +996,7 @@
 	struct radix_node_head *rnh = rt_tables[dst->sa_family];
 	int dlen = SA_SIZE(dst), glen = SA_SIZE(gate);
 
+again:
 	RT_LOCK_ASSERT(rt);
 
 	/*
@@ -1029,7 +1030,15 @@
 			RT_REMREF(rt);
 			return (EADDRINUSE); /* failure */
 		}
-		RT_LOCK(rt);
+		/*
+		 * Try to reacquire the lock on rt, and if it fails,
+		 * clean state and restart from scratch.
+		 */
+		if (!RT_TRYLOCK(rt)) {
+			RTFREE_LOCKED(gwrt);
+			RT_LOCK(rt);
+			goto again;
+		}
 		/*
 		 * If there is already a gwroute, then drop it. If we
 		 * are asked to replace route with itself, then do

--jCrbxBqMcLqd4mOl--
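
For readers who want to experiment with the pattern outside the kernel,
the following is a minimal userland sketch of the try-lock-and-restart
approach used in the patch above.  It is an illustration only, not the
kernel code: POSIX mutexes stand in for mtx(9), and struct entry,
set_gateway(), check_route() and run_demo() are hypothetical names with
no counterpart in route.c.  One thread takes the locks in the order
gateway, then route (as rt_setgate() does after rtalloc1()), the other
in the order route, then gateway (as rt_check() does).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an rtentry: a mutex plus a counter. */
struct entry {
	pthread_mutex_t mtx;
	int updates;
};

static struct entry rt = { PTHREAD_MUTEX_INITIALIZER, 0 };
static struct entry gw = { PTHREAD_MUTEX_INITIALIZER, 0 };

/*
 * Called with r locked and returns with r locked, like rt_setgate().
 * Returns the number of times it had to back off and restart.
 */
static int
set_gateway(struct entry *r, struct entry *g)
{
	int retries = 0;

again:
	pthread_mutex_unlock(&r->mtx);	/* drop r before taking g */
	pthread_mutex_lock(&g->mtx);	/* stands in for rtalloc1() */
	if (pthread_mutex_trylock(&r->mtx) != 0) {
		/* r is busy: never sleep on it while still holding g. */
		pthread_mutex_unlock(&g->mtx);
		pthread_mutex_lock(&r->mtx);
		retries++;
		goto again;
	}
	r->updates++;			/* both locks held here */
	pthread_mutex_unlock(&g->mtx);
	return (retries);
}

/* Opposite lock order, as in rt_check(): r first, then g. */
static void
check_route(struct entry *r, struct entry *g)
{
	pthread_mutex_lock(&r->mtx);
	pthread_mutex_lock(&g->mtx);
	g->updates++;
	pthread_mutex_unlock(&g->mtx);
	pthread_mutex_unlock(&r->mtx);
}

static void *
setter_thread(void *arg)
{
	int i, n = *(int *)arg;

	for (i = 0; i < n; i++) {
		pthread_mutex_lock(&rt.mtx);
		(void)set_gateway(&rt, &gw);
		pthread_mutex_unlock(&rt.mtx);
	}
	return (NULL);
}

static void *
checker_thread(void *arg)
{
	int i, n = *(int *)arg;

	for (i = 0; i < n; i++)
		check_route(&rt, &gw);
	return (NULL);
}

/* Runs both threads n times each; returns 1 if every update landed. */
int
run_demo(int n)
{
	pthread_t a, b;
	int rc;

	rt.updates = gw.updates = 0;
	rc = pthread_create(&a, NULL, setter_thread, &n);
	assert(rc == 0);
	rc = pthread_create(&b, NULL, checker_thread, &n);
	assert(rc == 0);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return (rt.updates == n && gw.updates == n);
}
```

Calling run_demo() from a small main() and building with -pthread
should be enough to try it.  Replacing the trylock with a plain
pthread_mutex_lock() brings back the opposite-order acquisition
described earlier in the thread, and the two threads can then block
each other forever.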


