Date: Wed, 17 Nov 2004 23:22:56 +0000
From: Ian Dowse <iedowse@maths.tcd.ie>
To: arch@freebsd.org
Subject: Simpler SMP-friendly callout/timeout semantics
Message-ID: <200411172322.aa34520@salmon.maths.tcd.ie>
The current callout/timeout API supports Giant-locked and MP-safe usage
options, but in both cases it is very difficult to use the API in a way
that is free of race conditions and complications. Below is an attempt
to extend that API to offer much simpler semantics to many users while
not limiting its usefulness or performance.

A summary of the difficulties faced when using the current callout API
can be found at:

	http://people.freebsd.org/~iedowse/callout/callout_current.txt

This subject has been discussed a few times before (e.g., search for
"Timeout and SMP race"). One of the suggestions, made by Archie Cobbs,
was to supply a mutex when the callout is created and have the callout
code use that to avoid many of the races. This is what I've attempted
to do. The patch below adds a new API function:

	callout_init_mtx(struct callout *c, struct mtx *mtx, int flags);

If a non-NULL `mtx' is supplied, then the callout system will acquire
that mutex before invoking the callout handler. Normally the handler
must return with the mutex still locked, but if the
CALLOUT_RETURNUNLOCKED flag is supplied, then the handler should unlock
the mutex itself before returning.

Since the callout system acquires the mutex itself, it can check
whether the callout has been cancelled while holding the mutex, before
invoking the handler or dereferencing the callout function pointer or
argument. When combined with a requirement that the mutex be held
during calls to callout_stop() and callout_reset(), this entirely
eliminates the races, so callout_stop() and callout_reset() reliably do
what you'd expect.

The semantics of callouts initialised with the new function are:

 o When calling callout_stop() or callout_reset() on the callout, the
   specified mutex must be held.

 o Calls to callout_stop() and callout_reset() guarantee that the
   handler will not be invoked, even if it has already been de-queued
   by softclock().
   There are no races, since the mutex is held by the caller of
   callout_stop() or callout_reset(). When the CALLOUT_RETURNUNLOCKED
   flag is used, it is however possible that the handler function is
   still running, but because the caller holds the mutex, it is known
   that the handler, if active, has at least got past the
   mtx_unlock(). These functions do not sleep.

 o Before destroying the specified mutex, callout_drain() must be
   called. Since callout_drain() may sleep, the mutex may not be held
   at this time. If the callout has not been cancelled already, then
   it might be invoked while callout_drain() executes. To avoid this,
   call callout_stop() before releasing the mutex and then call
   callout_drain() afterwards. The call to callout_drain() is
   necessary to ensure that softclock() no longer references the mutex
   (e.g. it might still be waiting to acquire it).

The patch also changes the semantics of all existing Giant-locked
callouts to match the above, with a mutex of Giant implicitly
specified. This means that Giant-locked users MUST hold Giant while
calling callout_stop(), callout_reset(), timeout() and untimeout(),
which will require a number of changes in various parts of the kernel.
It also means that fully Giant-locked code gets the race-free semantics
with no source changes.

The new API function handles some cases very well:

 - Code that uses just one mutex per instance, such as many device
   drivers.

 - Giant-locked code, because again there is just one mutex involved.

However, there are more complex uses of the callout API that currently
use more than one mutex due to lock ordering requirements. A good
example is the TCP code, where the timeout routines can acquire a
global TCP mutex before the per-connection mutex, so this new scheme on
its own cannot be used directly. Maybe it is possible to rework the TCP
code to avoid this, or even to extend the callout API further so that
an alternative locking routine can be supplied, if that helps.
Do people think that a change like this is worth attempting? Are there
further improvements that could be made, or any other suggestions or
comments?

Ian

Additional stuff:

Patch to make the "re" driver use the new API (currently it uses
timeout/untimeout without locking Giant):

	http://people.freebsd.org/~iedowse/callout/callout_re.diff

Patch that attempts to close a number of timeout races in the ATA code
using the new API:

	http://people.freebsd.org/~iedowse/callout/callout_ata.diff

Finally, the callout patch itself:

Index: sys/kern/kern_timeout.c
===================================================================
RCS file: /dump/FreeBSD-CVS/src/sys/kern/kern_timeout.c,v
retrieving revision 1.91
diff -u -r1.91 kern_timeout.c
--- sys/kern/kern_timeout.c	6 Aug 2004 21:49:00 -0000	1.91
+++ sys/kern/kern_timeout.c	17 Nov 2004 02:31:28 -0000
@@ -53,6 +53,9 @@
 static int avg_gcalls;
 SYSCTL_INT(_debug, OID_AUTO, to_avg_gcalls, CTLFLAG_RD, &avg_gcalls, 0,
     "Average number of Giant callouts made per softclock call. Units = 1/1000");
+static int avg_mtxcalls;
+SYSCTL_INT(_debug, OID_AUTO, to_avg_mtxcalls, CTLFLAG_RD, &avg_mtxcalls, 0,
+    "Average number of mtx callouts made per softclock call. Units = 1/1000");
 static int avg_mpcalls;
 SYSCTL_INT(_debug, OID_AUTO, to_avg_mpcalls, CTLFLAG_RD, &avg_mpcalls, 0,
     "Average number of MP callouts made per softclock call. Units = 1/1000");
@@ -80,6 +83,12 @@
  *			If curr_callout is non-NULL, threads waiting on
  *			callout_wait will be woken up as soon as the
  *			relevant callout completes.
+ * curr_cancelled	- Changing to 1 with both callout_lock and c_mtx held
+ *			guarantees that the current callout will not run.
+ *			The softclock() function sets this to 0 before it
+ *			drops callout_lock to acquire c_mtx, and it calls
+ *			the handler only if curr_cancelled is still 0 when
+ *			c_mtx is successfully acquired.
  * wakeup_ctr		- Incremented every time a thread wants to wait
  *			for a callout to complete.  Modified only when
  *			curr_callout is non-NULL.
@@ -88,6 +97,7 @@
  *			cutt_callout is non-NULL.
  */
 static struct callout *curr_callout;
+static int curr_cancelled;
 static int wakeup_ctr;
 static int wakeup_needed;

@@ -181,6 +191,7 @@
 	int steps;	/* #steps since we last allowed interrupts */
 	int depth;
 	int mpcalls;
+	int mtxcalls;
 	int gcalls;
 	int wakeup_cookie;
 #ifdef DIAGNOSTIC
@@ -195,6 +206,7 @@
 #endif /* MAX_SOFTCLOCK_STEPS */

 	mpcalls = 0;
+	mtxcalls = 0;
 	gcalls = 0;
 	depth = 0;
 	steps = 0;
@@ -225,12 +237,14 @@
 		} else {
 			void (*c_func)(void *);
 			void *c_arg;
+			struct mtx *c_mtx;
 			int c_flags;

 			nextsoftcheck = TAILQ_NEXT(c, c_links.tqe);
 			TAILQ_REMOVE(bucket, c, c_links.tqe);
 			c_func = c->c_func;
 			c_arg = c->c_arg;
+			c_mtx = c->c_mtx;
 			c_flags = c->c_flags;
 			c->c_func = NULL;
 			if (c->c_flags & CALLOUT_LOCAL_ALLOC) {
@@ -242,11 +256,32 @@
 				    (c->c_flags & ~CALLOUT_PENDING);
 			}
 			curr_callout = c;
+			curr_cancelled = 0;
 			mtx_unlock_spin(&callout_lock);
-			if (!(c_flags & CALLOUT_MPSAFE)) {
-				mtx_lock(&Giant);
-				gcalls++;
-				CTR1(KTR_CALLOUT, "callout %p", c_func);
+			if (c_mtx != NULL) {
+				mtx_lock(c_mtx);
+				/*
+				 * The callout may have been cancelled
+				 * while we switched locks.
+				 */
+				if (curr_cancelled) {
+					mtx_unlock(c_mtx);
+					mtx_lock_spin(&callout_lock);
+					goto done_locked;
+				}
+				/* The callout cannot be stopped now. */
+				curr_cancelled = 1;
+
+				if (c_mtx == &Giant) {
+					gcalls++;
+					CTR1(KTR_CALLOUT, "callout %p",
+					    c_func);
+				} else {
+					mtxcalls++;
+					CTR1(KTR_CALLOUT,
+					    "callout mtx %p",
+					    c_func);
+				}
 			} else {
 				mpcalls++;
 				CTR1(KTR_CALLOUT, "callout mpsafe %p",
@@ -275,9 +310,10 @@
 				lastfunc = c_func;
 			}
 #endif
-			if (!(c_flags & CALLOUT_MPSAFE))
-				mtx_unlock(&Giant);
+			if ((c_flags & CALLOUT_RETURNUNLOCKED) == 0)
+				mtx_unlock(c_mtx);
 			mtx_lock_spin(&callout_lock);
+done_locked:
 			curr_callout = NULL;
 			if (wakeup_needed) {
 				/*
@@ -300,6 +336,7 @@
 	}
 	avg_depth += (depth * 1000 - avg_depth) >> 8;
 	avg_mpcalls += (mpcalls * 1000 - avg_mpcalls) >> 8;
+	avg_mtxcalls += (mtxcalls * 1000 - avg_mtxcalls) >> 8;
 	avg_gcalls += (gcalls * 1000 - avg_gcalls) >> 8;
 	nextsoftcheck = NULL;
 	mtx_unlock_spin(&callout_lock);
@@ -395,15 +432,26 @@
 	void *arg;
 {

+	if (c->c_mtx != NULL)
+		mtx_assert(c->c_mtx, MA_OWNED);
+
 	mtx_lock_spin(&callout_lock);
-	if (c == curr_callout && wakeup_needed) {
+	if (c == curr_callout) {
 		/*
 		 * We're being asked to reschedule a callout which is
-		 * currently in progress, and someone has called
-		 * callout_drain to kill that callout.  Don't reschedule.
+		 * currently in progress.  If there is a mutex then we
+		 * can cancel the callout if it has not really started.
 		 */
-		mtx_unlock_spin(&callout_lock);
-		return;
+		if (c->c_mtx != NULL && !curr_cancelled)
+			curr_cancelled = 1;
+		if (wakeup_needed) {
+			/*
+			 * Someone has called callout_drain to kill this
+			 * callout.  Don't reschedule.
+			 */
+			mtx_unlock_spin(&callout_lock);
+			return;
+		}
 	}
 	if (c->c_flags & CALLOUT_PENDING) {
 		if (nextsoftcheck == c) {
@@ -446,13 +494,20 @@
 {
 	int wakeup_cookie;

+	if (!safe && c->c_mtx != NULL)
+		mtx_assert(c->c_mtx, MA_OWNED);
+
 	mtx_lock_spin(&callout_lock);
 	/*
 	 * Don't attempt to delete a callout that's not on the queue.
 	 */
 	if (!(c->c_flags & CALLOUT_PENDING)) {
 		c->c_flags &= ~CALLOUT_ACTIVE;
-		if (c == curr_callout && safe) {
+		if (c != curr_callout) {
+			mtx_unlock_spin(&callout_lock);
+			return (0);
+		}
+		if (safe) {
 			/* We need to wait until the callout is finished. */
 			wakeup_needed = 1;
 			wakeup_cookie = wakeup_ctr++;
@@ -468,6 +523,11 @@
 			cv_wait(&callout_wait, &callout_wait_lock);

 			mtx_unlock(&callout_wait_lock);
+		} else if (c->c_mtx != NULL && !curr_cancelled) {
+			/* We can stop the callout before it runs. */
+			curr_cancelled = 1;
+			mtx_unlock_spin(&callout_lock);
+			return (1);
 		} else
 			mtx_unlock_spin(&callout_lock);
 		return (0);
@@ -493,8 +553,29 @@
 	int mpsafe;
 {
 	bzero(c, sizeof *c);
-	if (mpsafe)
-		c->c_flags |= CALLOUT_MPSAFE;
+	if (mpsafe) {
+		c->c_mtx = NULL;
+		c->c_flags = CALLOUT_RETURNUNLOCKED;
+	} else {
+		c->c_mtx = &Giant;
+		c->c_flags = 0;
+	}
+}
+
+void
+callout_init_mtx(c, mtx, flags)
+	struct callout *c;
+	struct mtx *mtx;
+	int flags;
+{
+	bzero(c, sizeof *c);
+	c->c_mtx = mtx;
+	KASSERT((flags & ~CALLOUT_RETURNUNLOCKED) == 0,
+	    ("callout_init_mtx: bad flags %d", flags));
+	/* CALLOUT_RETURNUNLOCKED makes no sense without a mutex. */
+	KASSERT(mtx != NULL || (flags & CALLOUT_RETURNUNLOCKED) == 0,
+	    ("callout_init_mtx: CALLOUT_RETURNUNLOCKED with no mutex"));
+	c->c_flags = flags & CALLOUT_RETURNUNLOCKED;
 }

 #ifdef APM_FIXUP_CALLTODO
Index: sys/sys/callout.h
===================================================================
RCS file: /dump/FreeBSD-CVS/src/sys/sys/callout.h,v
retrieving revision 1.27
diff -u -r1.27 callout.h
--- sys/sys/callout.h	20 Apr 2004 15:49:31 -0000	1.27
+++ sys/sys/callout.h	17 Nov 2004 01:12:01 -0000
@@ -40,6 +40,8 @@

 #include <sys/queue.h>

+struct mtx;
+
 SLIST_HEAD(callout_list, callout);
 TAILQ_HEAD(callout_tailq, callout);

@@ -51,6 +53,7 @@
 	int	c_time;			/* ticks to the event */
 	void	*c_arg;			/* function argument */
 	void	(*c_func)(void *);	/* function to call */
+	struct mtx *c_mtx;		/* mutex to lock */
 	int	c_flags;		/* state of this entry */
 };

@@ -58,6 +61,7 @@
 #define	CALLOUT_ACTIVE		0x0002 /* callout is currently active */
 #define	CALLOUT_PENDING		0x0004 /* callout is waiting for timeout */
 #define	CALLOUT_MPSAFE		0x0008 /* callout handler is mp safe */
+#define	CALLOUT_RETURNUNLOCKED	0x0010 /* handler returns with mtx unlocked */

 struct callout_handle {
 	struct callout *callout;
@@ -75,6 +79,7 @@
 #define	callout_deactivate(c)	((c)->c_flags &= ~CALLOUT_ACTIVE)
 #define	callout_drain(c)	_callout_stop_safe(c, 1)
 void	callout_init(struct callout *, int);
+void	callout_init_mtx(struct callout *, struct mtx *, int);
 #define	callout_pending(c)	((c)->c_flags & CALLOUT_PENDING)
 void	callout_reset(struct callout *, int, void (*)(void *), void *);
 #define	callout_stop(c)	_callout_stop_safe(c, 0)
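To make adoption concrete, here is a minimal sketch of how a per-instance driver might use the proposed API under the rules described above. All of the mydev_* names are invented for illustration, and this is kernel-style pseudocode rather than a buildable module:

```c
struct mydev_softc {
	struct mtx	sc_mtx;		/* per-instance driver lock */
	struct callout	sc_timer;
};

static void
mydev_attach(struct mydev_softc *sc)
{
	mtx_init(&sc->sc_mtx, "mydev", NULL, MTX_DEF);
	/* The handler will be invoked with sc_mtx held. */
	callout_init_mtx(&sc->sc_timer, &sc->sc_mtx, 0);

	mtx_lock(&sc->sc_mtx);
	callout_reset(&sc->sc_timer, hz, mydev_tick, sc);
	mtx_unlock(&sc->sc_mtx);
}

static void
mydev_tick(void *arg)
{
	struct mydev_softc *sc = arg;

	mtx_assert(&sc->sc_mtx, MA_OWNED);
	/* ... per-tick work, then reschedule ... */
	callout_reset(&sc->sc_timer, hz, mydev_tick, sc);
	/* No CALLOUT_RETURNUNLOCKED, so return with sc_mtx still held. */
}

static void
mydev_detach(struct mydev_softc *sc)
{
	/* Stop under the mutex, then drain without it, as described above. */
	mtx_lock(&sc->sc_mtx);
	callout_stop(&sc->sc_timer);
	mtx_unlock(&sc->sc_mtx);
	callout_drain(&sc->sc_timer);	/* may sleep */
	mtx_destroy(&sc->sc_mtx);
}
```

With this pattern the driver never needs its own "is the timer running?" bookkeeping: callout_stop() under sc_mtx guarantees the handler will not run, and callout_drain() guarantees softclock() no longer references sc_mtx before it is destroyed.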