From owner-freebsd-current@FreeBSD.ORG  Mon Dec 29 20:02:45 2014
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 33DC0B06
 for <freebsd-current@freebsd.org>; Mon, 29 Dec 2014 20:02:45 +0000 (UTC)
Received: from mail.turbocat.net (heidi.turbocat.net [88.198.202.214])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 9F67835BF
 for <freebsd-current@freebsd.org>; Mon, 29 Dec 2014 20:02:44 +0000 (UTC)
Received: from laptop015.home.selasky.org (31.89-11-148.nextgentel.com
 [89.11.148.31])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by mail.turbocat.net (Postfix) with ESMTPSA id E64701FE022
 for <freebsd-current@freebsd.org>; Mon, 29 Dec 2014 21:02:41 +0100 (CET)
Message-ID: <54A1B38C.1000709@selasky.org>
Date: Mon, 29 Dec 2014 21:03:24 +0100
From: Hans Petter Selasky <hps@selasky.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: FreeBSD Current <freebsd-current@freebsd.org>
Subject: [RFC] kern/kern_timeout.c rewrite in progress
Content-Type: multipart/mixed; boundary="------------080102030604030205080208"
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 29 Dec 2014 20:02:45 -0000

This is a multi-part message in MIME format.
--------------080102030604030205080208
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Hi,

I recently came across a class of errors which led me to investigate
"kern/kern_timeout.c" and its subsystem. From what I can see, new
features like SMP awareness have been "added" instead of fully
"integrated". Going into the corner cases, I found that the internal
callout state machine can sometimes report wrong values to its clients
via the callout_active() and callout_pending() bits, which in turn can
make the clients misbehave. I further investigated how the safety of
callout migration between CPUs is maintained. When I looked into the
code and found things like "volatile" and "while()" loops used to
figure out which CPU a callout belongs to, I understood that such logic
completely undermines the cleverness found in the turnstiles of mutexes,
and decided to go through all of the logic inside "kern_timeout.c".
Static code analysis is also harder when we don't use the basic mutexes
and condition variables available in the kernel.

First of all, we need to establish some ground rules for everyone:

1) A new feature called direct callbacks, which executes the timer
callbacks from the fast interrupt handler, was added. All these
callbacks _must_ be associated with a regular spinlock to maintain a
safe callout_drain(). Otherwise they should only be executed on CPU0.

2) All Giant locked callbacks should only execute on CPU0 to avoid 
congestion.

3) Callbacks using read-only locks should also only execute on CPU0,
to avoid multiple instances pending for completion on multiple CPUs,
because read-only locks can be entered multiple times. From what I can
see, there are currently no consumers of this feature in the kernel.
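
The three rules above can be sketched as a single predicate. This is
only an illustration in userland C, not code from the patch: the
"lock_kind" enum, "struct callout_desc" and callout_pinned_to_cpu0()
are made-up names, and callbacks without any lock are not covered by
the rules, so the sketch leaves them unpinned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock kinds; these are not the kernel's lock classes. */
enum lock_kind {
	LOCK_NONE,
	LOCK_SPIN,
	LOCK_GIANT,
	LOCK_SHARED,
	LOCK_EXCLUSIVE
};

/* Hypothetical description of a callout, for illustration only. */
struct callout_desc {
	enum lock_kind lock;	/* lock associated with the callback */
	bool direct;		/* runs from the fast interrupt handler */
};

/* Returns true when the rules say the callback must stay on CPU0. */
static bool
callout_pinned_to_cpu0(const struct callout_desc *cd)
{
	assert(cd != NULL);

	/* Rule 1: direct callbacks need a spinlock, else CPU0 only. */
	if (cd->direct && cd->lock != LOCK_SPIN)
		return (true);
	/* Rule 2: Giant locked callbacks run on CPU0 only. */
	if (cd->lock == LOCK_GIANT)
		return (true);
	/* Rule 3: read-only (shared) locked callbacks run on CPU0 only. */
	if (cd->lock == LOCK_SHARED)
		return (true);
	/* Everything else is not pinned by these three rules. */
	return (false);
}
```

A caller would fill in a "struct callout_desc" and only allow a CPU
other than 0 when the predicate returns false.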

Secondly, we need a way to drain callbacks asynchronously, for example 
like in the TCP stack. Currently the TCP shutdown sequence for a 
connection looks like this:

...
void
pcb_teardown(pcb)
{
    lock(pcb);
    callout_stop(&pcb->c1);
    callout_stop(&pcb->c2);
    callout_stop(&pcb->c3);
    unlock(pcb);

    free(pcb);
}

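I guess some of you see what is wrong: use after free. callout_stop()
does not wait for a callback that is already running, so such a
callback can still access the pcb after it has been freed.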

...

With a new function I've added, scenarios like this can be solved more
elegantly and without having to fear use-after-free situations!

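static void
pcb_callback(void *arg)
{
   struct pcb *pcb = arg;	/* "struct pcb" is a placeholder type */
   int do_free;

   lock(pcb);
   /* drop one reference; the last one out frees the pcb */
   do_free = (--(pcb->refcount) == 0);
   unlock(pcb);
   if (do_free == 0)
	return;
   destroy_mtx(pcb);
   free(pcb);
}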

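void
pcb_teardown(pcb)
{
    lock(pcb);
    /* one reference per callout, plus one for this function */
    pcb->refcount = 4;
    /* a zero return means no drain callback will follow */
    if (callout_drain_async(&pcb->c1, pcb_callback, pcb) == 0)
	pcb->refcount--;
    if (callout_drain_async(&pcb->c2, pcb_callback, pcb) == 0)
	pcb->refcount--;
    if (callout_drain_async(&pcb->c3, pcb_callback, pcb) == 0)
	pcb->refcount--;
    unlock(pcb);

    /* drop our own reference; frees the pcb if it was the last one */
    pcb_callback(pcb);
}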
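
To make the reference counting easier to follow, here is a small
single-threaded userland model of the teardown pattern. It is a sketch
under stated assumptions and not the kernel code: every "fake_" name is
made up, locking is left out because nothing runs concurrently, and a
"freed" flag stands in for the real free() so the result can be
inspected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void drain_fn_t(void *);

/* Stand-in for a callout; only models "is its callback running". */
struct fake_callout {
	bool running;
	drain_fn_t *drain_fn;
	void *drain_arg;
};

struct fake_pcb {
	struct fake_callout c[3];
	int refcount;
	bool freed;		/* set instead of calling free() */
};

/* Returns nonzero when "fn" will be called later, zero otherwise. */
static int
fake_drain_async(struct fake_callout *c, drain_fn_t *fn, void *arg)
{
	if (!c->running)
		return (0);
	c->drain_fn = fn;
	c->drain_arg = arg;
	return (1);
}

/* Models the callout subsystem completing a running callback. */
static void
fake_callout_finish(struct fake_callout *c)
{
	drain_fn_t *fn = c->drain_fn;
	void *arg = c->drain_arg;

	assert(c->running);
	c->running = false;
	c->drain_fn = NULL;
	c->drain_arg = NULL;
	if (fn != NULL)
		fn(arg);
}

/* Drops one reference; the last reference marks the pcb as freed. */
static void
fake_pcb_release(void *arg)
{
	struct fake_pcb *pcb = arg;

	assert(pcb->refcount > 0);
	if (--pcb->refcount == 0)
		pcb->freed = true;
}

static void
fake_pcb_teardown(struct fake_pcb *pcb)
{
	int i;

	/* one reference per callout, plus one for the caller */
	pcb->refcount = 4;
	for (i = 0; i < 3; i++) {
		if (fake_drain_async(&pcb->c[i], fake_pcb_release, pcb) == 0)
			pcb->refcount--;
	}
	/* drop the caller's reference */
	fake_pcb_release(pcb);
}
```

If one callback is still running when fake_pcb_teardown() is called,
the pcb stays alive until fake_callout_finish() runs for that callout.
If none is running, teardown releases the last reference itself.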

Please find attached a patch for 11-current as of today. Comments and
feedback are appreciated.

BTW: Happy New Year everyone!

--HPS

--------------080102030604030205080208
Content-Type: text/x-patch;
 name="kern_timeout_wip.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="kern_timeout_wip.diff"

Index: sys/kern/kern_timeout.c
===================================================================
--- sys/kern/kern_timeout.c	(revision 276240)
+++ sys/kern/kern_timeout.c	(working copy)
@@ -125,36 +125,56 @@
 u_int callwheelsize, callwheelmask;
 
 /*
- * The callout cpu exec entities represent informations necessary for
- * describing the state of callouts currently running on the CPU and the ones
- * necessary for migrating callouts to the new callout cpu. In particular,
- * the first entry of the array cc_exec_entity holds informations for callout
- * running in SWI thread context, while the second one holds informations
- * for callout running directly from hardware interrupt context.
- * The cached informations are very important for deferring migration when
- * the migrating callout is already running.
+ * The callout CPU exec structure represent information necessary for
+ * describing the state of callouts currently running on the CPU and
+ * for handling deferred callout restarts.
+ *
+ * In particular, the first entry of the array cc_exec_entity holds
+ * information for callouts running from the SWI thread context, while
+ * the second one holds information for callouts running directly from
+ * the hardware interrupt context.
  */
 struct cc_exec {
-	struct callout		*cc_next;
+	/*
+	 * The "cc_curr" points to the currently executing callout and
+	 * is write protected by the "cc_lock" spinlock. If no
+	 * callback is currently executing it is equal to "NULL".
+	 */
 	struct callout		*cc_curr;
-#ifdef SMP
-	void			(*ce_migration_func)(void *);
-	void			*ce_migration_arg;
-	int			ce_migration_cpu;
-	sbintime_t		ce_migration_time;
-	sbintime_t		ce_migration_prec;
-#endif
-	bool			cc_cancel;
-	bool			cc_waiting;
+	/*
+	 * The "cc_restart_args" structure holds the argument for a
+	 * deferred callback restart and is write protected by the
+	 * "cc_lock" spinlock. The structure is only valid if
+	 * "cc_restart" is "true". If "cc_restart" is "false" the
+	 * information in the "cc_restart_args" structure shall be
+	 * ignored.
+	 */
+	struct callout_args	cc_restart_args;
+	bool			cc_restart;
+	/*
+	 * The "cc_cancel" variable allows the currently pending
+	 * callback to be atomically cancelled. This field is write
+	 * protected by the "cc_lock" spinlock.
+	 */
+	bool cc_cancel;
+	/*
+	 * The "cc_drain_fn" points to a function which shall be
+	 * called with the argument stored in "cc_drain_arg" when an
+	 * asynchronous drain is performed. This field is write
+	 * protected by the "cc_lock" spinlock.
+	 */
+	callout_func_t *cc_drain_fn;
+	void *cc_drain_arg;
 };
 
 /*
- * There is one struct callout_cpu per cpu, holding all relevant
+ * There is one "struct callout_cpu" per CPU, holding all relevant
  * state for the callout processing thread on the individual CPU.
  */
 struct callout_cpu {
 	struct mtx_padalign	cc_lock;
 	struct cc_exec 		cc_exec_entity[2];
+	struct callout		*cc_exec_next_dir;
 	struct callout		*cc_callout;
 	struct callout_list	*cc_callwheel;
 	struct callout_tailq	cc_expireq;
@@ -166,27 +186,23 @@
 	char			cc_ktr_event_name[20];
 };
 
-#define	cc_exec_curr		cc_exec_entity[0].cc_curr
-#define	cc_exec_next		cc_exec_entity[0].cc_next
-#define	cc_exec_cancel		cc_exec_entity[0].cc_cancel
-#define	cc_exec_waiting		cc_exec_entity[0].cc_waiting
-#define	cc_exec_curr_dir	cc_exec_entity[1].cc_curr
-#define	cc_exec_next_dir	cc_exec_entity[1].cc_next
-#define	cc_exec_cancel_dir	cc_exec_entity[1].cc_cancel
-#define	cc_exec_waiting_dir	cc_exec_entity[1].cc_waiting
+/*
+ * When "CALLOUT_RDONLY_C_CPU(c)" returns "true" it indicates the
+ * given callback doesn't have any locks associated with it or are
+ * using the Giant lock or a shared kind of lock. For these kind of
+ * callbacks the "c_cpu" field must remain constant, hence there is no
+ * lock to protect updates to this field.
+ *
+ * When "CALLOUT_RDONLY_C_CPU(c)" returns "false" it indicates that
+ * "c->c_lock" is protecting any changes to "c->c_cpu", for the
+ * specified callback.
+ */
+#define	CALLOUT_RDONLY_C_CPU(c) \
+	((c)->c_lock == NULL || (c)->c_lock == &Giant.lock_object || \
+	((c)->c_flags & (CALLOUT_LOCAL_ALLOC | CALLOUT_RETURNUNLOCKED | \
+	CALLOUT_SHAREDLOCK)) != 0)
 
 #ifdef SMP
-#define	cc_migration_func	cc_exec_entity[0].ce_migration_func
-#define	cc_migration_arg	cc_exec_entity[0].ce_migration_arg
-#define	cc_migration_cpu	cc_exec_entity[0].ce_migration_cpu
-#define	cc_migration_time	cc_exec_entity[0].ce_migration_time
-#define	cc_migration_prec	cc_exec_entity[0].ce_migration_prec
-#define	cc_migration_func_dir	cc_exec_entity[1].ce_migration_func
-#define	cc_migration_arg_dir	cc_exec_entity[1].ce_migration_arg
-#define	cc_migration_cpu_dir	cc_exec_entity[1].ce_migration_cpu
-#define	cc_migration_time_dir	cc_exec_entity[1].ce_migration_time
-#define	cc_migration_prec_dir	cc_exec_entity[1].ce_migration_prec
-
 struct callout_cpu cc_cpu[MAXCPU];
 #define	CPUBLOCK	MAXCPU
 #define	CC_CPU(cpu)	(&cc_cpu[(cpu)])
@@ -211,62 +227,11 @@
 
 static MALLOC_DEFINE(M_CALLOUT, "callout", "Callout datastructures");
 
-/**
- * Locked by cc_lock:
- *   cc_curr         - If a callout is in progress, it is cc_curr.
- *                     If cc_curr is non-NULL, threads waiting in
- *                     callout_drain() will be woken up as soon as the
- *                     relevant callout completes.
- *   cc_cancel       - Changing to 1 with both callout_lock and cc_lock held
- *                     guarantees that the current callout will not run.
- *                     The softclock() function sets this to 0 before it
- *                     drops callout_lock to acquire c_lock, and it calls
- *                     the handler only if curr_cancelled is still 0 after
- *                     cc_lock is successfully acquired.
- *   cc_waiting      - If a thread is waiting in callout_drain(), then
- *                     callout_wait is nonzero.  Set only when
- *                     cc_curr is non-NULL.
- */
-
 /*
- * Resets the execution entity tied to a specific callout cpu.
+ * Kernel low level callwheel initialization called from cpu0 during
+ * kernel startup:
  */
 static void
-cc_cce_cleanup(struct callout_cpu *cc, int direct)
-{
-
-	cc->cc_exec_entity[direct].cc_curr = NULL;
-	cc->cc_exec_entity[direct].cc_next = NULL;
-	cc->cc_exec_entity[direct].cc_cancel = false;
-	cc->cc_exec_entity[direct].cc_waiting = false;
-#ifdef SMP
-	cc->cc_exec_entity[direct].ce_migration_cpu = CPUBLOCK;
-	cc->cc_exec_entity[direct].ce_migration_time = 0;
-	cc->cc_exec_entity[direct].ce_migration_prec = 0;
-	cc->cc_exec_entity[direct].ce_migration_func = NULL;
-	cc->cc_exec_entity[direct].ce_migration_arg = NULL;
-#endif
-}
-
-/*
- * Checks if migration is requested by a specific callout cpu.
- */
-static int
-cc_cce_migrating(struct callout_cpu *cc, int direct)
-{
-
-#ifdef SMP
-	return (cc->cc_exec_entity[direct].ce_migration_cpu != CPUBLOCK);
-#else
-	return (0);
-#endif
-}
-
-/*
- * Kernel low level callwheel initialization
- * called on cpu0 during kernel startup.
- */
-static void
 callout_callwheel_init(void *dummy)
 {
 	struct callout_cpu *cc;
@@ -324,8 +289,6 @@
 		LIST_INIT(&cc->cc_callwheel[i]);
 	TAILQ_INIT(&cc->cc_expireq);
 	cc->cc_firstevent = SBT_MAX;
-	for (i = 0; i < 2; i++)
-		cc_cce_cleanup(cc, i);
 	snprintf(cc->cc_ktr_event_name, sizeof(cc->cc_ktr_event_name),
 	    "callwheel cpu %d", cpu);
 	if (cc->cc_callout == NULL)	/* Only cpu0 handles timeout(9) */
@@ -338,37 +301,7 @@
 	}
 }
 
-#ifdef SMP
 /*
- * Switches the cpu tied to a specific callout.
- * The function expects a locked incoming callout cpu and returns with
- * locked outcoming callout cpu.
- */
-static struct callout_cpu *
-callout_cpu_switch(struct callout *c, struct callout_cpu *cc, int new_cpu)
-{
-	struct callout_cpu *new_cc;
-
-	MPASS(c != NULL && cc != NULL);
-	CC_LOCK_ASSERT(cc);
-
-	/*
-	 * Avoid interrupts and preemption firing after the callout cpu
-	 * is blocked in order to avoid deadlocks as the new thread
-	 * may be willing to acquire the callout cpu lock.
-	 */
-	c->c_cpu = CPUBLOCK;
-	spinlock_enter();
-	CC_UNLOCK(cc);
-	new_cc = CC_CPU(new_cpu);
-	CC_LOCK(new_cc);
-	spinlock_exit();
-	c->c_cpu = new_cpu;
-	return (new_cc);
-}
-#endif
-
-/*
  * Start standard softclock thread.
  */
 static void
@@ -562,49 +495,54 @@
 callout_lock(struct callout *c)
 {
 	struct callout_cpu *cc;
-	int cpu;
-
-	for (;;) {
-		cpu = c->c_cpu;
-#ifdef SMP
-		if (cpu == CPUBLOCK) {
-			while (c->c_cpu == CPUBLOCK)
-				cpu_spinwait();
-			continue;
-		}
-#endif
-		cc = CC_CPU(cpu);
-		CC_LOCK(cc);
-		if (cpu == c->c_cpu)
-			break;
-		CC_UNLOCK(cc);
-	}
+	cc = CC_CPU(c->c_cpu);
+	CC_LOCK(cc);
 	return (cc);
 }
 
-static void
-callout_cc_add(struct callout *c, struct callout_cpu *cc,
-    sbintime_t sbt, sbintime_t precision, void (*func)(void *),
-    void *arg, int cpu, int flags)
+static struct callout_cpu *
+callout_cc_add_locked(struct callout *c, struct callout_cpu *cc,
+    struct callout_args *coa)
 {
+#ifndef NO_EVENTTIMERS
+	sbintime_t sbt;
+#endif
 	int bucket;
 
 	CC_LOCK_ASSERT(cc);
-	if (sbt < cc->cc_lastscan)
-		sbt = cc->cc_lastscan;
-	c->c_arg = arg;
-	c->c_flags |= (CALLOUT_ACTIVE | CALLOUT_PENDING);
-	if (flags & C_DIRECT_EXEC)
-		c->c_flags |= CALLOUT_DIRECT;
-	c->c_flags &= ~CALLOUT_PROCESSED;
-	c->c_func = func;
-	c->c_time = sbt;
-	c->c_precision = precision;
+
+	/* update flags before swapping locks, if any */
+	c->c_flags &= ~(CALLOUT_PROCESSED | CALLOUT_DIRECT | CALLOUT_DEFRESTART);
+	if (coa->flags & C_DIRECT_EXEC)
+		c->c_flags |= (CALLOUT_ACTIVE | CALLOUT_PENDING | CALLOUT_DIRECT);
+	else
+		c->c_flags |= (CALLOUT_ACTIVE | CALLOUT_PENDING);
+
+#ifdef SMP
+	/*
+	 * Check if we are changing the CPU on which the callback
+	 * should be executed:
+	 */
+	if (coa->cpu != c->c_cpu) {
+		CC_UNLOCK(cc);
+		c->c_cpu = coa->cpu;
+		cc = callout_lock(c);
+	}
+#endif
+	if (coa->time < cc->cc_lastscan)
+		coa->time = cc->cc_lastscan;
+	c->c_arg = coa->arg;
+	c->c_func = coa->func;
+	c->c_time = coa->time;
+	c->c_precision = coa->precision;
+
 	bucket = callout_get_bucket(c->c_time);
 	CTR3(KTR_CALLOUT, "precision set for %p: %d.%08x",
 	    c, (int)(c->c_precision >> 32),
 	    (u_int)(c->c_precision & 0xffffffff));
 	LIST_INSERT_HEAD(&cc->cc_callwheel[bucket], c, c_links.le);
+
+	/* Ensure we are first to be scanned, if called via a callback */
 	if (cc->cc_bucket == bucket)
 		cc->cc_exec_next_dir = c;
 #ifndef NO_EVENTTIMERS
@@ -617,9 +555,10 @@
 	sbt = c->c_time + c->c_precision;
 	if (sbt < cc->cc_firstevent) {
 		cc->cc_firstevent = sbt;
-		cpu_new_callout(cpu, sbt, c->c_time);
+		cpu_new_callout(coa->cpu, sbt, c->c_time);
 	}
 #endif
+	return (cc);
 }
 
 static void
@@ -626,8 +565,6 @@
 callout_cc_del(struct callout *c, struct callout_cpu *cc)
 {
 
-	if ((c->c_flags & CALLOUT_LOCAL_ALLOC) == 0)
-		return;
 	c->c_func = NULL;
 	SLIST_INSERT_HEAD(&cc->cc_callfree, c, c_links.sle);
 }
@@ -640,19 +577,13 @@
     int direct)
 {
 	struct rm_priotracker tracker;
-	void (*c_func)(void *);
+	struct callout_cpu *cc_orig = cc;
+	callout_func_t *c_func;
 	void *c_arg;
 	struct lock_class *class;
 	struct lock_object *c_lock;
 	uintptr_t lock_status;
 	int c_flags;
-#ifdef SMP
-	struct callout_cpu *new_cc;
-	void (*new_func)(void *);
-	void *new_arg;
-	int flags, new_cpu;
-	sbintime_t new_prec, new_time;
-#endif
 #if defined(DIAGNOSTIC) || defined(CALLOUT_PROFILING) 
 	sbintime_t sbt1, sbt2;
 	struct timespec ts2;
@@ -675,25 +606,36 @@
 	c_func = c->c_func;
 	c_arg = c->c_arg;
 	c_flags = c->c_flags;
-	if (c->c_flags & CALLOUT_LOCAL_ALLOC)
+	if (c_flags & CALLOUT_LOCAL_ALLOC)
 		c->c_flags = CALLOUT_LOCAL_ALLOC;
 	else
 		c->c_flags &= ~CALLOUT_PENDING;
+
+	/* reset our local state */
 	cc->cc_exec_entity[direct].cc_curr = c;
 	cc->cc_exec_entity[direct].cc_cancel = false;
+	cc->cc_exec_entity[direct].cc_restart = false;
+	cc->cc_exec_entity[direct].cc_drain_fn = NULL;
+	cc->cc_exec_entity[direct].cc_drain_arg = NULL;
 	CC_UNLOCK(cc);
+
+	/* unlocked region for switching locks, if any */
+
 	if (c_lock != NULL) {
 		class->lc_lock(c_lock, lock_status);
 		/*
-		 * The callout may have been cancelled
-		 * while we switched locks.
+		 * Check if the callout may have been cancelled while
+		 * we were switching locks:
 		 */
 		if (cc->cc_exec_entity[direct].cc_cancel) {
-			class->lc_unlock(c_lock);
-			goto skip;
+			if ((c_flags & CALLOUT_RETURNUNLOCKED) != 0)
+				class->lc_unlock(c_lock);
+			goto skip_callback;
 		}
-		/* The callout cannot be stopped now. */
+
+		/* The callout cannot be stopped now! */
 		cc->cc_exec_entity[direct].cc_cancel = true;
+
 		if (c_lock == &Giant.lock_object) {
 #ifdef CALLOUT_PROFILING
 			(*gcalls)++;
@@ -740,85 +682,49 @@
 #endif
 	KTR_STATE0(KTR_SCHED, "callout", cc->cc_ktr_event_name, "idle");
 	CTR1(KTR_CALLOUT, "callout %p finished", c);
-	if ((c_flags & CALLOUT_RETURNUNLOCKED) == 0)
-		class->lc_unlock(c_lock);
-skip:
+
+skip_callback:
 	CC_LOCK(cc);
 	KASSERT(cc->cc_exec_entity[direct].cc_curr == c, ("mishandled cc_curr"));
 	cc->cc_exec_entity[direct].cc_curr = NULL;
-	if (cc->cc_exec_entity[direct].cc_waiting) {
-		/*
-		 * There is someone waiting for the
-		 * callout to complete.
-		 * If the callout was scheduled for
-		 * migration just cancel it.
-		 */
-		if (cc_cce_migrating(cc, direct)) {
-			cc_cce_cleanup(cc, direct);
 
-			/*
-			 * It should be assert here that the callout is not
-			 * destroyed but that is not easy.
-			 */
-			c->c_flags &= ~CALLOUT_DFRMIGRATION;
-		}
-		cc->cc_exec_entity[direct].cc_waiting = false;
-		CC_UNLOCK(cc);
-		wakeup(&cc->cc_exec_entity[direct].cc_waiting);
-		CC_LOCK(cc);
-	} else if (cc_cce_migrating(cc, direct)) {
-		KASSERT((c_flags & CALLOUT_LOCAL_ALLOC) == 0,
-		    ("Migrating legacy callout %p", c));
-#ifdef SMP
-		/*
-		 * If the callout was scheduled for
-		 * migration just perform it now.
-		 */
-		new_cpu = cc->cc_exec_entity[direct].ce_migration_cpu;
-		new_time = cc->cc_exec_entity[direct].ce_migration_time;
-		new_prec = cc->cc_exec_entity[direct].ce_migration_prec;
-		new_func = cc->cc_exec_entity[direct].ce_migration_func;
-		new_arg = cc->cc_exec_entity[direct].ce_migration_arg;
-		cc_cce_cleanup(cc, direct);
+	/*
+	 * At this point the callback structure might have been freed,
+	 * so we need to check the copied "c->c_flags" to figure out
+	 * if the callback is preallocated or not!
+	 */
+	if (c_flags & CALLOUT_LOCAL_ALLOC) {
+		/* return callout back to freelist */
+		callout_cc_del(c, cc);
+	} else if (cc->cc_exec_entity[direct].cc_restart) {
+		/* [re-]schedule callout, if any */
+		cc = callout_cc_add_locked(c, cc,
+		    &cc->cc_exec_entity[direct].cc_restart_args);
+	}
 
-		/*
-		 * It should be assert here that the callout is not destroyed
-		 * but that is not easy.
-		 *
-		 * As first thing, handle deferred callout stops.
-		 */
-		if ((c->c_flags & CALLOUT_DFRMIGRATION) == 0) {
-			CTR3(KTR_CALLOUT,
-			     "deferred cancelled %p func %p arg %p",
-			     c, new_func, new_arg);
-			callout_cc_del(c, cc);
-			return;
-		}
-		c->c_flags &= ~CALLOUT_DFRMIGRATION;
+	/*
+	 * Unlock the callback lock at the end, if any, so that the
+	 * "c_cpu" field in the callback structure gets write
+	 * protected by this lock.
+	 */
+	if ((c_flags & CALLOUT_RETURNUNLOCKED) == 0)
+		class->lc_unlock(c_lock);
 
-		new_cc = callout_cpu_switch(c, cc, new_cpu);
-		flags = (direct) ? C_DIRECT_EXEC : 0;
-		callout_cc_add(c, new_cc, new_time, new_prec, new_func,
-		    new_arg, new_cpu, flags);
-		CC_UNLOCK(new_cc);
-		CC_LOCK(cc);
-#else
-		panic("migration should not happen");
-#endif
-	}
 	/*
-	 * If the current callout is locally allocated (from
-	 * timeout(9)) then put it on the freelist.
-	 *
-	 * Note: we need to check the cached copy of c_flags because
-	 * if it was not local, then it's not safe to deref the
-	 * callout pointer.
+	 * Check if there is anything which needs draining. There is
+	 * no need to lock any locks here, hence the calling thread is
+	 * the only one writing these fields!
 	 */
-	KASSERT((c_flags & CALLOUT_LOCAL_ALLOC) == 0 ||
-	    c->c_flags == CALLOUT_LOCAL_ALLOC,
-	    ("corrupted callout"));
-	if (c_flags & CALLOUT_LOCAL_ALLOC)
-		callout_cc_del(c, cc);
+	if (cc->cc_exec_entity[direct].cc_drain_fn != NULL) {
+		CC_UNLOCK(cc);
+		cc->cc_exec_entity[direct].cc_drain_fn(
+		    cc->cc_exec_entity[direct].cc_drain_arg);
+		CC_LOCK(cc_orig);
+	} else if (cc_orig != cc) {
+		/* switch back locks again, must return locked */
+		CC_UNLOCK(cc);
+		CC_LOCK(cc_orig);
+	}
 }
 
 /*
@@ -899,10 +805,11 @@
 		/* XXX Attempt to malloc first */
 		panic("timeout table full");
 	SLIST_REMOVE_HEAD(&cc->cc_callfree, c_links.sle);
-	callout_reset(new, to_ticks, ftn, arg);
 	handle.callout = new;
 	CC_UNLOCK(cc);
 
+	callout_reset(new, to_ticks, ftn, arg);
+
 	return (handle);
 }
 
@@ -910,6 +817,7 @@
 untimeout(timeout_t *ftn, void *arg, struct callout_handle handle)
 {
 	struct callout_cpu *cc;
+	bool match;
 
 	/*
 	 * Check for a handle that was initialized
@@ -920,9 +828,11 @@
 		return;
 
 	cc = callout_lock(handle.callout);
-	if (handle.callout->c_func == ftn && handle.callout->c_arg == arg)
+	match = (handle.callout->c_func == ftn && handle.callout->c_arg == arg);
+	CC_UNLOCK(cc);
+
+	if (match)
 		callout_stop(handle.callout);
-	CC_UNLOCK(cc);
 }
 
 void
@@ -931,6 +841,143 @@
 	handle->callout = NULL;
 }
 
+#if defined(DIAGNOSTIC)
+static void
+callout_assert_locks(struct callout *c)
+{
+	struct lock_class *class;
+	/*
+	 * Some old subsystems don't hold Giant while stopping and
+	 * starting callouts. Discard any asserts if the Giant lock is
+	 * specified.
+	 */
+	if (c->c_lock == NULL || c->c_lock == &Giant.lock_object)
+		return;
+	class = LOCK_CLASS(c->c_lock);
+	class->lc_assert(c->c_lock, LA_XLOCKED);
+}
+#endif
+
+static int
+callout_restart_async(struct callout *c, struct callout_args *coa,
+    callout_func_t *drain_fn, void *drain_arg)
+{
+	struct callout_cpu *cc;
+	int cancelled;
+	int direct;
+
+#if defined(DIAGNOSTIC)
+	callout_assert_locks(c);
+#endif
+	cc = callout_lock(c);
+
+#ifdef SMP
+	/* don't change CPU, if any */
+	if (coa != NULL && CALLOUT_RDONLY_C_CPU(c))
+		coa->cpu = c->c_cpu;
+#endif
+	/* Figure out if the callout is direct or not */
+	direct = ((c->c_flags & CALLOUT_DIRECT) != 0);
+
+	/*
+	 * Check if the callback is currently scheduled for
+	 * completion:
+	 */
+	if (cc->cc_exec_entity[direct].cc_curr == c) {
+		/*
+		 * Try to prevent the callback from running by setting
+		 * the "cc_cancel" variable to "true". Also check if
+		 * the callout was previously subject to a deferred
+		 * callout restart:
+		 */
+		if (cc->cc_exec_entity[direct].cc_cancel == false ||
+		    (c->c_flags & CALLOUT_DEFRESTART) != 0) {
+			cc->cc_exec_entity[direct].cc_cancel = true;
+			cancelled = 1;
+		} else {
+			cancelled = 0;
+		}
+
+		/*
+		 * Prevent callback restart if "callout_drain_xxx()"
+		 * is being called or we are stopping the callout or
+		 * the callback was preallocated by us:
+		 */
+		if (cc->cc_exec_entity[direct].cc_drain_fn != NULL ||
+		    coa == NULL || (c->c_flags & CALLOUT_LOCAL_ALLOC) != 0) {
+			CTR4(KTR_CALLOUT, "%s %p func %p arg %p",
+			    cancelled ? "cancelled and draining" : "draining",
+			    c, c->c_func, c->c_arg);
+
+			/* set drain function, if any */
+			if (drain_fn != NULL) {
+				cc->cc_exec_entity[direct].cc_drain_fn = drain_fn;
+				cc->cc_exec_entity[direct].cc_drain_arg = drain_arg;
+				cancelled |= 2;		/* XXX define the value */
+			}
+			/* clear old flags, if any */
+			c->c_flags &= ~(CALLOUT_ACTIVE | CALLOUT_PENDING |
+			    CALLOUT_DEFRESTART | CALLOUT_PROCESSED);
+
+			/* clear restart flag, if any */
+			cc->cc_exec_entity[direct].cc_restart = false;
+		} else {
+			CTR4(KTR_CALLOUT, "%s %p func %p arg %p",
+			    cancelled ? "cancelled and restarting" : "restarting",
+			    c, c->c_func, c->c_arg);
+
+			/* get us back into the game */
+			c->c_flags |= (CALLOUT_ACTIVE | CALLOUT_PENDING |
+			    CALLOUT_DEFRESTART);
+			c->c_flags &= ~CALLOUT_PROCESSED;
+
+			/* enable deferred restart */
+			cc->cc_exec_entity[direct].cc_restart = true;
+
+			/* store arguments for the deferred restart, if any */
+			cc->cc_exec_entity[direct].cc_restart_args = *coa;
+		}
+	} else {
+		/* stop callout */
+		if (c->c_flags & CALLOUT_PENDING) {
+			/*
+			 * The callback has not yet been executed, and
+			 * we simply just need to unlink it:
+			 */
+			if ((c->c_flags & CALLOUT_PROCESSED) == 0) {
+				if (cc->cc_exec_next_dir == c)
+					cc->cc_exec_next_dir = LIST_NEXT(c, c_links.le);
+				LIST_REMOVE(c, c_links.le);
+			} else {
+				TAILQ_REMOVE(&cc->cc_expireq, c, c_links.tqe);
+			}
+			cancelled = 1;
+		} else {
+			cancelled = 0;
+		}
+
+		/* [re-]schedule callout, if any */
+		if (coa != NULL) {
+			cc = callout_cc_add_locked(c, cc, coa);
+		} else {
+			/* clear old flags, if any */
+			c->c_flags &= ~(CALLOUT_ACTIVE | CALLOUT_PENDING |
+			    CALLOUT_DEFRESTART | CALLOUT_PROCESSED);
+
+			/* return callback to pre-allocated list, if any */
+			if ((c->c_flags & CALLOUT_LOCAL_ALLOC) && cancelled != 0) {
+				callout_cc_del(c, cc);
+			}
+		}
+
+		CTR4(KTR_CALLOUT, "%s %p func %p arg %p",
+		    cancelled ? "rescheduled" : "scheduled",
+		    c, c->c_func, c->c_arg);
+	}
+	CC_UNLOCK(cc);
+	return (cancelled);
+}
+
 /*
  * New interface; clients allocate their own callout structures.
  *
@@ -949,25 +996,32 @@
  */
 int
 callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision,
-    void (*ftn)(void *), void *arg, int cpu, int flags)
+    callout_func_t *ftn, void *arg, int cpu, int flags)
 {
-	sbintime_t to_sbt, pr;
-	struct callout_cpu *cc;
-	int cancelled, direct;
+	struct callout_args coa;
 
-	cancelled = 0;
-	if (flags & C_ABSOLUTE) {
-		to_sbt = sbt;
+	/* store arguments for callout add function */
+	coa.func = ftn;
+	coa.arg = arg;
+	coa.precision = precision;
+	coa.flags = flags;
+	coa.cpu = cpu;
+
+	/* compute the rest of the arguments needed */
+	if (coa.flags & C_ABSOLUTE) {
+		coa.time = sbt;
 	} else {
-		if ((flags & C_HARDCLOCK) && (sbt < tick_sbt))
+		sbintime_t pr;
+
+		if ((coa.flags & C_HARDCLOCK) && (sbt < tick_sbt))
 			sbt = tick_sbt;
-		if ((flags & C_HARDCLOCK) ||
+		if ((coa.flags & C_HARDCLOCK) ||
 #ifdef NO_EVENTTIMERS
 		    sbt >= sbt_timethreshold) {
-			to_sbt = getsbinuptime();
+			coa.time = getsbinuptime();
 
 			/* Add safety belt for the case of hz > 1000. */
-			to_sbt += tc_tick_sbt - tick_sbt;
+			coa.time += tc_tick_sbt - tick_sbt;
 #else
 		    sbt >= sbt_tickthreshold) {
 			/*
@@ -977,101 +1031,29 @@
 			 * active ones.
 			 */
 #ifdef __LP64__
-			to_sbt = DPCPU_GET(hardclocktime);
+			coa.time = DPCPU_GET(hardclocktime);
 #else
 			spinlock_enter();
-			to_sbt = DPCPU_GET(hardclocktime);
+			coa.time = DPCPU_GET(hardclocktime);
 			spinlock_exit();
 #endif
 #endif
-			if ((flags & C_HARDCLOCK) == 0)
-				to_sbt += tick_sbt;
+			if ((coa.flags & C_HARDCLOCK) == 0)
+				coa.time += tick_sbt;
 		} else
-			to_sbt = sbinuptime();
-		if (SBT_MAX - to_sbt < sbt)
-			to_sbt = SBT_MAX;
+			coa.time = sbinuptime();
+		if (SBT_MAX - coa.time < sbt)
+			coa.time = SBT_MAX;
 		else
-			to_sbt += sbt;
-		pr = ((C_PRELGET(flags) < 0) ? sbt >> tc_precexp :
-		    sbt >> C_PRELGET(flags));
-		if (pr > precision)
-			precision = pr;
+			coa.time += sbt;
+		pr = ((C_PRELGET(coa.flags) < 0) ? sbt >> tc_precexp :
+		    sbt >> C_PRELGET(coa.flags));
+		if (pr > coa.precision)
+			coa.precision = pr;
 	}
-	/*
-	 * Don't allow migration of pre-allocated callouts lest they
-	 * become unbalanced.
-	 */
-	if (c->c_flags & CALLOUT_LOCAL_ALLOC)
-		cpu = c->c_cpu;
-	direct = (c->c_flags & CALLOUT_DIRECT) != 0;
-	KASSERT(!direct || c->c_lock == NULL,
-	    ("%s: direct callout %p has lock", __func__, c));
-	cc = callout_lock(c);
-	if (cc->cc_exec_entity[direct].cc_curr == c) {
-		/*
-		 * We're being asked to reschedule a callout which is
-		 * currently in progress.  If there is a lock then we
-		 * can cancel the callout if it has not really started.
-		 */
-		if (c->c_lock != NULL && !cc->cc_exec_entity[direct].cc_cancel)
-			cancelled = cc->cc_exec_entity[direct].cc_cancel = true;
-		if (cc->cc_exec_entity[direct].cc_waiting) {
-			/*
-			 * Someone has called callout_drain to kill this
-			 * callout.  Don't reschedule.
-			 */
-			CTR4(KTR_CALLOUT, "%s %p func %p arg %p",
-			    cancelled ? "cancelled" : "failed to cancel",
-			    c, c->c_func, c->c_arg);
-			CC_UNLOCK(cc);
-			return (cancelled);
-		}
-	}
-	if (c->c_flags & CALLOUT_PENDING) {
-		if ((c->c_flags & CALLOUT_PROCESSED) == 0) {
-			if (cc->cc_exec_next_dir == c)
-				cc->cc_exec_next_dir = LIST_NEXT(c, c_links.le);
-			LIST_REMOVE(c, c_links.le);
-		} else
-			TAILQ_REMOVE(&cc->cc_expireq, c, c_links.tqe);
-		cancelled = 1;
-		c->c_flags &= ~(CALLOUT_ACTIVE | CALLOUT_PENDING);
-	}
 
-#ifdef SMP
-	/*
-	 * If the callout must migrate try to perform it immediately.
-	 * If the callout is currently running, just defer the migration
-	 * to a more appropriate moment.
-	 */
-	if (c->c_cpu != cpu) {
-		if (cc->cc_exec_entity[direct].cc_curr == c) {
-			cc->cc_exec_entity[direct].ce_migration_cpu = cpu;
-			cc->cc_exec_entity[direct].ce_migration_time
-			    = to_sbt;
-			cc->cc_exec_entity[direct].ce_migration_prec 
-			    = precision;
-			cc->cc_exec_entity[direct].ce_migration_func = ftn;
-			cc->cc_exec_entity[direct].ce_migration_arg = arg;
-			c->c_flags |= CALLOUT_DFRMIGRATION;
-			CTR6(KTR_CALLOUT,
-		    "migration of %p func %p arg %p in %d.%08x to %u deferred",
-			    c, c->c_func, c->c_arg, (int)(to_sbt >> 32),
-			    (u_int)(to_sbt & 0xffffffff), cpu);
-			CC_UNLOCK(cc);
-			return (cancelled);
-		}
-		cc = callout_cpu_switch(c, cc, cpu);
-	}
-#endif
-
-	callout_cc_add(c, cc, to_sbt, precision, ftn, arg, cpu, flags);
-	CTR6(KTR_CALLOUT, "%sscheduled %p func %p arg %p in %d.%08x",
-	    cancelled ? "re" : "", c, c->c_func, c->c_arg, (int)(to_sbt >> 32),
-	    (u_int)(to_sbt & 0xffffffff));
-	CC_UNLOCK(cc);
-
-	return (cancelled);
+	/* get callback started, if any */
+	return (callout_restart_async(c, &coa, NULL, NULL));
 }
 
 /*
@@ -1090,175 +1072,74 @@
 }
 
 int
-_callout_stop_safe(struct callout *c, int safe)
+callout_stop(struct callout *c)
 {
-	struct callout_cpu *cc, *old_cc;
+	/* get callback stopped, if any */
+	return (callout_restart_async(c, NULL, NULL, NULL));
+}
+
+static void
+callout_drain_function(void *arg)
+{
+	wakeup(arg);
+}
+
+int
+callout_drain_async(struct callout *c, callout_func_t *fn, void *arg)
+{
+	/* get callback stopped, if any */
+	return (callout_restart_async(c, NULL, fn, arg) & 2);
+}
+
+int
+callout_drain(struct callout *c)
+{
 	struct lock_class *class;
-	int direct, sq_locked, use_lock;
+	int cancelled;
 
-	/*
-	 * Some old subsystems don't hold Giant while running a callout_stop(),
-	 * so just discard this check for the moment.
-	 */
-	if (!safe && c->c_lock != NULL) {
-		if (c->c_lock == &Giant.lock_object)
-			use_lock = mtx_owned(&Giant);
-		else {
-			use_lock = 1;
-			class = LOCK_CLASS(c->c_lock);
-			class->lc_assert(c->c_lock, LA_XLOCKED);
-		}
-	} else
-		use_lock = 0;
-	direct = (c->c_flags & CALLOUT_DIRECT) != 0;
-	sq_locked = 0;
-	old_cc = NULL;
-again:
-	cc = callout_lock(c);
-
-	/*
-	 * If the callout was migrating while the callout cpu lock was
-	 * dropped,  just drop the sleepqueue lock and check the states
-	 * again.
-	 */
-	if (sq_locked != 0 && cc != old_cc) {
-#ifdef SMP
-		CC_UNLOCK(cc);
-		sleepq_release(&old_cc->cc_exec_entity[direct].cc_waiting);
-		sq_locked = 0;
-		old_cc = NULL;
-		goto again;
-#else
-		panic("migration should not happen");
-#endif
+	if (c->c_lock != NULL && c->c_lock != &Giant.lock_object) {
+		class = LOCK_CLASS(c->c_lock);
+		class->lc_lock(c->c_lock, 0);
+	} else {
+		class = NULL;
 	}
 
-	/*
-	 * If the callout isn't pending, it's not on the queue, so
-	 * don't attempt to remove it from the queue.  We can try to
-	 * stop it by other means however.
-	 */
-	if (!(c->c_flags & CALLOUT_PENDING)) {
-		c->c_flags &= ~CALLOUT_ACTIVE;
+	/* at this point the "c->c_cpu" field is not changing */
 
+	cancelled = callout_restart_async(c, NULL, &callout_drain_function, c);
+
+	if (cancelled & 2) {
+		struct callout_cpu *cc;
+		int direct;
+
+		CTR3(KTR_CALLOUT, "need to drain %p func %p arg %p",
+		    c, c->c_func, c->c_arg);
+
+		cc = callout_lock(c);
+		direct = ((c->c_flags & CALLOUT_DIRECT) != 0);
+
 		/*
-		 * If it wasn't on the queue and it isn't the current
-		 * callout, then we can't stop it, so just bail.
+		 * We now hold the callout CPU lock, so it is safe to
+		 * drop the initial lock.
 		 */
-		if (cc->cc_exec_entity[direct].cc_curr != c) {
-			CTR3(KTR_CALLOUT, "failed to stop %p func %p arg %p",
-			    c, c->c_func, c->c_arg);
-			CC_UNLOCK(cc);
-			if (sq_locked)
-				sleepq_release(
-				    &cc->cc_exec_entity[direct].cc_waiting);
-			return (0);
-		}
+		if (class != NULL)
+			class->lc_unlock(c->c_lock);
 
-		if (safe) {
-			/*
-			 * The current callout is running (or just
-			 * about to run) and blocking is allowed, so
-			 * just wait for the current invocation to
-			 * finish.
-			 */
-			while (cc->cc_exec_entity[direct].cc_curr == c) {
-				/*
-				 * Use direct calls to sleepqueue interface
-				 * instead of cv/msleep in order to avoid
-				 * a LOR between cc_lock and sleepqueue
-				 * chain spinlocks.  This piece of code
-				 * emulates a msleep_spin() call actually.
-				 *
-				 * If we already have the sleepqueue chain
-				 * locked, then we can safely block.  If we
-				 * don't already have it locked, however,
-				 * we have to drop the cc_lock to lock
-				 * it.  This opens several races, so we
-				 * restart at the beginning once we have
-				 * both locks.  If nothing has changed, then
-				 * we will end up back here with sq_locked
-				 * set.
-				 */
-				if (!sq_locked) {
-					CC_UNLOCK(cc);
-					sleepq_lock(
-					&cc->cc_exec_entity[direct].cc_waiting);
-					sq_locked = 1;
-					old_cc = cc;
-					goto again;
-				}
+		/* Wait for drain to complete */
 
-				/*
-				 * Migration could be cancelled here, but
-				 * as long as it is still not sure when it
-				 * will be packed up, just let softclock()
-				 * take care of it.
-				 */
-				cc->cc_exec_entity[direct].cc_waiting = true;
-				DROP_GIANT();
-				CC_UNLOCK(cc);
-				sleepq_add(
-				    &cc->cc_exec_entity[direct].cc_waiting,
-				    &cc->cc_lock.lock_object, "codrain",
-				    SLEEPQ_SLEEP, 0);
-				sleepq_wait(
-				    &cc->cc_exec_entity[direct].cc_waiting,
-					     0);
-				sq_locked = 0;
-				old_cc = NULL;
+		while (cc->cc_exec_entity[direct].cc_curr == c)
+			msleep_spin(c, (struct mtx *)&cc->cc_lock, "codrain", 0);
 
-				/* Reacquire locks previously released. */
-				PICKUP_GIANT();
-				CC_LOCK(cc);
-			}
-		} else if (use_lock &&
-			    !cc->cc_exec_entity[direct].cc_cancel) {
-			/*
-			 * The current callout is waiting for its
-			 * lock which we hold.  Cancel the callout
-			 * and return.  After our caller drops the
-			 * lock, the callout will be skipped in
-			 * softclock().
-			 */
-			cc->cc_exec_entity[direct].cc_cancel = true;
-			CTR3(KTR_CALLOUT, "cancelled %p func %p arg %p",
-			    c, c->c_func, c->c_arg);
-			KASSERT(!cc_cce_migrating(cc, direct),
-			    ("callout wrongly scheduled for migration"));
-			CC_UNLOCK(cc);
-			KASSERT(!sq_locked, ("sleepqueue chain locked"));
-			return (1);
-		} else if ((c->c_flags & CALLOUT_DFRMIGRATION) != 0) {
-			c->c_flags &= ~CALLOUT_DFRMIGRATION;
-			CTR3(KTR_CALLOUT, "postponing stop %p func %p arg %p",
-			    c, c->c_func, c->c_arg);
-			CC_UNLOCK(cc);
-			return (1);
-		}
-		CTR3(KTR_CALLOUT, "failed to stop %p func %p arg %p",
-		    c, c->c_func, c->c_arg);
 		CC_UNLOCK(cc);
-		KASSERT(!sq_locked, ("sleepqueue chain still locked"));
-		return (0);
+	} else {
+		if (class != NULL)
+			class->lc_unlock(c->c_lock);
 	}
-	if (sq_locked)
-		sleepq_release(&cc->cc_exec_entity[direct].cc_waiting);
 
-	c->c_flags &= ~(CALLOUT_ACTIVE | CALLOUT_PENDING);
-
 	CTR3(KTR_CALLOUT, "cancelled %p func %p arg %p",
 	    c, c->c_func, c->c_arg);
-	if ((c->c_flags & CALLOUT_PROCESSED) == 0) {
-		if (cc->cc_exec_next_dir == c)
-			cc->cc_exec_next_dir = LIST_NEXT(c, c_links.le);
-		LIST_REMOVE(c, c_links.le);
-	} else
-		TAILQ_REMOVE(&cc->cc_expireq, c, c_links.tqe);
-	callout_cc_del(c, cc);
 
-	CC_UNLOCK(cc);
-	return (1);
+	return (cancelled & 1);
 }
 
 void
Index: sys/sys/_callout.h
===================================================================
--- sys/sys/_callout.h	(revision 276240)
+++ sys/sys/_callout.h	(working copy)
@@ -46,6 +46,17 @@
 SLIST_HEAD(callout_slist, callout);
 TAILQ_HEAD(callout_tailq, callout);
 
+typedef void callout_func_t(void *);
+
+struct callout_args {
+	sbintime_t time;		/* absolute time for the event */
+	sbintime_t precision;		/* delta allowed wrt opt */
+	void	*arg;			/* function argument */
+	callout_func_t *func;		/* function to call */
+	int	flags;			/* flags passed to callout_reset() */
+	int	cpu;			/* CPU we're scheduled on */
+};
+
 struct callout {
 	union {
 		LIST_ENTRY(callout) le;
@@ -52,13 +63,13 @@
 		SLIST_ENTRY(callout) sle;
 		TAILQ_ENTRY(callout) tqe;
 	} c_links;
-	sbintime_t c_time;			/* ticks to the event */
+	sbintime_t c_time;			/* absolute time for the event */
 	sbintime_t c_precision;			/* delta allowed wrt opt */
 	void	*c_arg;				/* function argument */
-	void	(*c_func)(void *);		/* function to call */
-	struct lock_object *c_lock;		/* lock to handle */
+	callout_func_t *c_func;			/* function to call */
+	struct lock_object *c_lock;		/* callback lock */
 	int	c_flags;			/* state of this entry */
-	volatile int c_cpu;			/* CPU we're scheduled on */
+	int	c_cpu;				/* CPU we're scheduled on */
 };
 
 #endif
Index: sys/sys/callout.h
===================================================================
--- sys/sys/callout.h	(revision 276240)
+++ sys/sys/callout.h	(working copy)
@@ -46,7 +46,7 @@
 #define	CALLOUT_MPSAFE		0x0008 /* callout handler is mp safe */
 #define	CALLOUT_RETURNUNLOCKED	0x0010 /* handler returns with mtx unlocked */
 #define	CALLOUT_SHAREDLOCK	0x0020 /* callout lock held in shared mode */
-#define	CALLOUT_DFRMIGRATION	0x0040 /* callout in deferred migration mode */
+#define	CALLOUT_DEFRESTART	0x0040 /* callout restart is deferred */
 #define	CALLOUT_PROCESSED	0x0080 /* callout in wheel or processing list? */
 #define	CALLOUT_DIRECT 		0x0100 /* allow exec from hw int context */
 
@@ -65,7 +65,8 @@
 #ifdef _KERNEL
 #define	callout_active(c)	((c)->c_flags & CALLOUT_ACTIVE)
 #define	callout_deactivate(c)	((c)->c_flags &= ~CALLOUT_ACTIVE)
-#define	callout_drain(c)	_callout_stop_safe(c, 1)
+int	callout_drain(struct callout *);
+int	callout_drain_async(struct callout *, callout_func_t *, void *);
 void	callout_init(struct callout *, int);
 void	_callout_init_lock(struct callout *, struct lock_object *, int);
 #define	callout_init_mtx(c, mtx, flags)					\
@@ -79,7 +80,7 @@
 	   NULL, (flags))
 #define	callout_pending(c)	((c)->c_flags & CALLOUT_PENDING)
 int	callout_reset_sbt_on(struct callout *, sbintime_t, sbintime_t,
-	    void (*)(void *), void *, int, int);
+	    callout_func_t *, void *, int, int);
 #define	callout_reset_sbt(c, sbt, pr, fn, arg, flags)			\
     callout_reset_sbt_on((c), (sbt), (pr), (fn), (arg), (c)->c_cpu, (flags))
 #define	callout_reset_sbt_curcpu(c, sbt, pr, fn, arg, flags)		\
@@ -103,8 +104,7 @@
 int	callout_schedule_on(struct callout *, int, int);
 #define	callout_schedule_curcpu(c, on_tick)				\
     callout_schedule_on((c), (on_tick), PCPU_GET(cpuid))
-#define	callout_stop(c)		_callout_stop_safe(c, 0)
-int	_callout_stop_safe(struct callout *, int);
+int	callout_stop(struct callout *);
 void	callout_process(sbintime_t now);
 
 #endif

--------------080102030604030205080208--