From owner-freebsd-smp@FreeBSD.ORG Mon Mar 17 16:44:10 2008 Return-Path: Delivered-To: freebsd-smp@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0B423106566B; Mon, 17 Mar 2008 16:44:10 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 4BCC58FC1C; Mon, 17 Mar 2008 16:44:08 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 235801822-1834499 for multiple; Mon, 17 Mar 2008 12:42:28 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2HGi3qT078981; Mon, 17 Mar 2008 12:44:05 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-smp@freebsd.org Date: Mon, 17 Mar 2008 11:27:20 -0400 User-Agent: KMail/1.9.7 References: <20080315024114.GD67856@elvis.mu.org> In-Reply-To: <20080315024114.GD67856@elvis.mu.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803171127.20561.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Mon, 17 Mar 2008 12:44:05 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6275/Mon Mar 17 10:08:48 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: stable@freebsd.org, Alfred Perlstein Subject: Re: timeout/untimeout race conditions/crash [patch] X-BeenThere: freebsd-smp@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: FreeBSD SMP implementation group List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Mar 2008 16:44:10 -0000 On Friday 14 March 2008 10:41:14 pm Alfred Perlstein wrote: > We think we tracked down a defect in timeout/untimeout in > FreeBSD. > > We have reduced the problem to the following scenario: > > 2+ cpu system, one cpu is running softclock at the same time > another thread is running on another cpu which makes use of > timeout/untimeout. > > CPU 0 is running "softclock" > CPU 1 is running "driver" with Giant held. > > softclock: mtx_lock_spin(&callout_lock) > softclock: CACHES the callout structure's fields. > softclock: sees that it's a CALLOUT_LOCAL_ALLOC > softclock: executes this code: > if (c->c_flags & CALLOUT_LOCAL_ALLOC) { > c->c_func = NULL; > c->c_flags = CALLOUT_LOCAL_ALLOC; > SLIST_INSERT_HEAD(&callfree, c, > c_links.sle); > curr_callout = NULL; > } else { > > NOTE: that c->c_func has been set to NULL and curr_callout > is also NULL. > softclock: mtx_unlock_spin(&callout_lock) > driver: calls untimeout(), the following sequence happens: > mtx_lock_spin(&callout_lock); > if (handle.callout->c_func == ftn && handle.callout->c_arg == arg) > callout_stop(handle.callout); > mtx_unlock_spin(&callout_lock); > > NOTE: untimeout() sees that handle.callout->c_func is not set > to the function so it does NOT call callout_stop(9)! > driver: free's backing structure for c->c_arg. > softclock: executes callout. > softclock: likely crashes at this point due to access after free. > > I have a patch I'm trying out here, but I need feedback on it. > > The way the patch works is to treat CALLOUT_LOCAL_ALLOC (timeout/untimeout) > callouts the same as ~CALLOUT_LOCAL_ALLOC allocs, and moves the > freelist manipulation to the end of the callout dispatch. > > Some light testing seems to have the system work. > > We are doing some testing in-house to also make sure this works. > > Please provide feedback. > > See attached delta. This is not a bug. Don't use untimeout(9) as it is not guaranteed to be reliable. Instead, use callout_*(). Your patch doesn't solve any races as the driver detach routine needs to use callout_drain() and not just callout_stop/untimeout anyways. Fix your broken drivers. -- John Baldwin From owner-freebsd-smp@FreeBSD.ORG Mon Mar 17 20:26:53 2008 Return-Path: Delivered-To: freebsd-smp@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8D0F11065671 for ; Mon, 17 Mar 2008 20:26:53 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 7F7EE8FC16 for ; Mon, 17 Mar 2008 20:26:53 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id AE7C51A4D84; Mon, 17 Mar 2008 13:10:14 -0700 (PDT) Date: Mon, 17 Mar 2008 13:10:14 -0700 From: Alfred Perlstein To: John Baldwin Message-ID: <20080317201014.GA67856@elvis.mu.org> References: <20080315024114.GD67856@elvis.mu.org> <200803171127.20561.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200803171127.20561.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i Cc: stable@freebsd.org, freebsd-smp@freebsd.org Subject: Re: timeout/untimeout race conditions/crash [patch] X-BeenThere: freebsd-smp@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: FreeBSD SMP implementation group List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Mar 2008 20:26:53 -0000 * John Baldwin [080317 09:43] wrote: > > This is not a bug. Don't use untimeout(9) as it is not guaranteed to be > reliable. Instead, use callout_*(). Your patch doesn't solve any races as > the driver detach routine needs to use callout_drain() and not just > callout_stop/untimeout anyways. Fix your broken drivers. I understand that some old Giant locked code can issue timeout/untimeout without Giant held, which would certainly cause this issue to happen and is uncorrectable, however, this is with completely Giant locked code. We are not trying to use timeout(9) for mpsafe code, this is old code and relies upon Giant. Giant locked code should be timeout/untimeout safe. As explained in my email, there exists a condition where the Giant locked code can have a timer fire even though proper Giant locking is observed. For a Giant locked subsystem, one should be able to have the following code work: mtx_lock(&Giant); /* formerly spl higher than softclock */ untimeout(&func, arg, &sc->handle); free(sc); mtx_unlock(&Giant); /* formerly splx() */ Normally splsoftclock would completely block the timeout from firing and this sort of code would be safe. It is no longer safe due to a BUG in the way that Giant is used. Please reread the original mail to better understand the synopsis of the problem. thank you, -Alfred From owner-freebsd-smp@FreeBSD.ORG Mon Mar 17 22:29:51 2008 Return-Path: Delivered-To: freebsd-smp@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 24950106566B; Mon, 17 Mar 2008 22:29:51 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 2FEDB8FC20; Mon, 17 Mar 2008 22:29:50 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 235838216-1834499 for multiple; Mon, 17 Mar 2008 18:28:01 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2HMTbv0081565; Mon, 17 Mar 2008 18:29:37 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Alfred Perlstein Date: Mon, 17 Mar 2008 16:59:33 -0400 User-Agent: KMail/1.9.7 References: <20080315024114.GD67856@elvis.mu.org> <200803171127.20561.jhb@freebsd.org> <20080317201014.GA67856@elvis.mu.org> In-Reply-To: <20080317201014.GA67856@elvis.mu.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803171659.33547.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Mon, 17 Mar 2008 18:29:37 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6277/Mon Mar 17 14:08:59 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: stable@freebsd.org, freebsd-smp@freebsd.org Subject: Re: timeout/untimeout race conditions/crash [patch] X-BeenThere: freebsd-smp@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: FreeBSD SMP implementation group List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Mar 2008 22:29:51 -0000 On Monday 17 March 2008 04:10:14 pm Alfred Perlstein wrote: > * John Baldwin [080317 09:43] wrote: > > > > This is not a bug. Don't use untimeout(9) as it is not guaranteed to be > > reliable. Instead, use callout_*(). Your patch doesn't solve any races as > > the driver detach routine needs to use callout_drain() and not just > > callout_stop/untimeout anyways. Fix your broken drivers. > > I understand that some old Giant locked code can issue timeout/untimeout > without Giant held, which would certainly cause this issue to happen > and is uncorrectable, however, this is with completely Giant locked > code. > > We are not trying to use timeout(9) for mpsafe code, this is old > code and relies upon Giant. > > Giant locked code should be timeout/untimeout safe. As explained > in my email, there exists a condition where the Giant locked code > can have a timer fire even though proper Giant locking is observed. > > For a Giant locked subsystem, one should be able to have the following > code work: > > mtx_lock(&Giant); /* formerly spl higher than softclock */ > untimeout(&func, arg, &sc->handle); > free(sc); > mtx_unlock(&Giant); /* formerly splx() */ > > Normally splsoftclock would completely block the timeout from firing > and this sort of code would be safe. It is no longer safe due to > a BUG in the way that Giant is used. > > Please reread the original mail to better understand the synopsis > of the problem. Hmm. My worry is about leaving the callout structure around while invoking the timeout routine itself, but it is already off the callwheel so it shouldn't be visible via untimeout() to any other code. I guess the patch is ok, but I'll be happy when we can axe timeout/untimeout altogether. -- John Baldwin From owner-freebsd-smp@FreeBSD.ORG Fri Mar 21 04:32:26 2008 Return-Path: Delivered-To: smp@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1E69B1065672 for ; Fri, 21 Mar 2008 04:32:26 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 094D08FC1F for ; Fri, 21 Mar 2008 04:32:25 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id D77C61A4D7E; Thu, 20 Mar 2008 21:32:25 -0700 (PDT) Date: Thu, 20 Mar 2008 21:32:25 -0700 From: Alfred Perlstein To: smp@freebsd.org Message-ID: <20080321043225.GZ67856@elvis.mu.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="6lXr1rPCNTf1w0X8" Content-Disposition: inline User-Agent: Mutt/1.4.2.3i Cc: Subject: request for review, callout fix. X-BeenThere: freebsd-smp@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: FreeBSD SMP implementation group List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Mar 2008 04:32:26 -0000 --6lXr1rPCNTf1w0X8 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Hi guys, attached is a fix for the timeout/untimeout race with Giant locked code. Basically the old code would make the callout inaccessable right before calling it inside of softclock. However only the callout lock is held, so when switching to the callout's associated mutex (in this case Giant) there's a race where a "local" (timeout/untimeout) callout would be fired even if stopped. This patch fixes that. We've run several hours of regression testing on a version of this for 6.x. People internal to Juniper and iedowse@ helped with this. Please review/comment. thank you, -- - Alfred Perlstein --6lXr1rPCNTf1w0X8 Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="kern_timeout.diff" cvs diff: Diffing . Index: kern_timeout.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_timeout.c,v retrieving revision 1.110 diff -u -r1.110 kern_timeout.c --- kern_timeout.c 12 Mar 2008 06:31:06 -0000 1.110 +++ kern_timeout.c 21 Mar 2008 04:28:35 -0000 @@ -230,11 +230,8 @@ c_arg = c->c_arg; c_flags = c->c_flags; if (c->c_flags & CALLOUT_LOCAL_ALLOC) { - c->c_func = NULL; c->c_flags = CALLOUT_LOCAL_ALLOC; - SLIST_INSERT_HEAD(&callfree, c, - c_links.sle); - curr_callout = NULL; + curr_callout = c; } else { c->c_flags = (c->c_flags & ~CALLOUT_PENDING); @@ -299,6 +296,24 @@ class->lc_unlock(c_lock); skip: mtx_lock_spin(&callout_lock); + /* + * If the current callout is locally + * allocated (timeout(9)) and if it has not + * been killed by untimeout(9) + * then put it on the freelist. + * + * Note: we need to check the cached + * copy of c_flags because if it was not + * local, then it's not safe to deref the + * callout pointer. + */ + if (c_flags & CALLOUT_LOCAL_ALLOC && + !(c->c_flags & + (CALLOUT_PENDING | CALLOUT_ACTIVE))) { + c->c_func = NULL; + SLIST_INSERT_HEAD(&callfree, c, + c_links.sle); + } curr_callout = NULL; if (callout_wait) { /* --6lXr1rPCNTf1w0X8-- From owner-freebsd-smp@FreeBSD.ORG Fri Mar 21 19:08:03 2008 Return-Path: Delivered-To: freebsd-smp@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6C2581065679; Fri, 21 Mar 2008 19:08:03 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id B58408FC1E; Fri, 21 Mar 2008 19:08:02 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 236285243-1834499 for multiple; Fri, 21 Mar 2008 15:09:00 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2LJ7d5t029644; Fri, 21 Mar 2008 15:07:39 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-smp@freebsd.org Date: Fri, 21 Mar 2008 13:14:03 -0400 User-Agent: KMail/1.9.7 References: <20080321043225.GZ67856@elvis.mu.org> In-Reply-To: <20080321043225.GZ67856@elvis.mu.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803211314.04002.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Fri, 21 Mar 2008 15:07:40 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6316/Fri Mar 21 10:29:54 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Alfred Perlstein Subject: Re: request for review, callout fix. X-BeenThere: freebsd-smp@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: FreeBSD SMP implementation group List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Mar 2008 19:08:03 -0000 On Friday 21 March 2008 12:32:25 am Alfred Perlstein wrote: > Hi guys, attached is a fix for the timeout/untimeout race with > Giant locked code. > > Basically the old code would make the callout inaccessable > right before calling it inside of softclock. > > However only the callout lock is held, so when switching to > the callout's associated mutex (in this case Giant) there's > a race where a "local" (timeout/untimeout) callout would be > fired even if stopped. > > This patch fixes that. We've run several hours of regression > testing on a version of this for 6.x. > > People internal to Juniper and iedowse@ helped with this. > > Please review/comment. Curious as to how c->c_flags could change if CALLOUT_LOCAL_ALLOC is set? Since it hasn't been enqueued on the callfree list, it isn't visible to any other code, so nothing should be able to mark it active or pending. IOW, you should be able to do this in your second hunk: if (c_flags & CALLOUT_LOCAL_ALLOC) { KASSERT(c->c_flags == CALLOUT_LOCAL_ALLOC, ("corrupted callout")); c->c_func = NULL; SLIST_INSERT_HEAD(...); } -- John Baldwin From owner-freebsd-smp@FreeBSD.ORG Fri Mar 21 20:30:40 2008 Return-Path: Delivered-To: freebsd-smp@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 03AA11065701; Fri, 21 Mar 2008 20:30:40 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id DCF928FC23; Fri, 21 Mar 2008 20:30:39 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id CF41D1A4D7E; Fri, 21 Mar 2008 13:30:39 -0700 (PDT) Date: Fri, 21 Mar 2008 13:30:39 -0700 From: Alfred Perlstein To: John Baldwin Message-ID: <20080321203039.GJ67856@elvis.mu.org> References: <20080321043225.GZ67856@elvis.mu.org> <200803211314.04002.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200803211314.04002.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i Cc: freebsd-smp@freebsd.org Subject: Re: request for review, callout fix. X-BeenThere: freebsd-smp@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: FreeBSD SMP implementation group List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Mar 2008 20:30:40 -0000 * John Baldwin [080321 12:08] wrote: > On Friday 21 March 2008 12:32:25 am Alfred Perlstein wrote: > > Hi guys, attached is a fix for the timeout/untimeout race with > > Giant locked code. > > > > Basically the old code would make the callout inaccessable > > right before calling it inside of softclock. > > > > However only the callout lock is held, so when switching to > > the callout's associated mutex (in this case Giant) there's > > a race where a "local" (timeout/untimeout) callout would be > > fired even if stopped. > > > > This patch fixes that. We've run several hours of regression > > testing on a version of this for 6.x. > > > > People internal to Juniper and iedowse@ helped with this. > > > > Please review/comment. > > Curious as to how c->c_flags could change if CALLOUT_LOCAL_ALLOC is set? > Since it hasn't been enqueued on the callfree list, it isn't visible to any > other code, so nothing should be able to mark it active or pending. IOW, you > should be able to do this in your second hunk: > > if (c_flags & CALLOUT_LOCAL_ALLOC) { > KASSERT(c->c_flags == CALLOUT_LOCAL_ALLOC, ("corrupted callout")); > c->c_func = NULL; > SLIST_INSERT_HEAD(...); > } It's more hairy than that. Actually, I think you're right... The confusion is the race for "callout_stop_safe", but now that I think about it, the softclock will only "grab" this callout if untimeout has NOT yet been called. If untimeout HAS been called, then softclock won't even see the callout. If untimeout HAS NOT been called and softclock grabs the callout, it will have cleared CALLOUT_PENDING and then untimeout (callout_stop_safe) will no longer free it. Therefore it is safe to omit the check for flags as you suggest. Is that right? -- - Alfred Perlstein From owner-freebsd-smp@FreeBSD.ORG Fri Mar 21 20:55:12 2008 Return-Path: Delivered-To: freebsd-smp@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A656B106566C; Fri, 21 Mar 2008 20:55:12 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id B21528FC13; Fri, 21 Mar 2008 20:55:11 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 236295102-1834499 for multiple; Fri, 21 Mar 2008 16:56:23 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2LKt8Mf030415; Fri, 21 Mar 2008 16:55:08 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Alfred Perlstein Date: Fri, 21 Mar 2008 16:54:36 -0400 User-Agent: KMail/1.9.7 References: <20080321043225.GZ67856@elvis.mu.org> <200803211314.04002.jhb@freebsd.org> <20080321203039.GJ67856@elvis.mu.org> In-Reply-To: <20080321203039.GJ67856@elvis.mu.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803211654.36299.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Fri, 21 Mar 2008 16:55:08 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6317/Fri Mar 21 13:33:40 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: freebsd-smp@freebsd.org Subject: Re: request for review, callout fix. X-BeenThere: freebsd-smp@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: FreeBSD SMP implementation group List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Mar 2008 20:55:12 -0000 On Friday 21 March 2008 04:30:39 pm Alfred Perlstein wrote: > * John Baldwin [080321 12:08] wrote: > > On Friday 21 March 2008 12:32:25 am Alfred Perlstein wrote: > > > Hi guys, attached is a fix for the timeout/untimeout race with > > > Giant locked code. > > > > > > Basically the old code would make the callout inaccessable > > > right before calling it inside of softclock. > > > > > > However only the callout lock is held, so when switching to > > > the callout's associated mutex (in this case Giant) there's > > > a race where a "local" (timeout/untimeout) callout would be > > > fired even if stopped. > > > > > > This patch fixes that. We've run several hours of regression > > > testing on a version of this for 6.x. > > > > > > People internal to Juniper and iedowse@ helped with this. > > > > > > Please review/comment. > > > > Curious as to how c->c_flags could change if CALLOUT_LOCAL_ALLOC is set? > > Since it hasn't been enqueued on the callfree list, it isn't visible to any > > other code, so nothing should be able to mark it active or pending. IOW, you > > should be able to do this in your second hunk: > > > > if (c_flags & CALLOUT_LOCAL_ALLOC) { > > KASSERT(c->c_flags == CALLOUT_LOCAL_ALLOC, ("corrupted callout")); > > c->c_func = NULL; > > SLIST_INSERT_HEAD(...); > > } > > It's more hairy than that. > > Actually, I think you're right... > > The confusion is the race for "callout_stop_safe", > but now that I think about it, the softclock will only > "grab" this callout if untimeout has NOT yet been called. > > If untimeout HAS been called, then softclock won't even see > the callout. Yes. > If untimeout HAS NOT been called and softclock grabs the > callout, it will have cleared CALLOUT_PENDING and then > untimeout (callout_stop_safe) will no longer free it. Yes. > Therefore it is safe to omit the check for flags as you > suggest. > > Is that right? Yes, I believe so. The tricky case is the race you are originally trying to fix which is that the untimeout() comes in after the spin lock is dropped but before Giant is acquired, but that falls under your second case above. -- John Baldwin From owner-freebsd-smp@FreeBSD.ORG Fri Mar 21 23:52:05 2008 Return-Path: Delivered-To: freebsd-smp@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E21301065672; Fri, 21 Mar 2008 23:52:05 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id C22CC8FC17; Fri, 21 Mar 2008 23:52:05 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id 330691A4D7E; Fri, 21 Mar 2008 16:52:05 -0700 (PDT) Date: Fri, 21 Mar 2008 16:52:05 -0700 From: Alfred Perlstein To: John Baldwin Message-ID: <20080321235205.GO67856@elvis.mu.org> References: <20080321043225.GZ67856@elvis.mu.org> <200803211314.04002.jhb@freebsd.org> <20080321203039.GJ67856@elvis.mu.org> <200803211654.36299.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200803211654.36299.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i Cc: freebsd-smp@freebsd.org Subject: Re: request for review, callout fix. X-BeenThere: freebsd-smp@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: FreeBSD SMP implementation group List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Mar 2008 23:52:06 -0000 * John Baldwin [080321 13:55] wrote: > On Friday 21 March 2008 04:30:39 pm Alfred Perlstein wrote: > > * John Baldwin [080321 12:08] wrote: > > > On Friday 21 March 2008 12:32:25 am Alfred Perlstein wrote: > > > > Hi guys, attached is a fix for the timeout/untimeout race with > > > > Giant locked code. > > > > > > > > Basically the old code would make the callout inaccessable > > > > right before calling it inside of softclock. > > > > > > > > However only the callout lock is held, so when switching to > > > > the callout's associated mutex (in this case Giant) there's > > > > a race where a "local" (timeout/untimeout) callout would be > > > > fired even if stopped. > > > > > > > > This patch fixes that. We've run several hours of regression > > > > testing on a version of this for 6.x. > > > > > > > > People internal to Juniper and iedowse@ helped with this. > > > > > > > > Please review/comment. > > > > > > Curious as to how c->c_flags could change if CALLOUT_LOCAL_ALLOC is set? > > > Since it hasn't been enqueued on the callfree list, it isn't visible to > any > > > other code, so nothing should be able to mark it active or pending. IOW, > you > > > should be able to do this in your second hunk: > > > > > > if (c_flags & CALLOUT_LOCAL_ALLOC) { > > > KASSERT(c->c_flags == CALLOUT_LOCAL_ALLOC, ("corrupted callout")); > > > c->c_func = NULL; > > > SLIST_INSERT_HEAD(...); > > > } > > > > It's more hairy than that. > > > > Actually, I think you're right... > > > > The confusion is the race for "callout_stop_safe", > > but now that I think about it, the softclock will only > > "grab" this callout if untimeout has NOT yet been called. > > > > If untimeout HAS been called, then softclock won't even see > > the callout. > > Yes. > > > If untimeout HAS NOT been called and softclock grabs the > > callout, it will have cleared CALLOUT_PENDING and then > > untimeout (callout_stop_safe) will no longer free it. > > Yes. > > > Therefore it is safe to omit the check for flags as you > > suggest. > > > > Is that right? > > Yes, I believe so. The tricky case is the race you are originally trying to > fix which is that the untimeout() comes in after the spin lock is dropped but > before Giant is acquired, but that falls under your second case above. Thank you, I'll be committing later tonight. -Alfred