From: John Baldwin
To: freebsd-smp@freebsd.org
Cc: stable@freebsd.org, Alfred Perlstein
Date: Mon, 17 Mar 2008 11:27:20 -0400
Subject: Re: timeout/untimeout race conditions/crash [patch]
Message-Id: <200803171127.20561.jhb@freebsd.org>
In-Reply-To: <20080315024114.GD67856@elvis.mu.org>

On Friday 14 March 2008 10:41:14 pm Alfred Perlstein wrote:
> We think we tracked down a defect in timeout/untimeout in
> FreeBSD.
>
> We have reduced the problem to the following scenario:
>
> 2+ cpu system, one cpu is running softclock at the same time
> another thread is running on another cpu which makes use of
> timeout/untimeout.
>
> CPU 0 is running "softclock"
> CPU 1 is running "driver" with Giant held.
>
> softclock: mtx_lock_spin(&callout_lock)
> softclock: CACHES the callout structure's fields.
> softclock: sees that it's a CALLOUT_LOCAL_ALLOC
> softclock: executes this code:
>                 if (c->c_flags & CALLOUT_LOCAL_ALLOC) {
>                         c->c_func = NULL;
>                         c->c_flags = CALLOUT_LOCAL_ALLOC;
>                         SLIST_INSERT_HEAD(&callfree, c,
>                                           c_links.sle);
>                         curr_callout = NULL;
>                 } else {
>
> NOTE: that c->c_func has been set to NULL and curr_callout
> is also NULL.
> softclock: mtx_unlock_spin(&callout_lock)
> driver: calls untimeout(), the following sequence happens:
>         mtx_lock_spin(&callout_lock);
>         if (handle.callout->c_func == ftn && handle.callout->c_arg == arg)
>                 callout_stop(handle.callout);
>         mtx_unlock_spin(&callout_lock);
>
> NOTE: untimeout() sees that handle.callout->c_func is not set
> to the function so it does NOT call callout_stop(9)!
> driver: free's backing structure for c->c_arg.
> softclock: executes callout.
> softclock: likely crashes at this point due to access after free.
>
> I have a patch I'm trying out here, but I need feedback on it.
>
> The way the patch works is to treat CALLOUT_LOCAL_ALLOC (timeout/untimeout)
> callouts the same as ~CALLOUT_LOCAL_ALLOC allocs, and moves the
> freelist manipulation to the end of the callout dispatch.
>
> Some light testing seems to have the system work.
>
> We are doing some testing in-house to also make sure this works.
>
> Please provide feedback.
>
> See attached delta.

This is not a bug.
Don't use untimeout(9) as it is not guaranteed to be reliable. Instead, use
callout_*(). Your patch doesn't solve any races as the driver detach routine
needs to use callout_drain() and not just callout_stop/untimeout anyways. Fix
your broken drivers.

-- 
John Baldwin