From owner-freebsd-current@FreeBSD.ORG Tue Aug 16 17:50:23 2005 Return-Path: X-Original-To: freebsd-current@freebsd.org Delivered-To: freebsd-current@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id ED96D16A41F for ; Tue, 16 Aug 2005 17:50:23 +0000 (GMT) (envelope-from jhb@FreeBSD.org) Received: from mv.twc.weather.com (mv.twc.weather.com [65.212.71.225]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6821E43D46 for ; Tue, 16 Aug 2005 17:50:23 +0000 (GMT) (envelope-from jhb@FreeBSD.org) Received: from [10.50.40.201] (Not Verified[10.50.40.201]) by mv.twc.weather.com with NetIQ MailMarshal (v6, 0, 3, 8) id ; Tue, 16 Aug 2005 14:05:11 -0400 From: John Baldwin To: freebsd-current@freebsd.org Date: Tue, 16 Aug 2005 13:21:13 -0400 User-Agent: KMail/1.8 References: <20050811220017.A72944@hades.admin.frm2> <200508120937.26659.jhb@FreeBSD.org> <430034B9.1010706@samsco.org> In-Reply-To: <430034B9.1010706@samsco.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200508161321.14735.jhb@FreeBSD.org> Cc: Joerg Pulz Subject: Re: 6.0-BETA2: taskqueue_drain for if_xl.c:2796 X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Aug 2005 17:50:24 -0000 On Monday 15 August 2005 02:22 am, Scott Long wrote: > John Baldwin wrote: > > On Thursday 11 August 2005 07:07 pm, Scott Long wrote: > I still don't see what the point is of having a function that only > removes pending tasks from the queue if it doesn't also drain in > progress tasks. Simple example: > > CPU1 calls xl_stop. The XL_LOCK gets taken. At the same time, CPU2 > calls taskqueue_swi, which pops the queue and calls xl_rxeof_task. It > tries to take the XL_LOCK but instead sleeps (or adaptively spins) > because CPU1 already has it. CPU1 calls taskqueue_stop, which doesn't > do anything because the task was alrady popped off. > > Now, at some point, CPU1 **MUST** drop the XL_LOCK and let CPU2 continue > that task to completion. There simply is no way around that, right? So > why muddle the API with a function that really doesn't solve any > problems. The argument that being able to call taskqueue_stop with > locks held is a convenience doesn't hold water. You'll still need to > drop locks in order to be sure that a mess isn't created. Well, you need to think through this longer. During a normal stop, we don't care if the task runs later if it is just going to stop anyway, which it will with the IFF_DRV_RUNNING checks I added (which it needs anyway even if you don't add a taskqueue_stop()). See, when CPU2 eventually runs the xl_rxeof_task, it will drop the lock and return instantly because the xl_stop() function clears IFF_DRV_RUNNING in addition to calling taskqueue_stop(). Now, let's go back to the detach case where you actually need a drain. In that case you call xl_stop() first. It would be nice if that could go ahead and cancel the task if it's still in the queue without blocking at all. Calling taskqueue_stop() in xl_stop() will do that for you just fine. If you don't do that, you have to wait until any other tasks ahead of you on the queue decide to finish before your task can run. That's nuts! Using FOO_LOCK(sc); taskqueue_stop(&sc->task); FOO_UNLOCK(sc); taskqueue_drain(&sc->task); Results in the behavior that if the task hasn't run yet, you don't block, and thus you only block if the task is already running and your driver can guarantee that that won't be forever since it can ensure the task handler checks for the interface being stopped after it does its own FOO_LOCK(). While we are on detach, you already have to add the IFF_DRV_RUNNING checks to the task handler even if you still use taskqueue_drain() in xl_stop() so that you don't block forever in that case (that is, the task handler has to realize that it's supposed to die so it will return and the draining thread will be resumed.) One other thing here. taskqueue_drain() is currently called from xl_stop(), which is called from xl_init_locked(), which is called from xl_intr(). Think "sleeping in an interrupt thread." There be dragons. :) This is common to almost all the ethernet drivers I've looked at in the past few weeks (de, dc, fxp, hme, sf, my, ste, pcn, el, and xl). > So what does that really buy you since you still have to deal with the > possible blocked task problem? Better to just organize the code so that > you can call taskqueue_dequeue with locks released and let the problem > be dealt with in a consistent way, rather than encourage people roll > their own dequeueing logic (best case) or become careless with > parallelization (worse case). I think that all that you are doing is > finding a way to cheat around a real synchronization problem. There are > a lot of network drivers in the tree that are notorious for > fundamentally poor locking design. I'd really rather see them be fixed > to conform to good APIs, rather than add marginal APIs that paper over > the problems. Errm, from all the driver's I've worked on so far, stop() is always called with the lock held the entire time and doesn't drop it. The drivers are written this way in 4.x as well (they don't drop spl()), so using callout_stop() and taskqueue_stop() instead of callout_drain() and taskqueue_drain() in foo_stop() is a lot simpler change. stop() is also called from interrupt handlers and can't really do anything that sleeps. Also, in general the entry points to the driver (ifnet functions, ifmedia functions, interrupt handlers, etc.) tend to lock the driver lock and the internal functions don't mess with the lock (except to drop the lock around calls to ifp->if_input()). That's a lot simpler than having to rewrite stop(). Also, letting me do a stop() that just lets the driver be in control of how long it has to wait in detach() when it tries to cancel its task rather than being subject to the whims of over tasks on the queue. Again, we are talking about a model where we fire off an event (stop your callout, stop your task) and defer the blocking of acknowledgement until a more convenient time. Outside of the detach() case, we don't even need to block waiting for the ack because there's no harm in having the event handler run, note that the interface is stopped, and abort. See, where's the problem in your original case with CPU2 running the task if the task handler checks the interface to see if it's been stopped and will abort if it gets run after CPU1 has done foo_stop()? For the non-detach() case we don't care when or if the task runs. For the detach() case we have to make sure it doesn't try to use our mutex (or the text of our kernel module) before we destroy the mutex and return from detach(). In other words, the drain() isn't there to synchronize state of the device itself, but to guarantee order of operations for driver teardown. -- John Baldwin <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve" = http://www.FreeBSD.org