From owner-freebsd-hackers@FreeBSD.ORG Tue Nov 6 11:02:07 2012
Date: Tue, 6 Nov 2012 11:02:04 +0000
From: Attilio Rao
Reply-To: attilio@FreeBSD.org
To: Andriy Gapon
Cc: freebsd-hackers, Jeff Roberson
Subject: Re: ule+smp: small optimization for turnstile priority lending
List-Id: Technical Discussions relating to FreeBSD

On 10/29/12, Attilio Rao wrote:
> On Mon, Oct 29, 2012 at 4:56 PM, Attilio Rao wrote:
>> On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon wrote:
>>> on 20/09/2012 16:14 Attilio Rao said the following:
>>>> On 9/20/12, Andriy Gapon wrote:
>>> [snip]
>>>>> The patch works well as far as I can tell.  Thank you!
>>>>> There is one warning with full witness enabled but it appears to be
>>>>> harmless (so far):
>>>>
>>>> Andriy,
>>>> thanks a lot for your testing and the reports you made so far.
>>>> Unfortunately I'm going off for 2 weeks now and I won't work on
>>>> FreeBSD for that timeframe.  I will get back to this in 2 weeks, then.
>>>> If you want to play more with this idea feel free to extend/fix/etc.
>>>> this patch.
>>>
>>> Unfortunately I haven't found time to work on this further, but I have
>>> some additional thoughts.
>>>
>>> First, I think that the witness message was not benign and it actually
>>> warns about the same kind of deadlock that I originally had.
>>> The problem is that sched_lend_prio() is called with the target
>>> thread's td_lock held, which is the lock of the tdq on the thread's
>>> CPU.  Then, in your patch, we acquire the current tdq's lock to modify
>>> its load.  But if two tdq locks are to be held at the same time, they
>>> must be locked using tdq_lock_pair(), so that lock order is
>>> maintained.  With the patch no tdq lock order can be maintained (it
>>> can be arbitrary) and thus it is possible to run into a deadlock.
>>
>> Indeed.  I realized this after thinking about the problem while I was
>> on holiday.
>>
>>> I see two possibilities here, but I don't rule out that there can be
>>> more.
>>>
>>> 1. Do something like my patch does.
>>> That is, manipulate the current thread's tdq in advance, before any
>>> other thread/tdq locks are acquired (or before td_lock is changed to
>>> point to a different lock and the current tdq is unlocked).  The API
>>> can be made more generic in nature.  E.g. it can look like this:
>>>
>>> void
>>> sched_thread_block(struct thread *td, int inhibitor)
>>> {
>>>         struct tdq *tdq;
>>>
>>>         THREAD_LOCK_ASSERT(td, MA_OWNED);
>>>         KASSERT(td == curthread,
>>>             ("sched_thread_block: only curthread is supported"));
>>>         tdq = TDQ_SELF();
>>>         TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>>>         MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
>>>         TD_SET_INHIB(td, inhibitor);
>>>         tdq_load_rem(tdq, td);
>>>         tdq_setlowpri(tdq, td);
>>> }
>>>
>>> 2. Try to do things from sched_lend_prio() based on curthread's state.
>>> This, as it seems, would require completely lock-less manipulation of
>>> the current tdq.  E.g. for tdq_load we could use atomic operations (it
>>> is already accessed locklessly, but not modified).  But for tdq_lowpri
>>> a more elaborate trick would be required, like having a separate field
>>> for a temporary value.
>>
>> No, this is not going to work, because tdq_lowpri and tdq_load are
>> updated and used in the same critical path (and possibly also other
>> tdq members in tdq_runqueue_add(), for example, but I didn't bother
>> to also check that).
>
> Ok, so here I wasn't completely right, because tdq_lowpri and tdq_load
> are not read in the same critical path (cpu_search() and
> sched_pickcpu() in the self case).
> However, I was over-estimating a factor in this patch: I don't think we
> need to fix the real tdq_load for self in that case before cpu_search(),
> because self is handled in a very special way, with its own comparison.
> I then think that a patch local to sched_pickcpu() on the self path
> should be enough.
>
> This brings us to another, bigger issue, however.  tdq_lowpri handling
> is not perfect in that patch.
> Think about the case where the thread to be switched out is, in fact,
> the lowest-priority one.  In that case, what we should do is
> recalculate tdq_lowpri, which is usually a very expensive operation and
> requires the tdq self lock to be held for it to be carried out
> correctly.

I think that the easy fix for this is to implement a second field,
called tdq_sec_lowpri, where we store the priority of the next thread to
be scheduled after tdq_lowpri.  This is not really going to cost
anything, because we get it for free while we already calculate
tdq_lowpri, and it will let us implement a definitive logic that helps
in the case you describe.

However, I still want to think about the basic idea behind the current
algorithm in the self/likely CPU pickup.

Attilio

--
Peace can only be achieved by understanding - A. Einstein