From owner-freebsd-hackers@FreeBSD.ORG Tue Nov 6 11:02:07 2012
Date: Tue, 6 Nov 2012 11:02:04 +0000
From: Attilio Rao
Reply-To: attilio@FreeBSD.org
To: Andriy Gapon
Cc: freebsd-hackers, Jeff Roberson
Subject: Re: ule+smp: small optimization for turnstile priority lending
List-Id: Technical Discussions relating to FreeBSD

On 10/29/12, Attilio Rao wrote:
> On Mon, Oct 29, 2012 at 4:56 PM, Attilio Rao wrote:
>> On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon wrote:
>>> on 20/09/2012 16:14 Attilio Rao said the following:
>>>> On 9/20/12, Andriy Gapon wrote:
>>> [snip]
>>>>> The patch works well as far as I can tell.  Thank you!
>>>>> There is one warning with full witness enabled but it appears to be
>>>>> harmless (so far):
>>>>
>>>> Andriy,
>>>> thanks a lot for your testing and the reports you made so far.
>>>> Unfortunately I'm going off for 2 weeks now and I won't work on
>>>> FreeBSD for that timeframe.  I will get back to this in 2 weeks, then.
>>>> If you want to play more with this idea feel free to extend/fix/etc.
>>>> this patch.
>>>
>>> Unfortunately I haven't found time to work on this further, but I have
>>> some additional thoughts.
>>>
>>> First, I think that the witness message was not benign and it actually
>>> warns about the same kind of deadlock that I originally had.
>>> The problem is that sched_lend_prio() is called with the target
>>> thread's td_lock held, which is the lock of the tdq on the thread's
>>> CPU.  Then, in your patch, we acquire the current tdq's lock to modify
>>> its load.  But if two tdq locks are to be held at the same time, they
>>> must be locked using tdq_lock_pair(), so that lock order is
>>> maintained.  With the patch no tdq lock order can be maintained (it
>>> can be arbitrary) and thus it is possible to run into a deadlock.
>>
>> Indeed.  I realized this after thinking about the problem while I was
>> on holiday.
>>
>>> I see two possibilities here, but I don't rule out that there can be
>>> more.
>>>
>>> 1. Do something like my patch does.
>>> That is, manipulate the current thread's tdq in advance, before any
>>> other thread/tdq locks are acquired (or before td_lock is changed to
>>> point to a different lock and the current tdq is unlocked).  The API
>>> can be made more generic in nature.  E.g. it can look like this:
>>>
>>> void
>>> sched_thread_block(struct thread *td, int inhibitor)
>>> {
>>>         struct tdq *tdq;
>>>
>>>         THREAD_LOCK_ASSERT(td, MA_OWNED);
>>>         KASSERT(td == curthread,
>>>             ("sched_thread_block: only curthread is supported"));
>>>         tdq = TDQ_SELF();
>>>         TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>>>         MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
>>>         TD_SET_INHIB(td, inhibitor);
>>>         tdq_load_rem(tdq, td);
>>>         tdq_setlowpri(tdq, td);
>>> }
>>>
>>> 2. Try to do things from sched_lend_prio() based on curthread's state.
>>> This, as it seems, would require completely lock-less manipulation of
>>> the current tdq.  E.g. for tdq_load we could use atomic operations (it
>>> is already accessed locklessly, but not modified).  But for tdq_lowpri
>>> a more elaborate trick would be required, like having a separate field
>>> for a temporary value.
>>
>> No, this is not going to work, because tdq_lowpri and tdq_load are
>> updated and used in the same critical path (and possibly also other
>> tdq members in tdq_runqueue_add(), for example, but I didn't bother
>> to also check that).
>
> Ok, so here I wasn't completely right, because tdq_lowpri and tdq_load
> are not read in the same critical path (cpu_search() and
> sched_pickcpu() in the self case).
> However, I was over-estimating a factor in this patch: I don't think we
> need to fix the real tdq_load for self in that case before cpu_search(),
> because self is handled in a very special way, with its own comparison.
> I then think that a patch local to sched_pickcpu() on the self path
> should be enough.
>
> This brings us to another, bigger issue, however.  tdq_lowpri handling
> is not perfect in that patch.
> Think about the case where the thread to be switched out is, in fact,
> the lowest-priority one.  In that case, what we should do is
> recalculate tdq_lowpri, which is usually a very expensive operation and
> requires the tdq self lock to be held for it to be carried out
> correctly.

I think that the easy fix for this is to implement a second field,
called tdq_sec_lowpri, where we store the priority of the next thread to
be scheduled after tdq_lowpri.  This is not really going to cost
anything, because we get it for free while we already calculate
tdq_lowpri, and it will let us implement a definitive logic that helps
in the case you describe.

However, I still want to think about the basic idea behind the current
algorithm in the self/likely CPU pickup.

Attilio

--
Peace can only be achieved by understanding - A. Einstein