From: Don Lewis <truckman@FreeBSD.org>
Date: Wed, 23 Aug 2017 13:58:42 -0700 (PDT)
Subject: Re: ULE steal_idle questions
To: avg@FreeBSD.org
cc: freebsd-arch@FreeBSD.org

On 23 Aug, Andriy Gapon wrote:
> On 23/08/2017 18:04, Don Lewis wrote:
>> I've been looking at the steal_idle code in tdq_idled() and found some
>> things that puzzle me.
>>
>> Consider a machine with three CPUs:
>>   A, which is idle
>>   B, which is busy running a thread
>>   C, which is busy running a thread and has another thread in queue
>> It would seem to make sense that the tdq_load values for these three
>> CPUs would be 0, 1, and 2 respectively in order to select the best CPU
>> to run a new thread.
>>
>> If so, then why do we pass thresh=1 to sched_highest() in the code that
>> implements steal_idle?  That value is used to set cs_limit, which is
>> used in this comparison in cpu_search():
>>
>> 	if (match & CPU_SEARCH_HIGHEST)
>> 		if (tdq->tdq_load >= hgroup.cs_limit &&
>>
>> That would seem to make CPU B a candidate for stealing a thread from.
>> Ignoring CPU C for the moment, that shouldn't happen if the thread is
>> running, but even if it were possible, it would just make CPU B go
>> idle, which isn't terribly helpful in terms of load balancing and
>> would just thrash the caches.  The same comparison is repeated in
>> tdq_idled() after a candidate CPU has been chosen:
>>
>> 	if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
>> 		tdq_unlock_pair(tdq, steal);
>> 		continue;
>> 	}
>>
>> It looks to me like there is an off-by-one error here, and there is a
>> similar problem in the code that implements kern.sched.balance.
>
> I agree with your analysis.  I had the same questions as well.
> I think that the tdq_transferable check is what saves the code from
> running into any problems.  But it indeed would make sense for the code
> to understand that tdq_load includes a currently running, never
> transferable thread as well.

Yes, I think the candidate will fail the tdq_transferable check, but at
the cost of an unnecessary tdq_lock_pair()/tdq_unlock_pair() and another
loop iteration, including another expensive sched_highest() call.
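
To make the off-by-one concrete, here is a toy userland illustration
(my own names and function, not anything from sched_ule.c) of the test
I think the steal path effectively needs, given that tdq_load counts
the currently running, never transferable thread:

/*
 * Toy sketch only, not kernel code.  "load" stands in for tdq_load
 * (running + queued) and "transferable" for tdq_transferable.
 */
#include <stdbool.h>
#include <stdio.h>

static bool
worth_stealing(int load, int transferable)
{
	/*
	 * load == 1 is just the thread already running on that CPU;
	 * taking it would only move the work and thrash the caches.
	 * Only load >= 2 (i.e. thresh = 2) means something is actually
	 * waiting in the queue, and the transferable count must confirm
	 * that it is allowed to move.
	 */
	return (load >= 2 && transferable > 0);
}

int
main(void)
{
	printf("CPU B (load 1): %d\n", worth_stealing(1, 0));	/* 0 */
	printf("CPU C (load 2): %d\n", worth_stealing(2, 1));	/* 1 */
	return (0);
}

With thresh=1 the current code treats CPU B as a candidate and only the
tdq_transferable check saves us, after the lock dance; with thresh=2,
CPU B would never be considered in the first place.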
Consider the case of a close to fully loaded system where all of the
other CPUs each have one running thread.  The current code will try each
of the other CPUs in turn, calling tdq_lock_pair(), failing the
tdq_transferable check, calling tdq_unlock_pair(), and then restarting
the loop with that CPU removed from the mask, all of it done with
interrupts disabled.  The proper thing to do in this case would be to
just go into the idle state.  (The toy model at the end of this message
counts those wasted scans and lock/unlock pairs.)

>> The reason I ask is that I've been debugging random segfaults and
>> other strange errors on my Ryzen machine and the problems mostly go
>> away if I either disable kern.sched.steal_idle and kern.sched.balance,
>> or if I leave kern.sched.steal_idle enabled and hack the code to
>> change the value of thresh from 1 to 2.  See for the gory details.
>> I don't know if my CPU has what AMD calls the "performance
>> marginality issue".
>
> I have been following your experiments and it's interesting that
> "massaging" the CPU in certain ways makes it a bit happier.  But
> certainly the fault is with the CPU as the code is trouble-free on many
> different architectures including x86, and various processors from both
> Intel and AMD [with earlier CPU families].

The results of my experiments so far are pointing to the time spent
looping in tdq_idled() as at least one cause of the problems.  If I
restrict tdq_idled() to look at just a single core, things are happy.
If I then set steal_thresh to 1 so that it has to loop, things get
unhappy.  If I allow tdq_idled() to look at both the current core and
the current CCX, with thresh at the CCX level set to 1 so that the code
loops, things are unhappy.  If I set thresh to 2 so that the code does
not loop unnecessarily, things get happy again.  If I allow tdq_idled()
to look at the entire topology (so that there are now three calls to
sched_highest()), things get unhappy again.  There is a known issue with
executing IRET on one SMT thread while the other SMT thread on that core
is "busy", but I don't know how that would affect things.  Perhaps that
can be fixed in microcode.

sched_highest() looks like it is really expensive in terms of CPU
cycles.  On Ryzen, if we can't find a suitable thread on the current CCX
to transfer and step up to the chip level, sched_highest() will
recalculate the load on the current CCX even though we have already
rejected it.  Things get worse when using cpuset, because if tdq_move()
fails due to cpuset (or other) restrictions, we call sched_highest() all
over again to redo all of the calculations, but with the previously
chosen CPU removed from the potential choices.  This will get even worse
with Threadripper and beyond.  Even if it doesn't cause obvious
breakage, it is bad for interrupt latency.

I'm not convinced that the ghc and go problems are Ryzen bugs rather
than bugs in the code for those two ports.  I've never seen build
failures for those on my FX-8320E, but the increased number of threads
on Ryzen might expose some latent problems.
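
To put a rough number on the looping cost, here is a small userland
model I threw together.  It is only a sketch under my own assumptions
(a flat 8-CPU group, no topology levels, no cpuset restrictions, and
made-up helper names), not the real scheduler code, but it shows what
happens in the fully loaded case above: with thresh=1 the idle CPU does
a full rescan plus a lock/unlock pair for every other CPU before it
finally gives up, while thresh=2 gives up after a single scan.

/*
 * Toy model of the stealing loop in tdq_idled(); a sketch with invented
 * names, not the code in sys/kern/sched_ule.c.  CPU 0 is idle; every
 * other CPU runs exactly one thread with nothing queued, so there is
 * nothing that can legitimately be stolen.
 */
#include <stdio.h>

#define	NCPU	8

struct cpu {
	int	load;		/* stands in for tdq_load */
	int	transferable;	/* stands in for tdq_transferable */
};

/* Crude stand-in for sched_highest(): scan every CPU left in the mask. */
static int
highest_loaded(const struct cpu cpus[], const int mask[], int thresh,
    int *scans)
{
	int best = -1, best_load = 0;

	(*scans)++;
	for (int i = 1; i < NCPU; i++)
		if (mask[i] && cpus[i].load >= thresh &&
		    cpus[i].load > best_load) {
			best = i;
			best_load = cpus[i].load;
		}
	return (best);
}

int
main(void)
{
	struct cpu cpus[NCPU] = { { 0, 0 } };
	int mask[NCPU];

	for (int i = 1; i < NCPU; i++) {
		cpus[i].load = 1;		/* one running thread */
		cpus[i].transferable = 0;	/* nothing we may move */
	}

	for (int thresh = 1; thresh <= 2; thresh++) {
		int scans = 0, lock_pairs = 0, cpu;

		for (int i = 0; i < NCPU; i++)
			mask[i] = 1;
		while ((cpu = highest_loaded(cpus, mask, thresh,
		    &scans)) != -1) {
			lock_pairs++;		/* tdq_lock_pair()/unlock */
			if (cpus[cpu].load < thresh ||
			    cpus[cpu].transferable == 0) {
				mask[cpu] = 0;	/* drop it, rescan the rest */
				continue;
			}
			break;			/* we would steal here */
		}
		printf("thresh=%d: %d scans, %d lock/unlock pairs\n",
		    thresh, scans, lock_pairs);
	}
	return (0);
}

On my reading this prints 8 scans and 7 lock/unlock pairs for thresh=1
versus a single scan and no locking for thresh=2, and the real code only
makes it worse by repeating the whole dance at the core, CCX, and chip
levels and again after cpuset-related tdq_move() failures.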