From owner-freebsd-arch@freebsd.org Fri Aug 25 18:24:19 2017
Message-Id: <201708251824.v7PIOA6q048321@gw.catspoiler.org>
Date: Fri, 25 Aug 2017 11:24:10 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: ULE steal_idle questions
To: avg@FreeBSD.org
Cc: freebsd-arch@FreeBSD.org
In-Reply-To: <201708241641.v7OGf3pA042851@gw.catspoiler.org>

On 24 Aug, To: avg@FreeBSD.org wrote:
> Aside from the Ryzen problem, I think the steal_idle code should be
> rewritten so that it doesn't block interrupts for so long.  In its
> current state, interrupt latency increases with the number of cores
> and the complexity of the topology.
>
> What I'm thinking is that we should set a flag at the start of the
> search for a thread to steal.  If we are preempted by another,
> higher-priority thread, that thread will clear the flag.  Next we
> start the loop to search up the hierarchy.  Once we find a candidate
> CPU:
>
> 	steal = TDQ_CPU(cpu);
> 	CPU_CLR(cpu, &mask);
> 	tdq_lock_pair(tdq, steal);
> 	if (tdq->tdq_load != 0) {
> 		goto out;	/* exit the loop and switch to the new thread */
> 	}
> 	if (flag was cleared) {
> 		tdq_unlock_pair(tdq, steal);
> 		goto restart;	/* restart the search */
> 	}
> 	if (steal->tdq_load < thresh || steal->tdq_transferable == 0 ||
> 	    tdq_move(steal, tdq) == 0) {
> 		tdq_unlock_pair(tdq, steal);
> 		continue;
> 	}
> out:
> 	TDQ_UNLOCK(steal);
> 	clear flag;
> 	mi_switch(SW_VOL | SWT_IDLE, NULL);
> 	thread_unlock(curthread);
> 	return (0);
>
> And we also have to clear the flag if we did not find a thread to
> steal.

I've implemented something like this and added a bunch of counters to
it to get a better understanding of its behavior.  Instead of adding a
flag to detect preemption, I used the same switchcnt test that
sched_idletd() uses.
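
Roughly, the per-candidate part of the loop now looks like the sketch
below.  This is an illustration of the idea rather than the exact diff:
switchcnt is assumed to have been sampled from tdq_switchcnt +
tdq_oldswitchcnt before the search started, the same way sched_idletd()
samples it, and the other names are the existing ones in sched_ule.c.

	/*
	 * Sketch only, not the exact patch.  'cpu' is the candidate
	 * returned by sched_highest(), and 'switchcnt' was sampled
	 * before the search started.
	 */
	steal = TDQ_CPU(cpu);
	CPU_CLR(cpu, &mask);
	tdq_lock_pair(tdq, steal);
	if (tdq->tdq_load != 0) {
		/* Local work arrived while we were searching; run it. */
		goto out;
	}
	if (switchcnt != tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt) {
		/* We were preempted; what sched_highest() saw is stale. */
		tdq_unlock_pair(tdq, steal);
		goto restart;
	}
	if (steal->tdq_load < steal_thresh || steal->tdq_transferable == 0 ||
	    tdq_move(steal, tdq) == 0) {
		/* No longer worth stealing from; try the next candidate. */
		tdq_unlock_pair(tdq, steal);
		continue;
	}
out:
	/* Drop the remote lock and switch to the thread now on our queue. */
	TDQ_UNLOCK(steal);
	mi_switch(SW_VOL | SWT_IDLE, NULL);
	thread_unlock(curthread);
	return (0);

The restart: path just re-samples switchcnt and starts the topology
search over, which is what the restart counter below is counting.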

These are the results of a ~9 hour poudriere run:

kern.sched.steal.none: 9971668      # no threads were stolen
kern.sched.steal.fail: 23709        # unable to steal from the cpu picked by sched_highest()
kern.sched.steal.level2: 191839     # stolen from somewhere on this chip
kern.sched.steal.level1: 557659     # stolen from a core on this CCX
kern.sched.steal.level0: 4555426    # stolen from the other SMT thread on this core
kern.sched.steal.restart: 404       # preemption detected, so the search was restarted
kern.sched.steal.call: 15276638     # number of times tdq_idled() was called

There are a few surprises here.  One is the number of failed moves.  I
don't know whether the load on the source CPU fell below steal_thresh,
tdq_transferable went to zero, or tdq_move() itself failed (a per-cause
breakdown, like the sketch at the end of this message, would answer
that).  I also wonder whether the failures are evenly distributed
across the CPUs.  It is possible that they are concentrated on CPU 0,
which handles most interrupts; if interrupts don't affect switchcnt,
then the data collected by sched_highest() could be a bit stale and we
would not know it.

Something else that I did not expect is how frequently threads are
stolen from the other SMT thread on the same core, even though I
increased steal_thresh from 2 to 3 to account for the off-by-one
problem.  This is true even right after the system has booted, before
any significant load has been applied.  My best guess is that because
of affinity, both the parent and the child run on the same CPU after
fork(), and if a number of processes are forked in quick succession,
the run queue of that CPU can get quite long.  Forcing a thread
migration in exec() might be a good solution.
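
For the failed-move question above, one way to tell the three cases
apart would be to split kern.sched.steal.fail into one counter per
cause.  An untested sketch of what I have in mind is below; the counter
names are placeholders, and the sysctl wiring assumes the
kern.sched.steal node in my patch is an ordinary SYSCTL_NODE declared
as _kern_sched_steal.

/* File-scope, next to the existing kern.sched.steal.* counters. */
static unsigned long steal_fail_thresh;
static unsigned long steal_fail_transferable;
static unsigned long steal_fail_move;
SYSCTL_ULONG(_kern_sched_steal, OID_AUTO, fail_thresh, CTLFLAG_RD,
    &steal_fail_thresh, 0, "candidate load dropped below steal_thresh");
SYSCTL_ULONG(_kern_sched_steal, OID_AUTO, fail_transferable, CTLFLAG_RD,
    &steal_fail_transferable, 0, "candidate had no transferable threads");
SYSCTL_ULONG(_kern_sched_steal, OID_AUTO, fail_move, CTLFLAG_RD,
    &steal_fail_move, 0, "tdq_move() failed");

	/*
	 * In the loop, replace the single combined test with one branch
	 * per cause.  The increments are not atomic, but that should be
	 * close enough for rough statistics.
	 */
	if (steal->tdq_load < steal_thresh)
		steal_fail_thresh++;
	else if (steal->tdq_transferable == 0)
		steal_fail_transferable++;
	else if (tdq_move(steal, tdq) == 0)
		steal_fail_move++;
	else
		goto out;	/* success; switch to the stolen thread */
	tdq_unlock_pair(tdq, steal);
	continue;

The same trick, with the counters made per-CPU, would also show whether
the failures are concentrated on CPU 0.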