From owner-freebsd-arch@freebsd.org Fri Aug 25 18:24:19 2017
Message-Id: <201708251824.v7PIOA6q048321@gw.catspoiler.org>
Date: Fri, 25 Aug 2017 11:24:10 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: ULE steal_idle questions
To: avg@FreeBSD.org
Cc: freebsd-arch@FreeBSD.org
In-Reply-To: <201708241641.v7OGf3pA042851@gw.catspoiler.org>

On 24 Aug, To: avg@FreeBSD.org wrote:
> Aside from the Ryzen problem, I think the steal_idle code should be
> rewritten so that it doesn't block interrupts for so long.  In its
> current state, interrupt latency increases with the number of cores
> and the complexity of the topology.
>
> What I'm thinking is that we should set a flag at the start of the
> search for a thread to steal.  If we are preempted by another,
> higher-priority thread, that thread will clear the flag.  Next we
> start the loop to search up the hierarchy.  Once we find a candidate
> CPU:
>
> 	steal = TDQ_CPU(cpu);
> 	CPU_CLR(cpu, &mask);
> 	tdq_lock_pair(tdq, steal);
> 	if (tdq->tdq_load != 0) {
> 		goto out;	/* exit the loop and switch to the new thread */
> 	}
> 	if (flag was cleared) {
> 		tdq_unlock_pair(tdq, steal);
> 		goto restart;	/* restart the search */
> 	}
> 	if (steal->tdq_load < thresh || steal->tdq_transferable == 0 ||
> 	    tdq_move(steal, tdq) == 0) {
> 		tdq_unlock_pair(tdq, steal);
> 		continue;
> 	}
> out:
> 	TDQ_UNLOCK(steal);
> 	clear flag;
> 	mi_switch(SW_VOL | SWT_IDLE, NULL);
> 	thread_unlock(curthread);
> 	return (0);
>
> And we also have to clear the flag if we did not find a thread to
> steal.

I've implemented something like this and added a bunch of counters to
it to get a better understanding of its behavior.  Instead of adding a
flag to detect preemption, I used the same switchcnt test that
sched_idletd() uses.
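
Roughly, the per-candidate part of the loop now looks like the sketch
below.  This is an illustration of the idea rather than the exact diff:
switchcnt is assumed to have been sampled from tdq_switchcnt +
tdq_oldswitchcnt before the search started, the same way sched_idletd()
samples it, and the other names are the existing ones in sched_ule.c.

	/*
	 * Sketch only, not the exact patch.  'cpu' is the candidate
	 * returned by sched_highest(), and 'switchcnt' was sampled
	 * before the search started.
	 */
	steal = TDQ_CPU(cpu);
	CPU_CLR(cpu, &mask);
	tdq_lock_pair(tdq, steal);
	if (tdq->tdq_load != 0) {
		/* Local work arrived while we were searching; run it. */
		goto out;
	}
	if (switchcnt != tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt) {
		/* We were preempted; what sched_highest() saw is stale. */
		tdq_unlock_pair(tdq, steal);
		goto restart;
	}
	if (steal->tdq_load < steal_thresh || steal->tdq_transferable == 0 ||
	    tdq_move(steal, tdq) == 0) {
		/* No longer worth stealing from; try the next candidate. */
		tdq_unlock_pair(tdq, steal);
		continue;
	}
out:
	/* Drop the remote lock and switch to the thread now on our queue. */
	TDQ_UNLOCK(steal);
	mi_switch(SW_VOL | SWT_IDLE, NULL);
	thread_unlock(curthread);
	return (0);

The restart: path just re-samples switchcnt and starts the topology
search over, which is what the restart counter below is counting.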

These are the results of a ~9 hour poudriere run:

kern.sched.steal.none: 9971668      # no threads were stolen
kern.sched.steal.fail: 23709        # unable to steal from the cpu picked by sched_highest()
kern.sched.steal.level2: 191839     # stolen from somewhere on this chip
kern.sched.steal.level1: 557659     # stolen from a core on this CCX
kern.sched.steal.level0: 4555426    # stolen from the other SMT thread on this core
kern.sched.steal.restart: 404       # preemption detected, so the search was restarted
kern.sched.steal.call: 15276638     # number of times tdq_idled() was called

There are a few surprises here.  One is the number of failed moves.  I
don't know whether the load on the source CPU fell below steal_thresh,
tdq_transferable went to zero, or tdq_move() itself failed (a per-cause
breakdown, like the sketch at the end of this message, would answer
that).  I also wonder whether the failures are evenly distributed
across the CPUs.  It is possible that they are concentrated on CPU 0,
which handles most interrupts; if interrupts don't affect switchcnt,
then the data collected by sched_highest() could be a bit stale and we
would not know it.

Something else that I did not expect is how frequently threads are
stolen from the other SMT thread on the same core, even though I
increased steal_thresh from 2 to 3 to account for the off-by-one
problem.  This is true even right after the system has booted, before
any significant load has been applied.  My best guess is that because
of affinity, both the parent and the child run on the same CPU after
fork(), and if a number of processes are forked in quick succession,
the run queue of that CPU can get quite long.  Forcing a thread
migration in exec() might be a good solution.
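
For the failed-move question above, one way to tell the three cases
apart would be to split kern.sched.steal.fail into one counter per
cause.  An untested sketch of what I have in mind is below; the counter
names are placeholders, and the sysctl wiring assumes the
kern.sched.steal node in my patch is an ordinary SYSCTL_NODE declared
as _kern_sched_steal.

/* File-scope, next to the existing kern.sched.steal.* counters. */
static unsigned long steal_fail_thresh;
static unsigned long steal_fail_transferable;
static unsigned long steal_fail_move;
SYSCTL_ULONG(_kern_sched_steal, OID_AUTO, fail_thresh, CTLFLAG_RD,
    &steal_fail_thresh, 0, "candidate load dropped below steal_thresh");
SYSCTL_ULONG(_kern_sched_steal, OID_AUTO, fail_transferable, CTLFLAG_RD,
    &steal_fail_transferable, 0, "candidate had no transferable threads");
SYSCTL_ULONG(_kern_sched_steal, OID_AUTO, fail_move, CTLFLAG_RD,
    &steal_fail_move, 0, "tdq_move() failed");

	/*
	 * In the loop, replace the single combined test with one branch
	 * per cause.  The increments are not atomic, but that should be
	 * close enough for rough statistics.
	 */
	if (steal->tdq_load < steal_thresh)
		steal_fail_thresh++;
	else if (steal->tdq_transferable == 0)
		steal_fail_transferable++;
	else if (tdq_move(steal, tdq) == 0)
		steal_fail_move++;
	else
		goto out;	/* success; switch to the stolen thread */
	tdq_unlock_pair(tdq, steal);
	continue;

The same trick, with the counters made per-CPU, would also show whether
the failures are concentrated on CPU 0.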