From owner-freebsd-arch@freebsd.org Sat Aug 26 17:50:25 2017
Message-Id: <201708261750.v7QHoG2c053745@gw.catspoiler.org>
Date: Sat, 26 Aug 2017 10:50:16 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: ULE steal_idle questions
To: avg@FreeBSD.org
Cc: freebsd-arch@FreeBSD.org
In-Reply-To: <201708251824.v7PIOA6q048321@gw.catspoiler.org>

On 25 Aug, To: avg@FreeBSD.org wrote:
> On 24 Aug, To: avg@FreeBSD.org wrote:
>> Aside from the Ryzen problem, I think the steal_idle code should be
>> rewritten so that it doesn't block interrupts for so long.  In its
>> current state, interrupt latency increases with the number of cores
>> and the complexity of the topology.
>>
>> What I'm thinking is that we should set a flag at the start of the
>> search for a thread to steal.  If we are preempted by another,
>> higher-priority thread, that thread will clear the flag.  Next we
>> start the loop to search up the hierarchy.  Once we find a candidate
>> CPU:
>>
>>     steal = TDQ_CPU(cpu);
>>     CPU_CLR(cpu, &mask);
>>     tdq_lock_pair(tdq, steal);
>>     if (tdq->tdq_load != 0) {
>>         goto out;    /* exit the loop and switch to the new thread */
>>     }
>>     if (flag was cleared) {
>>         tdq_unlock_pair(tdq, steal);
>>         goto restart;    /* restart the search */
>>     }
>>     if (steal->tdq_load < thresh || steal->tdq_transferable == 0 ||
>>         tdq_move(steal, tdq) == 0) {
>>         tdq_unlock_pair(tdq, steal);
>>         continue;
>>     }
>> out:
>>     TDQ_UNLOCK(steal);
>>     clear flag;
>>     mi_switch(SW_VOL | SWT_IDLE, NULL);
>>     thread_unlock(curthread);
>>     return (0);
>>
>> And we also have to clear the flag if we did not find a thread to
>> steal.
>
> I've implemented something like this and added a bunch of counters to
> it to get a better understanding of its behavior.  Instead of adding a
> flag to detect preemption, I used the same switchcnt test as is used
> by sched_idletd().  These are the results of a ~9-hour poudriere run:
>
> kern.sched.steal.none:       9971668  # no threads were stolen
> kern.sched.steal.fail:         23709  # unable to steal from the CPU chosen by sched_highest()
> kern.sched.steal.level2:      191839  # stolen from somewhere else on this chip
> kern.sched.steal.level1:      557659  # stolen from another core on this CCX
> kern.sched.steal.level0:     4555426  # stolen from the other SMT thread on this core
> kern.sched.steal.restart:        404  # preemption detected, so restart the search
> kern.sched.steal.call:      15276638  # number of times tdq_idled() was called
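
To make the control flow concrete, here is a minimal sketch of the
switchcnt-based version of that loop.  It is a sketch only, assuming
the surrounding sched_ule.c definitions (struct tdq and its fields,
sched_highest(), tdq_lock_pair(), tdq_move(), steal_thresh); the
authoritative code is in the review linked at the end of this message.

/*
 * Sketch of a restartable steal search for tdq_idled().  Every context
 * switch on this CPU bumps tdq_switchcnt, so comparing against a
 * snapshot taken at the start of the search detects preemption without
 * a separate flag.
 */
static int
tdq_idled(struct tdq *tdq)
{
        struct cpu_group *cg;
        struct tdq *steal;
        cpuset_t mask;
        int cpu, switchcnt, thresh;

        CPU_FILL(&mask);
        CPU_CLR(PCPU_GET(cpuid), &mask);  /* never steal from ourselves */
restart:
        /* The same idleness test that sched_idletd() uses. */
        switchcnt = tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt;
        /* Search upward through the topology: SMT siblings, cores, chip. */
        for (cg = tdq->tdq_cg; cg != NULL; cg = cg->cg_parent) {
                if ((cg->cg_flags & (CG_FLAG_HTT | CG_FLAG_THREAD)) == 0)
                        thresh = steal_thresh;
                else
                        thresh = 1;
                while ((cpu = sched_highest(cg, mask, thresh)) != -1) {
                        steal = TDQ_CPU(cpu);
                        CPU_CLR(cpu, &mask);
                        tdq_lock_pair(tdq, steal);
                        if (tdq->tdq_load != 0)
                                goto out;  /* work arrived here; run it */
                        if (switchcnt != tdq->tdq_switchcnt +
                            tdq->tdq_oldswitchcnt) {
                                /* Preempted; our load data may be stale. */
                                tdq_unlock_pair(tdq, steal);
                                goto restart;
                        }
                        if (steal->tdq_load < thresh ||
                            steal->tdq_transferable == 0 ||
                            tdq_move(steal, tdq) == 0) {
                                /* The victim emptied under us; try the next. */
                                tdq_unlock_pair(tdq, steal);
                                continue;
                        }
out:
                        /*
                         * Our own tdq stays locked across mi_switch();
                         * it is the idle thread's thread lock.
                         */
                        TDQ_UNLOCK(steal);
                        mi_switch(SW_VOL | SWT_IDLE, NULL);
                        thread_unlock(curthread);
                        return (0);
                }
        }
        return (1);
}

The kern.sched.steal.restart counter above counts trips through the
goto restart path.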
> There are a few surprises here.
>
> One is the number of failed moves.  I don't know if the load on the
> source CPU fell below thresh, tdq_transferable went to zero, or if
> tdq_move() failed.  I also wonder if the failures are evenly
> distributed across CPUs.  It is possible that these failures are
> concentrated on CPU 0, which handles most interrupts.  If interrupts
> don't affect switchcnt, then the data collected by sched_highest()
> could be a bit stale and we would not know it.

Most of the above failed moves were due to either tdq_load dropping
below the threshold or tdq_transferable going to zero.  These are
evenly distributed across the CPUs that we want to steal from.  I did
not bin the results by which CPU this code was running on.  Actual
failures of tdq_move() are bursty and not evenly distributed across
CPUs.

I've created this review for my changes:

https://reviews.freebsd.org/D12130
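
For completeness, here is a hypothetical way to do the binning by
searching CPU mentioned above.  None of these counter names exist in
D12130; a sysctl handler to export the per-CPU arrays is omitted, and
counter(9) per-CPU counters would avoid cache-line contention, but
plain longs keep a one-off experiment simple.

/*
 * Hypothetical instrumentation: count, per searching CPU, why a steal
 * attempt failed after sched_highest() had nominated a victim.
 */
static long steal_fail_load[MAXCPU];  /* victim load fell below thresh */
static long steal_fail_xfer[MAXCPU];  /* tdq_transferable went to zero */
static long steal_fail_move[MAXCPU];  /* tdq_move() itself failed */

/* This would replace the combined test in the search loop sketched
 * earlier, binning each failure by cause before moving on: */
if (steal->tdq_load < thresh)
        steal_fail_load[PCPU_GET(cpuid)]++;
else if (steal->tdq_transferable == 0)
        steal_fail_xfer[PCPU_GET(cpuid)]++;
else if (tdq_move(steal, tdq) == 0)
        steal_fail_move[PCPU_GET(cpuid)]++;
else
        goto out;        /* the move succeeded */
tdq_unlock_pair(tdq, steal);
continue;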