From: Don Lewis <truckman@FreeBSD.org>
Date: Wed, 23 Aug 2017 13:58:42 -0700 (PDT)
Subject: Re: ULE steal_idle questions
To: avg@FreeBSD.org
cc: freebsd-arch@FreeBSD.org

On 23 Aug, Andriy Gapon wrote:
> On 23/08/2017 18:04, Don Lewis wrote:
>> I've been looking at the steal_idle code in tdq_idled() and found some
>> things that puzzle me.
>>
>> Consider a machine with three CPUs:
>>   A, which is idle
>>   B, which is busy running a thread
>>   C, which is busy running a thread and has another thread in queue
>> It would seem to make sense that the tdq_load values for these three
>> CPUs would be 0, 1, and 2 respectively in order to select the best CPU
>> to run a new thread.
>>
>> If so, then why do we pass thresh=1 to sched_highest() in the code that
>> implements steal_idle?  That value is used to set cs_limit, which is
>> used in this comparison in cpu_search():
>>
>> 	if (match & CPU_SEARCH_HIGHEST)
>> 		if (tdq->tdq_load >= hgroup.cs_limit &&
>>
>> That would seem to make CPU B a candidate for stealing a thread from.
>> Ignoring CPU C for the moment, that shouldn't happen if the thread is
>> running, but even if it were possible, it would just make CPU B go
>> idle, which isn't terribly helpful in terms of load balancing and
>> would just thrash the caches.  The same comparison is repeated in
>> tdq_idled() after a candidate CPU has been chosen:
>>
>> 	if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
>> 		tdq_unlock_pair(tdq, steal);
>> 		continue;
>> 	}
>>
>> It looks to me like there is an off-by-one error here, and there is a
>> similar problem in the code that implements kern.sched.balance.
>
> I agree with your analysis.  I had the same questions as well.
> I think that the tdq_transferable check is what saves the code from
> running into any problems.  But it indeed would make sense for the code
> to understand that tdq_load includes a currently running, never
> transferable thread as well.

Yes, I think the candidate will fail the tdq_transferable check, but at
the cost of an unnecessary tdq_lock_pair()/tdq_unlock_pair() and another
loop iteration, including another expensive sched_highest() call.
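
To make the off-by-one concrete, here is a toy userland illustration
(my own names and function, not anything from sched_ule.c) of the test
I think the steal path effectively needs, given that tdq_load counts
the currently running, never transferable thread:

/*
 * Toy sketch only, not kernel code.  "load" stands in for tdq_load
 * (running + queued) and "transferable" for tdq_transferable.
 */
#include <stdbool.h>
#include <stdio.h>

static bool
worth_stealing(int load, int transferable)
{
	/*
	 * load == 1 is just the thread already running on that CPU;
	 * taking it would only move the work and thrash the caches.
	 * Only load >= 2 (i.e. thresh = 2) means something is actually
	 * waiting in the queue, and the transferable count must confirm
	 * that it is allowed to move.
	 */
	return (load >= 2 && transferable > 0);
}

int
main(void)
{
	printf("CPU B (load 1): %d\n", worth_stealing(1, 0));	/* 0 */
	printf("CPU C (load 2): %d\n", worth_stealing(2, 1));	/* 1 */
	return (0);
}

With thresh=1 the current code treats CPU B as a candidate and only the
tdq_transferable check saves us, after the lock dance; with thresh=2,
CPU B would never be considered in the first place.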
Consider the case of a close to fully loaded system where all of the
other CPUs each have one running thread.  The current code will try each
of the other CPUs in turn, calling tdq_lock_pair(), failing the
tdq_transferable check, calling tdq_unlock_pair(), and then restarting
the loop with that CPU removed from the mask, all of it done with
interrupts disabled.  The proper thing to do in this case would be to
just go into the idle state.  (The toy model at the end of this message
counts those wasted scans and lock/unlock pairs.)

>> The reason I ask is that I've been debugging random segfaults and
>> other strange errors on my Ryzen machine and the problems mostly go
>> away if I either disable kern.sched.steal_idle and kern.sched.balance,
>> or if I leave kern.sched.steal_idle enabled and hack the code to
>> change the value of thresh from 1 to 2.  See for the gory details.
>> I don't know if my CPU has what AMD calls the "performance
>> marginality issue".
>
> I have been following your experiments and it's interesting that
> "massaging" the CPU in certain ways makes it a bit happier.  But
> certainly the fault is with the CPU as the code is trouble-free on many
> different architectures including x86, and various processors from both
> Intel and AMD [with earlier CPU families].

The results of my experiments so far are pointing to the time spent
looping in tdq_idled() as at least one cause of the problems.  If I
restrict tdq_idled() to look at just a single core, things are happy.
If I then set steal_thresh to 1 so that it has to loop, things get
unhappy.  If I allow tdq_idled() to look at both the current core and
the current CCX, with thresh at the CCX level set to 1 so that the code
loops, things are unhappy.  If I set thresh to 2 so that the code does
not loop unnecessarily, things get happy again.  If I allow tdq_idled()
to look at the entire topology (so that there are now three calls to
sched_highest()), things get unhappy again.  There is a known issue with
executing IRET on one SMT thread while the other SMT thread on that core
is "busy", but I don't know how that would affect things.  Perhaps that
can be fixed in microcode.

sched_highest() looks like it is really expensive in terms of CPU
cycles.  On Ryzen, if we can't find a suitable thread on the current CCX
to transfer and step up to the chip level, sched_highest() will
recalculate the load on the current CCX even though we have already
rejected it.  Things get worse when using cpuset, because if tdq_move()
fails due to cpuset (or other) restrictions, we call sched_highest() all
over again to redo all of the calculations, but with the previously
chosen CPU removed from the potential choices.  This will get even worse
with Threadripper and beyond.  Even if it doesn't cause obvious
breakage, it is bad for interrupt latency.

I'm not convinced that the ghc and go problems are Ryzen bugs rather
than bugs in the code for those two ports.  I've never seen build
failures for those on my FX-8320E, but the increased number of threads
on Ryzen might expose some latent problems.
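
To put a rough number on the looping cost, here is a small userland
model I threw together.  It is only a sketch under my own assumptions
(a flat 8-CPU group, no topology levels, no cpuset restrictions, and
made-up helper names), not the real scheduler code, but it shows what
happens in the fully loaded case above: with thresh=1 the idle CPU does
a full rescan plus a lock/unlock pair for every other CPU before it
finally gives up, while thresh=2 gives up after a single scan.

/*
 * Toy model of the stealing loop in tdq_idled(); a sketch with invented
 * names, not the code in sys/kern/sched_ule.c.  CPU 0 is idle; every
 * other CPU runs exactly one thread with nothing queued, so there is
 * nothing that can legitimately be stolen.
 */
#include <stdio.h>

#define	NCPU	8

struct cpu {
	int	load;		/* stands in for tdq_load */
	int	transferable;	/* stands in for tdq_transferable */
};

/* Crude stand-in for sched_highest(): scan every CPU left in the mask. */
static int
highest_loaded(const struct cpu cpus[], const int mask[], int thresh,
    int *scans)
{
	int best = -1, best_load = 0;

	(*scans)++;
	for (int i = 1; i < NCPU; i++)
		if (mask[i] && cpus[i].load >= thresh &&
		    cpus[i].load > best_load) {
			best = i;
			best_load = cpus[i].load;
		}
	return (best);
}

int
main(void)
{
	struct cpu cpus[NCPU] = { { 0, 0 } };
	int mask[NCPU];

	for (int i = 1; i < NCPU; i++) {
		cpus[i].load = 1;		/* one running thread */
		cpus[i].transferable = 0;	/* nothing we may move */
	}

	for (int thresh = 1; thresh <= 2; thresh++) {
		int scans = 0, lock_pairs = 0, cpu;

		for (int i = 0; i < NCPU; i++)
			mask[i] = 1;
		while ((cpu = highest_loaded(cpus, mask, thresh,
		    &scans)) != -1) {
			lock_pairs++;		/* tdq_lock_pair()/unlock */
			if (cpus[cpu].load < thresh ||
			    cpus[cpu].transferable == 0) {
				mask[cpu] = 0;	/* drop it, rescan the rest */
				continue;
			}
			break;			/* we would steal here */
		}
		printf("thresh=%d: %d scans, %d lock/unlock pairs\n",
		    thresh, scans, lock_pairs);
	}
	return (0);
}

On my reading this prints 8 scans and 7 lock/unlock pairs for thresh=1
versus a single scan and no locking for thresh=2, and the real code only
makes it worse by repeating the whole dance at the core, CCX, and chip
levels and again after cpuset-related tdq_move() failures.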