From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 221029] AMD Ryzen: strange compilation failures using poudriere or plain buildkernel/buildworld
Date: Sat, 19 Aug 2017 06:10:27 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029

--- Comment #73 from Don Lewis ---
When I've examined a ghc core file, gdb thought that rip was pointing at code and allowed me to disassemble it. I didn't see anything that looked like it could cause SIGBUS. I don't think I've ever had a successful ghc build on my Ryzen machine.

For a while I've been suspicious that the problems are triggered by the migration of threads between CPU cores. One thing that made me suspect this is that most of the early tests that people did, like running games and synthetic tests like prime95, would create a fixed number of threads that probably always stayed running on the same cores. Parallel software builds are a lot more chaotic, with lots of processes being created and destroyed, and with a lot of thread migration being necessary to keep the load on all cores roughly balanced.

For the last week or so I've been running experiments where I start multiple parallel buildworlds at the same time but with different MAKEOBJDIRPREFIX values and different cpuset cpu masks. I was looking for any evidence that migration between different threads on the same core, or between different cores in the same CCX, or between different CCXs would trigger build failures. The interesting result is that I observed no failures at all! One possibility is that my test script was buggy and was missing build failures. Another is that the value that I used for "make -j" vs. the number of logical cpus in the cpuset was not resulting in much migration. A third is that the use of cpuset was inhibiting the ability of the scheduler to migrate threads to balance the load across all cores.

I started looking at the scheduler code to see if I could understand what might be going on, but the code is pretty confusing. I did stumble across some nice sysctl tuning knobs that looked like they might be interesting to experiment with.
The first is kern.sched.balance "Enables the long-term load balancer". This is enabled by default and periodically moves threads from the most loaded CPU to the least loaded CPU. I disabled this. The next knob is kern.sched.steal_idle "Attempts to steal work from other cores before idling". I disabled this as well. The last is kern.sched.affinity, "Number of hz ticks to keep thread affinity for". I think if the previous two knobs are turned off, this will only come into play if a thread has been sleeping more than the specified time. If so, it probably gets scheduled on the CPU with the least load when the thread wakes up. The default value is 1. I cranked it up to 1000, which should be long enough for any of its state in cache to have been fully flushed.

After using this big hammer, I started a poudriere run to build my set of ~1700 ports. The result was interesting. The only two failures were the typical ghc SIGBUS failure, and chromium failed to build with the rename problem. CPU utilization wasn't great due to some cores running out of work to do, so I typically saw 5%-10% idle times during the poudriere run.

I think that the affinity knob is probably the key one here. I'll try cranking it down to something a bit lower and re-enabling the balancing algorithms to see what happens.
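For anyone wanting to repeat the experiment, the three knob changes described above could be scripted roughly as follows. This is only a sketch: the sysctl names and values are the ones given in this comment, while the helper names (sysctl_commands, apply_settings) are made up for illustration. It defaults to a dry run that just prints the commands; actually applying them is assumed to need root on a FreeBSD kernel that exposes the kern.sched.* knobs.

```python
import shlex
import subprocess

# Values reported in the comment: long-term balancer off, idle stealing off,
# affinity raised to 1000 hz ticks.
SCHED_SETTINGS = {
    "kern.sched.balance": 0,
    "kern.sched.steal_idle": 0,
    "kern.sched.affinity": 1000,
}


def sysctl_commands(settings):
    """Return one sysctl(8) argv list per name=value pair, in dict order."""
    return [["sysctl", f"{name}={value}"] for name, value in settings.items()]


def apply_settings(settings, dry_run=True):
    """Return the command lines; run them only when dry_run is False."""
    lines = []
    for argv in sysctl_commands(settings):
        lines.append(shlex.join(argv))
        if not dry_run:
            # Raises CalledProcessError if sysctl rejects the setting.
            subprocess.run(argv, check=True)
    return lines


if __name__ == "__main__":
    for line in apply_settings(SCHED_SETTINGS):
        print(line)
```

Keeping the dry run as the default makes it easy to review the exact commands before changing scheduler behavior on a live machine; passing dry_run=False is what would actually invoke sysctl.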