Date: Sat, 19 Aug 2017 06:10:27 +0000
From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 221029] AMD Ryzen: strange compilation failures using poudriere or plain buildkernel/buildworld
Message-ID: <bug-221029-8-2ydrLb5A8y@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-221029-8@https.bugs.freebsd.org/bugzilla/>
References: <bug-221029-8@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029

--- Comment #73 from Don Lewis <truckman@FreeBSD.org> ---
When I've examined a ghc core file, gdb thought that rip was pointing at code
and allowed me to disassemble it. I didn't see anything that looked like it
could cause SIGBUS. I don't think I've ever had a successful ghc build on my
Ryzen machine.

For a while I've been suspicious that the problems are triggered by the
migration of threads between CPU cores. One thing that made me suspect this
is that most of the early tests that people ran, such as games and synthetic
tests like prime95, create a fixed number of threads that probably always
stay running on the same cores. Parallel software builds are a lot more
chaotic, with lots of processes being created and destroyed and a lot of
thread migration needed to keep the load on all cores roughly balanced.

For the last week or so I've been running experiments where I start multiple
parallel buildworlds at the same time, but with different MAKEOBJDIRPREFIX
values and different cpuset CPU masks. I was looking for any evidence that
migration between different threads on the same core, between different
cores in the same CCX, or between different CCXs would trigger build
failures. The interesting result is that I observed no failures at all! One
possibility is that my test script was buggy and was missing build failures.
Another is that the value that I used for "make -j" relative to the number
of logical CPUs in the cpuset was not resulting in much migration. A third
is that the use of cpuset was inhibiting the scheduler's ability to migrate
threads to balance the load across all cores.

I started looking at the scheduler code to see if I could understand what
might be going on, but the code is pretty confusing. I did stumble across
some nice sysctl tuning knobs that looked like they might be interesting to
experiment with. The first is kern.sched.balance, "Enables the long-term
load balancer". This is enabled by default and periodically moves threads
from the most loaded CPU to the least loaded CPU. I disabled this. The next
knob is kern.sched.steal_idle, "Attempts to steal work from other cores
before idling". I disabled this as well. The last is kern.sched.affinity,
"Number of hz ticks to keep thread affinity for". I think that if the
previous two knobs are turned off, this will only come into play when a
thread has been sleeping for more than the specified time; if so, it
probably gets scheduled on the least loaded CPU when it wakes up. The
default value is 1. I cranked it up to 1000, which should be long enough for
any of the thread's state in cache to have been fully flushed.

After using this big hammer, I started a poudriere run to build my set of
~1700 ports. The result was interesting. The only two failures were the
typical ghc SIGBUS failure, and chromium failing to build with the rename
problem. CPU utilization wasn't great due to some cores running out of work
to do, so I typically saw 5%-10% idle times during the poudriere run.

I think that the affinity knob is probably the key one here. I'll try
cranking it down to something a bit lower and re-enabling the balancing
algorithms to see what happens.
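
For reference, a minimal sketch of the kind of pinned parallel-buildworld
experiment described above. The CPU lists, object directories, log paths,
and -j values here are illustrative assumptions, not the exact ones used:

    # Hypothetical setup: two concurrent buildworlds, each pinned with
    # cpuset(1) to one CCX worth of logical CPUs (0-7 and 8-15 assumed),
    # with separate object directories so the builds don't collide.
    cd /usr/src
    env MAKEOBJDIRPREFIX=/tmp/obj-a cpuset -l 0-7 \
        make -j8 buildworld > /tmp/bw-a.log 2>&1 &
    env MAKEOBJDIRPREFIX=/tmp/obj-b cpuset -l 8-15 \
        make -j8 buildworld > /tmp/bw-b.log 2>&1 &
    wait
    # Scan the logs for build failures afterwards:
    grep 'Error code' /tmp/bw-a.log /tmp/bw-b.log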
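
The scheduler knobs can be set at runtime with sysctl(8); the following
reflects the settings described above (a sketch; the same values could also
go in /etc/sysctl.conf to persist across reboots):

    # Disable the long-term load balancer (enabled by default).
    sysctl kern.sched.balance=0
    # Don't attempt to steal work from other cores before idling.
    sysctl kern.sched.steal_idle=0
    # Keep thread affinity for 1000 hz ticks instead of the default 1.
    sysctl kern.sched.affinity=1000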