Date: Sat, 19 Aug 2017 06:10:27 +0000
From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 221029] AMD Ryzen: strange compilation failures using poudriere or plain buildkernel/buildworld
Message-ID: <bug-221029-8-2ydrLb5A8y@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-221029-8@https.bugs.freebsd.org/bugzilla/>
References: <bug-221029-8@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029

--- Comment #73 from Don Lewis <truckman@FreeBSD.org> ---
When I've examined a ghc core file, gdb thought that rip was pointing at code and allowed me to disassemble it. I didn't see anything that looked like it could cause SIGBUS. I don't think I've ever had a successful ghc build on my Ryzen machine.

For a while I've been suspicious that the problems are triggered by the migration of threads between CPU cores. One thing that made me suspect this is that most of the early tests people ran, like games and synthetic benchmarks such as prime95, create a fixed number of threads that probably always stay on the same cores. Parallel software builds are a lot more chaotic, with lots of processes being created and destroyed and a lot of thread migration needed to keep the load on all cores roughly balanced.

For the last week or so I've been running experiments where I start multiple parallel buildworlds at the same time, each with a different MAKEOBJDIRPREFIX value and a different cpuset CPU mask. I was looking for any evidence that migration between hardware threads on the same core, between different cores in the same CCX, or between different CCXs would trigger build failures. The interesting result is that I observed no failures at all! One possibility is that my test script was buggy and missed build failures. Another is that the "make -j" value I used relative to the number of logical CPUs in the cpuset did not result in much migration. A third is that the use of cpuset was inhibiting the scheduler's ability to migrate threads to balance the load across all cores.

I started looking at the scheduler code to see if I could understand what might be going on, but the code is pretty confusing. I did stumble across some nice sysctl tuning knobs that looked like they might be interesting to experiment with. The first is kern.sched.balance, "Enables the long-term load balancer". It is enabled by default and periodically moves threads from the most loaded CPU to the least loaded CPU. I disabled this. The next knob is kern.sched.steal_idle, "Attempts to steal work from other cores before idling". I disabled this as well. The last is kern.sched.affinity, "Number of hz ticks to keep thread affinity for". I think that with the previous two knobs turned off, this only comes into play when a thread has been sleeping for longer than the specified time; if so, the thread probably gets scheduled on the least loaded CPU when it wakes up. The default value is 1. I cranked it up to 1000, which should be long enough for any of the thread's state in cache to have been fully flushed.

After applying this big hammer, I started a poudriere run to build my set of ~1700 ports. The result was interesting: the only two failures were the typical ghc SIGBUS failure, and chromium failing to build with the rename problem. CPU utilization wasn't great because some cores ran out of work to do, so I typically saw 5%-10% idle time during the poudriere run.

I think the affinity knob is probably the key one here. I'll try cranking it down to something a bit lower and re-enabling the balancing algorithms to see what happens.

-- 
You are receiving this mail because:
You are the assignee for the bug.
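[Editor's note: for anyone who wants to repeat the experiments described in the comment above, here is a minimal /bin/sh sketch. The sysctl names and the values 0/0/1000 are the ones quoted in the comment; everything else (the CPU ranges, object directory paths, -j level, and source tree location) is an illustrative assumption, not the exact setup used.]

    #!/bin/sh
    # Sketch only -- CPU lists, paths, and -j values are assumptions,
    # not the configuration from the bug report.

    # "Big hammer" scheduler tuning from the comment: disable the
    # long-term load balancer and idle-time work stealing, and keep
    # thread affinity for 1000 hz ticks instead of the default 1.
    sysctl kern.sched.balance=0
    sysctl kern.sched.steal_idle=0
    sysctl kern.sched.affinity=1000

    # Two concurrent buildworlds pinned to disjoint CPU sets, each with
    # its own object tree (MAKEOBJDIRPREFIX) so they don't collide.
    cpuset -l 0-7  env MAKEOBJDIRPREFIX=/usr/obj/set-a \
        make -C /usr/src -j 8 buildworld > /tmp/build-a.log 2>&1 &
    cpuset -l 8-15 env MAKEOBJDIRPREFIX=/usr/obj/set-b \
        make -C /usr/src -j 8 buildworld > /tmp/build-b.log 2>&1 &
    wait

Here cpuset -l <list> <command> runs the command restricted to the listed logical CPUs, and env sets MAKEOBJDIRPREFIX per invocation so each buildworld writes to its own object tree; comparing failure rates across different CPU lists (same core, same CCX, different CCXs) is the experiment the comment describes.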
