Date: Sat, 19 Aug 2017 06:10:27 +0000
From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 221029] AMD Ryzen: strange compilation failures using poudriere or plain buildkernel/buildworld
Message-ID: <bug-221029-8-2ydrLb5A8y@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-221029-8@https.bugs.freebsd.org/bugzilla/>
References: <bug-221029-8@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029

--- Comment #73 from Don Lewis <truckman@FreeBSD.org> ---
When I've examined a ghc core file, gdb thought that rip was pointing at code
and allowed me to disassemble it. I didn't see anything that looked like it
could cause SIGBUS. I don't think I've ever had a successful ghc build on my
Ryzen machine.

For a while I've been suspicious that the problems are triggered by the
migration of threads between CPU cores. One thing that made me suspect this
is that most of the early tests that people ran, such as games and synthetic
tests like prime95, create a fixed number of threads that probably always
stay running on the same cores. Parallel software builds are a lot more
chaotic, with lots of processes being created and destroyed and a lot of
thread migration needed to keep the load on all cores roughly balanced.

For the last week or so I've been running experiments where I start multiple
parallel buildworlds at the same time, but with different MAKEOBJDIRPREFIX
values and different cpuset CPU masks. I was looking for any evidence that
migration between different threads on the same core, between different
cores in the same CCX, or between different CCXs would trigger build
failures. The interesting result is that I observed no failures at all! One
possibility is that my test script was buggy and was missing build failures.
Another is that the value that I used for "make -j" relative to the number
of logical CPUs in the cpuset was not resulting in much migration. A third
is that the use of cpuset was inhibiting the scheduler's ability to migrate
threads to balance the load across all cores.

I started looking at the scheduler code to see if I could understand what
might be going on, but the code is pretty confusing. I did stumble across
some nice sysctl tuning knobs that looked like they might be interesting to
experiment with. The first is kern.sched.balance, "Enables the long-term
load balancer". This is enabled by default and periodically moves threads
from the most loaded CPU to the least loaded CPU. I disabled this. The next
knob is kern.sched.steal_idle, "Attempts to steal work from other cores
before idling". I disabled this as well. The last is kern.sched.affinity,
"Number of hz ticks to keep thread affinity for". I think that if the
previous two knobs are turned off, this will only come into play when a
thread has been sleeping for more than the specified time; if so, it
probably gets scheduled on the least loaded CPU when it wakes up. The
default value is 1. I cranked it up to 1000, which should be long enough for
any of the thread's state in cache to have been fully flushed.

After using this big hammer, I started a poudriere run to build my set of
~1700 ports. The result was interesting. The only two failures were the
typical ghc SIGBUS failure, and chromium failing to build with the rename
problem. CPU utilization wasn't great due to some cores running out of work
to do, so I typically saw 5%-10% idle times during the poudriere run.

I think that the affinity knob is probably the key one here. I'll try
cranking it down to something a bit lower and re-enabling the balancing
algorithms to see what happens.
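
For reference, a minimal sketch of the kind of pinned parallel-buildworld
experiment described above. The CPU lists, object directories, log paths,
and -j values here are illustrative assumptions, not the exact ones used:

    # Hypothetical setup: two concurrent buildworlds, each pinned with
    # cpuset(1) to one CCX worth of logical CPUs (0-7 and 8-15 assumed),
    # with separate object directories so the builds don't collide.
    cd /usr/src
    env MAKEOBJDIRPREFIX=/tmp/obj-a cpuset -l 0-7 \
        make -j8 buildworld > /tmp/bw-a.log 2>&1 &
    env MAKEOBJDIRPREFIX=/tmp/obj-b cpuset -l 8-15 \
        make -j8 buildworld > /tmp/bw-b.log 2>&1 &
    wait
    # Scan the logs for build failures afterwards:
    grep 'Error code' /tmp/bw-a.log /tmp/bw-b.log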
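
The scheduler knobs can be set at runtime with sysctl(8); the following
reflects the settings described above (a sketch; the same values could also
go in /etc/sysctl.conf to persist across reboots):

    # Disable the long-term load balancer (enabled by default).
    sysctl kern.sched.balance=0
    # Don't attempt to steal work from other cores before idling.
    sysctl kern.sched.steal_idle=0
    # Keep thread affinity for 1000 hz ticks instead of the default 1.
    sysctl kern.sched.affinity=1000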