Date:      Thu, 20 Jul 2017 00:00:56 +0000
From:      bugzilla-noreply@freebsd.org
To:        freebsd-bugs@FreeBSD.org
Subject:   [Bug 219399] System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...
Message-ID:  <bug-219399-8-QDE3cF8mC3@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-219399-8@https.bugs.freebsd.org/bugzilla/>
References:  <bug-219399-8@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #70 from Don Lewis <truckman@FreeBSD.org> ---
I now think that AGESA 1006 actually didn't fix anything for me.  I must
have gotten lucky with that first poudriere run after the BIOS upgrade.  The
next time I ran poudriere, I got a silent reboot after ~3 hours.  The times
to failure just looked too consistent to me, so I looked at the poudriere
build logs to see what was being built at the time of the crash.  One of the
ports being built was openjdk7.  One of the ports that got built when I
restarted poudriere to build the remaining ports that failed after the BIOS
upgrade was openoffice, which uses java, so things started making sense.

If I try building openjdk7, I can pretty much consistently trigger a system
reboot, even with SMT off, only two cores enabled in the BIOS, the CPU clock
speed lowered to 3 GHz, and the RAM clock cranked down from 2400 MHz to 1866
MHz.

When I then marked openjdk7 BROKEN so that poudriere doesn't build it and
skips the ports that depend on it, the system stayed up and poudriere ran
for almost 9 hours, though two ports failed with the jemalloc assertion
failure that I previously mentioned.

I also now think that the Dragonfly patch isn't needed on FreeBSD and
potentially could be harmful.  It is meant to work around what looks like a
Ryzen SMT bug.  The problem appears to be triggered by executing code close
to the top of user address space.  On Dragonfly, the signal trampoline code
is located just above the stack and very close to the top of user address
space.  By adding space to the end of sigtramp.S, the trampoline code is
moved to a lower starting address.  On FreeBSD, the signal trampoline code
was moved to a separate memory page so that the stack could be marked
non-executable.  This page is located at the very top of user address space.
I haven't looked at what all is in this page, but if the contents are loaded
starting at the bottom of the page, then the start of the signal trampoline
is likely to be at a lower address than on Dragonfly.  If other code is
loaded in this page after the signal trampoline, then adding space at the
end of the trampoline could move that code closer to the danger zone.  In
any case, I had been doing much of my testing with SMT disabled, so I
removed this patch from my kernel.
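
To make the address reasoning above concrete, here is a rough sketch in
Python.  It is only an illustration under stated assumptions, not a
description of either kernel: USER_TOP is an assumed end of the amd64 user
address space, the "end-anchored" layout is a simplified stand-in for the
Dragonfly placement described above, the "page-anchored" layout is a
simplified stand-in for the FreeBSD placement, and all sizes are made up.

```python
PAGE_SIZE = 0x1000
# Assumption: user address space ends here (exclusive) on amd64.
USER_TOP = 0x0000_8000_0000_0000

def end_anchored_start(code_size, pad=0):
    """Start address when the image (code + trailing pad) ends at USER_TOP.

    Simplified stand-in for a trampoline placed right below the top of
    user address space: padding appended to the image lowers the start.
    """
    return USER_TOP - (code_size + pad)

def page_anchored_start(offset=0):
    """Start address when contents are laid out from the bottom of the
    last user page, 'offset' bytes into that page."""
    return (USER_TOP - PAGE_SIZE) + offset

def gap_to_top(start, size):
    """Bytes between the end of a code block and USER_TOP."""
    return USER_TOP - (start + size)

# Hypothetical sizes, chosen only for illustration.
TRAMP_SIZE = 64
PAD = 256
OTHER_SIZE = 128

if __name__ == "__main__":
    # End-anchored layout: padding moves the trampoline away from the top.
    for pad in (0, PAD):
        start = end_anchored_start(TRAMP_SIZE, pad)
        print("end-anchored, pad", pad, "gap", gap_to_top(start, TRAMP_SIZE))

    # Page-anchored layout: the trampoline itself is already far from the
    # top, but padding after it pushes any following code toward the top.
    for pad in (0, PAD):
        other = page_anchored_start(TRAMP_SIZE + pad)
        print("page-anchored, pad", pad, "gap", gap_to_top(other, OTHER_SIZE))
```

In this model the padding helps only in the end-anchored layout.  In the
page-anchored layout it shrinks the gap for whatever follows the trampoline,
which is the concern raised above.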

After backing out the Dragonfly patch, marking bootstrap-openjdk as BROKEN
to eliminate any vestige of java, and setting the RAM and CPU clocks back to
auto, I ran poudriere again and the run was mostly successful, though I did
see a lang/go build failure due to a runaway build problem.

I then enabled SMT and core performance boost and ran poudriere again.  I
observed build failures of lang/go, gdb, and cairo.  I didn't see any
obvious problems with the latter two; it looked like something in each just
returned the wrong exit status.  A restarted poudriere run successfully
built the latter two, but go failed again.  The go failures appeared to be
caused by some sort of corruption of its malloc state.  Note: go is
multi-threaded.

Just for grins, I decided to try building ports in an i386 jail.  I got no
unexpected failures.  The results were the same when I re-enabled the java
ports.  It successfully built 1594 ports in 8 hours 33 minutes.  I was even
able to build lang/ghc on i386.  That one always had segfaults in the
bootstrap compiler for me on amd64.  I have no idea if it uses threads,
though.

At least on my hardware there are one or more problems with amd64 code.  It
might just be multi-threaded processes.  The java problem could also be
caused by the hotspot compiler, whose output may look like self-modifying
code.  In any case, it can cause system hangs or reboots and may also
corrupt the state of other processes.  I finally received the hardware to
set up a serial console yesterday, but I haven't had time to install it yet.
The reboots that I've seen don't seem to leave any trace in the logs, don't
seem to trigger ddb, and don't leave crash dumps.

-- 
You are receiving this mail because:
You are the assignee for the bug.


