Date:      Mon, 09 Oct 2017 06:35:31 +0000
From:      bugzilla-noreply@freebsd.org
To:        freebsd-bugs@FreeBSD.org
Subject:   [Bug 219399] System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...
Message-ID:  <bug-219399-8-R1qdHFXI09@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-219399-8@https.bugs.freebsd.org/bugzilla/>
References:  <bug-219399-8@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #259 from Don Lewis <truckman@FreeBSD.org> ---
(In reply to SF from comment #256)
Neither of my AM4 boards has a VRM frequency adjustment, and none of my large
collection of non-AM4 boards has it either.  I think this feature is pretty
rare.

The highest temperature that I observed in my testing was about 62 C, and that
was only on very hot afternoons in an un-airconditioned room.  We only
recently got temperature monitoring working in FreeBSD for Ryzen, so I don't
know what the CPU temperature was in my early testing, but the room
temperature was probably 10 C lower on my overnight tests and it didn't seem
to make any difference.  Disabling all but two cores in the BIOS also didn't
make the errors go away.  That should have reduced power consumption and heat
dissipation to something like 25 W.  Reducing the CPU and RAM clock
frequencies also did not help.  Forcing the cooling fans to run at full speed
full time also did not help, even though the default fan curve never cranked
the fan speed up that high.  This doesn't look like a thermal or voltage
regulation issue to me.

The only thing that really seemed to improve the results that I was seeing
was tweaking the scheduler to limit the migration of threads between cores,
and the effect was not at all subtle.

The AMD Community Forum thread that I cited has posts from a large number of
Linux users who were experiencing the random segfault problem.  Many of them
worked with AMD customer support, who suggested trying a number of different
things (mostly voltage tweaks, disabling SMT, disabling OPCACHE, etc.) that
really didn't seem to solve the problem.  At best they reduced the frequency
of the errors.  AMD does now say that there is a "performance marginality"
issue and has been doing warranty replacements of CPUs for users who have
this problem.  Generally, people who have gotten replacement CPUs have been
happy with the results.  I don't think AMD would be spending the money to do
this if the problem could be fixed with a motherboard BIOS upgrade that would
tweak the default VRM settings.  Apparently AMD is now able to screen for
this problem, because they also stated that Threadripper is not affected, and
it uses two of the Ryzen dies (with the same stepping as the Ryzen CPU chips).

In my case, I just received a warranty CPU replacement.  The random compiler
segfaults are now gone.  The only info that I had to send AMD was my CPU part
and serial numbers, a description of my hardware (PSU, RAM, motherboard, BIOS
revision, etc.), a photo of the BIOS screen showing voltages and
temperatures, and a photo of my case interior so they could look for any
potential cooling problems.  Based on that, they approved an RMA and sent me
a replacement CPU.  It doesn't look like they thought that any BIOS tuning
tweaks would be worth trying.  I still see some random build failures, but I
see the same sorts of failures on my AMD FX-8320E.
