Date: Mon, 09 Oct 2017 06:35:31 +0000
From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 219399] System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...
Message-ID: <bug-219399-8-R1qdHFXI09@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-219399-8@https.bugs.freebsd.org/bugzilla/>
References: <bug-219399-8@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #259 from Don Lewis <truckman@FreeBSD.org> ---
(In reply to SF from comment #256)

Neither of my AM4 boards have a VRM frequency adjustment, and none of my large collection of non-AM4 boards have it either. I think this feature is pretty rare.

The highest temperature that I observed in my testing was about 62 C, and that was only on very hot afternoons in an un-airconditioned room. We only recently got temperature monitoring working in FreeBSD for Ryzen, so I don't know what the CPU temperature was in my early testing, but the room temperature was probably 10 C lower on my overnight tests and it didn't seem to make any difference. Disabling all but two cores in the BIOS also didn't make the errors go away. That should have reduced power consumption and heat dissipation to something like 25W. Reducing the CPU and RAM clock frequencies also did not help. Forcing the cooling fans to run at full speed full time also did not help. The default fan curve never cranked up the fan speed this high. This doesn't look like a thermal or voltage regulation issue to me.

The only thing that really seemed to improve the results that I was seeing was tweaking the scheduler to limit the migration of threads between cores, and the effect was not at all subtle.

The AMD Community Forum thread that I cited has posts from a large number of Linux users who were experiencing the random segfault problem. Many of them worked with AMD customer support who suggested trying a number of different things (mostly voltage tweaks, disabling SMT, disabling OPCACHE, etc.) that really didn't seem to solve the problem. At best they reduced the frequency of the errors. AMD does now say that there is a "performance marginality" issue and has been doing warranty replacements of CPUs for users who have this problem and generally people who have gotten replacement CPUs have been happy with the results. I don't think AMD would be spending the money to do this if the problem could be fixed with a motherboard BIOS upgrade that would tweak the default VRM settings. Apparently AMD is now able to screen for this problem because they also stated that Threadripper is not affected and it uses two of the Ryzen die (with the same stepping as the Ryzen CPU chips).

In my case, I just received a warranty CPU replacement. The random compiler segfaults are now gone. The only info that I had to send AMD was my CPU part and serial numbers, a description of my hardware (PSU, RAM, motherboard, BIOS revision, etc.), a photo of the BIOS screen showing voltages and temperatures, and a photo of my case interior so they could look for any potential cooling problems. Based on that, they approved an RMA and sent me a replacement CPU. It doesn't look like they thought that any BIOS tuning tweaks would be worth trying.

I still see some random build failures, but I see the same sorts of failures on my AMD FX-8320E.

-- 
You are receiving this mail because:
You are the assignee for the bug.