Date: Thu, 20 Jul 2017 00:00:56 +0000
From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 219399] System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...
Message-ID: <bug-219399-8-QDE3cF8mC3@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-219399-8@https.bugs.freebsd.org/bugzilla/>
References: <bug-219399-8@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #70 from Don Lewis <truckman@FreeBSD.org> ---
I now think that AGESA 1006 actually didn't fix anything for me. I must have gotten lucky with that first poudriere run after the BIOS upgrade. The next time I ran poudriere, I got a silent reboot after ~3 hours. The times to failure just looked too consistent to me, so I looked at the poudriere build logs to see what was being built at the time of the crash. One of them was openjdk7. One of the ports that got built when I restarted poudriere to build the remaining ports that failed after the BIOS upgrade was openoffice, which uses java, so things started making sense.

If I try building openjdk7, I can pretty much consistently trigger a system reboot, even with SMT off, only two cores enabled in the BIOS, the CPU clock speed lowered to 3 GHz, and the RAM clock cranked down from 2400 MHz to 1866 MHz.

After I marked openjdk7 BROKEN so that poudriere doesn't build it and skips the ports that depend on it, the system stayed up and poudriere ran for almost 9 hours, though two ports failed with the jemalloc assertion failure that I previously mentioned.

I also now think that the Dragonfly patch isn't needed on FreeBSD and potentially could be harmful. It is meant to work around what looks like a Ryzen SMT bug. The problem appears to be triggered by executing code close to the top of user address space. On Dragonfly, the signal trampoline code is located just above the stack and very close to the top of user address space. By adding space to the end of sigtramp.S, the trampoline code is moved to a lower starting address. On FreeBSD, the signal trampoline code was moved to a separate memory page so that the stack could be marked non-executable. This page is located at the very top of user address space.
I haven't looked at what all is in this page, but if the contents are loaded starting at the bottom of the page, then the start of the signal trampoline is likely to be at a lower address than on Dragonfly. If other code is loaded in this page after the signal trampoline, then adding space at the end could move that code closer to the danger zone. In any case, I had been doing much of my testing with SMT disabled, so I removed this patch from my kernel.

After backing out the Dragonfly patch, marking bootstrap-openjdk as BROKEN to eliminate any vestige of java, and setting the RAM and CPU clocks back to auto, I ran poudriere again and the run was mostly successful, though I did see a lang/go build failure due to a runaway build problem.

I then enabled SMT and core performance boost and ran poudriere again. I observed build failures of lang/go, gdb, and cairo. I didn't see any obvious problems with the latter two; it looked like something in each just returned the wrong exit status. A restarted poudriere run successfully built the latter two, but go failed again. The go failures appeared to be caused by some sort of corruption of its malloc state. Note: go is multi-threaded.

Just for grins, I decided to try building ports in an i386 jail. I got no unexpected failures. The results were the same when I re-enabled the java ports. It successfully built 1594 ports in 8 hours 33 minutes. I was even able to build lang/ghc on i386. That one always had segfaults in the bootstrap compiler for me on amd64. I have no idea if it uses threads, though.

At least on my hardware there are one or more problems with amd64 code. It might just be multi-threaded processes. The java problem could also be caused by the hotspot compiler, which may look like self-modifying code. In any case, it can cause system hangs or reboots and may also corrupt the state of other processes.
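
As a rough illustration of the trampoline address reasoning earlier in this comment, here is a minimal Python sketch. It is not derived from either kernel: the top-of-user-space address, the trampoline size, the padding size, and the size of the "other code" are all made-up placeholders, and both layout functions are simplified assumptions about how each system places its code.

```python
# Illustrative model only: every constant below is a made-up placeholder,
# not a value taken from the FreeBSD or Dragonfly kernel sources.
PAGE_SIZE = 4096
USER_TOP = 0x0000_8000_0000_0000  # hypothetical top of user address space


def top_anchored_trampoline(top, tramp_size, pad_after):
    """Assumed Dragonfly-like layout: the block ends at `top`, so padding
    appended after the trampoline pushes the trampoline to lower addresses."""
    end = top - pad_after
    return end - tramp_size, end


def bottom_loaded_page(top, tramp_size, pad_after, other_size):
    """Assumed FreeBSD-like layout: a separate page at the very top of user
    space, filled from its lowest address upward."""
    base = top - PAGE_SIZE
    tramp = (base, base + tramp_size)
    other_start = tramp[1] + pad_after
    other = (other_start, other_start + other_size)
    return tramp, other


def gap_to_top(span, top=USER_TOP):
    """Bytes between the end of a code span and the top of user space."""
    return top - span[1]


if __name__ == "__main__":
    for pad in (0, 256):
        dfly = top_anchored_trampoline(USER_TOP, 64, pad)
        tramp, other = bottom_loaded_page(USER_TOP, 64, pad, 256)
        print(f"pad={pad}: top-anchored trampoline gap={gap_to_top(dfly)}, "
              f"page trampoline gap={gap_to_top(tramp)}, "
              f"page other-code gap={gap_to_top(other)}")
```

Under these assumptions, padding moves a top-anchored trampoline away from the top of user space, leaves a bottom-loaded trampoline where it was, and pushes any code that follows it in the page closer to the top.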

I finally received the hardware to set up a serial console yesterday, but I haven't had time to install it yet. The reboots that I've seen don't seem to leave any trace in the logs, don't seem to trigger ddb, and don't leave crash dumps.