Date: Tue, 21 Jun 2011 15:32:43 -0400
From: Paul Mather <paul@gromit.dlib.vt.edu>
To: Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc: freebsd-ppc@freebsd.org
Subject: Re: Xserve G5 keeps shutting down
Message-ID: <E5EE3F19-79AB-417C-A7EE-0F95CE9DB921@gromit.dlib.vt.edu>
In-Reply-To: <4DFFDEEE.40200@freebsd.org>
References: <38D89FC6-13F1-4AEF-AF41-0A377EE49DC4@gromit.dlib.vt.edu> <4DFFDEEE.40200@freebsd.org>
On Jun 20, 2011, at 7:59 PM, Nathan Whitehorn wrote:

> On 06/20/11 15:22, Paul Mather wrote:
>> I'm running FreeBSD/powerpc64 -CURRENT on an Xserve G5. With a
>> recent kernel, the system will not stay up for more than a few
>> hours at a time. :-(
>>
>> I have no idea why the machine is shutting off. There is no panic
>> or crash dump and there is no indication in the logs of anything
>> awry. The system just powers down. The times this has happened
>> when I have been there have not indicated anything stressing the
>> system (like all fans racing madly) and oftentimes the system has
>> been relatively idle. (Oddly, it never appears to my knowledge to
>> have shut down when doing something potentially taxing, such as a
>> make -j5 buildworld or the like.)
>>
>> The main thing I have noticed since building this new kernel is
>> that the fans are now controlled automatically, i.e., there is now
>> no need for the tickle-the-fan-controller cron job of yore, meaning
>> the fans won't race when in single user mode (e.g., during an
>> installworld).
>
> If the temperature on any sensor exceeds its maximum value, it will
> cause the machine to shut off. There was at one point a problem with
> some of the sensor drivers that would report erroneous crazy values
> sometimes. Most of the known problems were fixed by andreast a few
> weeks ago, but it looks like you ran into another. My work desktop
> has a ds1775 and a max6690, and has no problems, but not an ad7417,
> so I would guess the problem lies there. Could you try commenting
> out line 116 of /sys/powerpc/powermac/powermac_thermal.c? That will
> cause it to spam the console (and dmesg) about the error,
> identifying the sensor, but not shut off the machine and so both
> keep your server on and let us work out the problem.

I built a new kernel with the shutdown line identified above commented
out. The resultant system stayed up for several hours doing various -j5
buildworld/buildkernels but just now shut down.
:-( Unfortunately, nothing appeared on the console, so there is no
logged reason for the shutdown.

I started up the system again, but it shut down again after a few
minutes of uptime. When I started it up for the third (and last) time,
I managed to grab this output from the temp/fan sysctls before it shut
down (a minute or two after booting up):

paul@backup:/home/paul> sysctl -a | egrep 'dev.*temp|fans'
machdep.manage_fans: 1
dev.max6690.0.%pnpinfo: name=temp-monitor compat=max6690
dev.max6690.0.sensor.sys_ctrlr_ambient.temp: 41.5C
dev.max6690.0.sensor.sys_ctrlr_internal.temp: 50.1C
dev.fcu.0.fans.cpu_a_1.minrpm: 1200
dev.fcu.0.fans.cpu_a_1.maxrpm: 14000
dev.fcu.0.fans.cpu_a_1.rpm: 1984
dev.fcu.0.fans.cpu_a_2.minrpm: 1200
dev.fcu.0.fans.cpu_a_2.maxrpm: 14000
dev.fcu.0.fans.cpu_a_2.rpm: 1984
dev.fcu.0.fans.cpu_a_3.minrpm: 1200
dev.fcu.0.fans.cpu_a_3.maxrpm: 14000
dev.fcu.0.fans.cpu_a_3.rpm: 1984
dev.fcu.0.fans.cpu_b_1.minrpm: 1200
dev.fcu.0.fans.cpu_b_1.maxrpm: 14000
dev.fcu.0.fans.cpu_b_1.rpm: 1984
dev.fcu.0.fans.cpu_b_2.minrpm: 1200
dev.fcu.0.fans.cpu_b_2.maxrpm: 14000
dev.fcu.0.fans.cpu_b_2.rpm: 1984
dev.fcu.0.fans.cpu_b_3.minrpm: 1200
dev.fcu.0.fans.cpu_b_3.maxrpm: 14000
dev.fcu.0.fans.cpu_b_3.rpm: 1984
dev.fcu.0.fans.sys_ctrlr_fan.minpwm: 40
dev.fcu.0.fans.sys_ctrlr_fan.maxpwm: 100
dev.fcu.0.fans.sys_ctrlr_fan.pwm: 54
dev.fcu.0.fans.sys_ctrlr_fan.rpm: 11264
dev.fcu.0.fans.pci_fan.minpwm: 40
dev.fcu.0.fans.pci_fan.maxpwm: 100
dev.fcu.0.fans.pci_fan.pwm: 48
dev.fcu.0.fans.pci_fan.rpm: 9792
dev.ad7417.0.sensor.cpu_a_ad7417_amb.temp: 36.7C
dev.ad7417.0.sensor.cpu_a_diode_temp.temp: 53.8C
dev.ad7417.1.sensor.cpu_b_ad7417_amb.temp: 32.0C
dev.ad7417.1.sensor.cpu_b_diode_temp.temp: 52.6C
dev.ds1775.0.%pnpinfo: name=temp-monitor compat=lm75

The cpu_{a,b}_diode_temp temperatures were higher during the buildworld
(63--67C) and it stayed up at that time.

I'm flummoxed at this point as to what is responsible for the shutdowns.
Are there any other hardware monitoring-related shutdowns in the kernel
code? The funny thing about the ad7417 device is that I only recently
added it to my kernel config file as I noticed it had appeared in
GENERIC.

Tomorrow I'll build a GENERIC kernel with the shutdown line commented
out, and see if I have any better luck with that.

Cheers,

Paul.
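
Since nothing reaches the console before the power-off, one way to catch a bad sensor reading might be to log the temperature sysctls to disk at short intervals and look at the last entries after the next shutdown. The following is only a sketch under assumptions: the function names (extract_temps, hottest, log_temps), the log path, and the interval are hypothetical and chosen for illustration, and it assumes the `name.temp: 53.8C` output format shown in the sysctl listing above. It will not help if the erroneous value appears and triggers the shutdown between two samples.

```shell
#!/bin/sh
# Sketch only: function names, log path and interval are hypothetical.

# Read `sysctl -a`-style lines on stdin and print "<sysctl name> <degrees C>"
# for every node ending in ".temp" (fan and %pnpinfo lines are skipped).
extract_temps() {
    awk -F': ' '/\.temp: / { sub(/C$/, "", $2); print $1, $2 }'
}

# Read extract_temps output on stdin and print the hottest reading.
hottest() {
    sort -k2,2nr | head -n 1
}

# Append a timestamped snapshot of all temperatures to a log file every
# <interval> seconds, calling sync so the data is more likely to be on
# disk if the machine powers off abruptly.
log_temps() {
    logfile=$1
    interval=$2
    while :; do
        {
            date '+%Y-%m-%d %H:%M:%S'
            sysctl -a | extract_temps
        } >> "$logfile"
        sync
        sleep "$interval"
    done
}

# Example invocation (not run automatically):
#   log_temps /var/tmp/temps.log 5 &
```

After a shutdown, the tail of the log shows the last values each sensor reported, and piping a snapshot through hottest picks out the highest one for comparison against the sensor's maximum.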
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?E5EE3F19-79AB-417C-A7EE-0F95CE9DB921>