Date: Fri, 18 Jan 2013 05:38:30 +0100
From: Matthew Rezny <mrezny@hexaneinc.com>
To: <freebsd-ppc@freebsd.org>
Subject: Re: PowerMac G5 spurious sensor readings
Message-ID: <51169.1358483910@hexaneinc.com>
On Thu 13/01/17 21:59 , Matthew Rezny wrote:
>I have a G5 of the first model (PowerMac7,2) on which I've been using FreeBSD/ppc64 for over a year. Today, it suddenly rebooted. Not the first time by any means, but this is the first time I found the following log message:
>Jan 17 17:32:19 powermac kernel: WARNING: Current temperature (MLB MAX6690 AMB:127.8 C) exceeds critical temperature (80.0 C)! Shutting down!
>
>This is the first time I have seen such a message. After reboot, that sensor shows a temperature near 30C, which seems appropriate. The reading of 127.8C looks suspiciously like a max value. My only guess is there was a bad read that resulted in the sensor value going over the threshold. That raises a question in my mind as to whether there is any filtering or sanity checking of the data. Could a single bad read cause the threshold to be exceeded and trigger shutdown immediately, or would the excessive value have to be returned from that sensor multiple times for it to be believed and acted upon?
>
>$ uname -a
>FreeBSD powermac 9.1-RC1 FreeBSD 9.1-RC1 #0: Thu Aug 16 00:43:39 UTC 2012 root@anacreon.physics.wisc.edu:/usr/obj/usr/src/sys/GENERIC64 powerpc
>
>The build is a bit old, though I wouldn't expect too much change to the code in question since then. I will update to 9.1-RELEASE or -STABLE in the next few days, but as this is a problem that has happened once in over a year, I wouldn't call it resolved just by a quick failure to reproduce after updating.
>
>I was already planning to do an update after the box has completed its current task. I noticed a problem with excessive output causing the console to hang. A couple days ago I found the machine apparently hung in that the keyboard and mouse were not responsive, but I found it was still alive on the network and I could ssh in to reboot.
>The only clues were no buffer space for dmesg to output anything before reboot, and a rather full /var/log/messages file which had exhausted the drive.
>Under the same workload (and after freeing some drive space), the problem reoccurred in a matter of hours, but this time with me watching. While running ddrescue against a drive with some bad sectors, read errors flood the console in spurts. When some dozens of read errors are displayed at once, the console scrolls whole pages by in a fraction of a second, and then goes dead. Messages that should go to console are not shown on screen but are in the log. Attempts to switch virtual console or to reboot are not successful, but ssh access continues to work and the box is clearly still processing other workloads. The only signs of life from the console are the messages about flushing buffers just before completion of the reboot commanded via ssh.
>

Just a few hours later, it strikes again.

Jan 17 23:06:11 powermac kernel: WARNING: Current temperature (MLB MAX6690 AMB: 127.0 C) exceeds critical temperature (80.0 C)! Shutting down!

I took a peek in smu.c and powermac_thermal.c. In the former, smu_sensor_read() has a check for an error returned from smu_run_cmd() but no checks on the returned data. In the latter, pmac_therm_manage_fans() invokes smu_sensor_read() and considers the returned value as valid if greater than zero. No other sanity checks are performed.

Looking at the datasheet[1] for max6690, I see that 127C is the maximum readable temperature, which is represented as 01111111. The value 10000000 is documented as representing a diode fault. As there is no upper range check, the diode fault condition will be interpreted as slightly over 127C. I think it would be appropriate to treat as invalid any raw sensor value with the MSB set.
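
The MSB check proposed above could look roughly like the sketch below. This is not the code in the FreeBSD tree: the function name, the macro names, and the use of -1 as the error value are assumptions made for illustration, and only the whole-degree byte described in the datasheet table is handled.

```c
#include <stdint.h>

/* Hypothetical names, chosen for this sketch only. */
#define MAX6690_FAULT_MSB	0x80	/* 10000000: documented diode fault */
#define SENSOR_READ_INVALID	(-1)	/* assumed error value */

/*
 * Convert the raw 8-bit temperature byte to whole degrees C.
 * Any value with the MSB set is rejected instead of being
 * interpreted as a temperature above 127 C.
 */
static int
max6690_raw_to_celsius(uint8_t raw)
{
	if (raw & MAX6690_FAULT_MSB)
		return (SENSOR_READ_INVALID);
	return ((int)raw);	/* 00000000..01111111 maps to 0..127 C */
}
```

With this in place a raw value of 01111111 still reads as 127, while 10000000 is reported as an error rather than as a temperature above the critical threshold.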
Additionally, the check on line 105 of pmac_therm_manage_fans() should really be "if (temp >= 0)" rather than just "if (temp > 0)" as a value of 0 is a valid value for zero degrees and all actual errors are represented as a value of -1.

I have not looked at the datasheets for other relevant sensors, but given that there are no range checks in any of the cases in smu_sensor_read(), I currently consider them all suspect pending review.

[1] http://datasheets.maximintegrated.com/en/ds/MAX6690.pdf (Page 11, Table 2)
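
To make the two ideas in this message concrete, here is a minimal sketch that combines the "temp >= 0" validity test with the filtering asked about in the quoted text, where a reading must exceed the critical temperature several times in a row before it is acted upon. Everything here is hypothetical: the struct, the function name, and the count of 3 consecutive readings are assumptions for illustration, not what powermac_thermal.c does.

```c
#include <stdbool.h>

/* Assumed filter depth; the value 3 is arbitrary. */
#define CRITICAL_READS_REQUIRED	3

/* Hypothetical per-sensor state for this sketch. */
struct sensor_state {
	int critical_temp;	/* shutdown threshold, same units as readings */
	int over_count;		/* consecutive valid readings above threshold */
};

/*
 * Feed one reading into the filter. Returns true only when the
 * threshold has been exceeded CRITICAL_READS_REQUIRED times in a row.
 */
static bool
sensor_update(struct sensor_state *st, int temp)
{
	/* Errors are reported as -1; 0 is a legitimate reading. */
	if (temp < 0)
		return (false);	/* invalid read: counter left unchanged */

	if (temp > st->critical_temp)
		st->over_count++;
	else
		st->over_count = 0;	/* a normal reading resets the filter */

	return (st->over_count >= CRITICAL_READS_REQUIRED);
}
```

Under these assumptions a single bogus 127 C reading followed by a normal 30 C reading never triggers a shutdown, while a sensor that stays above the threshold still does after three polls. Whether an invalid read should reset the counter or leave it alone is a design choice; this sketch leaves it alone.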