Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 22 Feb 2013 22:09:18 +0100
From:      Matthew Rezny <mrezny@hexaneinc.com>
To:        <freebsd-ppc@freebsd.org>
Subject:   Re: PowerMac G5 spurious sensor readings
Message-ID:  <20130222220918.00005998@unknown>
In-Reply-To: <51169.1358483910@hexaneinc.com>
References:  <51169.1358483910@hexaneinc.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 18 Jan 2013 05:38:30 +0100
Matthew Rezny <mrezny@hexaneinc.com> wrote:

> On Thu 13/01/17 21:59 , Matthew Rezny  wrote::
> 
> >I have a G5 of the first model (PowerMac7,2) on which I've been
> >using FreeBSD/ppc64 for over a year. Today, it suddenly rebooted.
> >Not the first time by any means, but this is the first time I found
> >the following log message: Jan 17 17:32:19 powermac kernel: WARNING:
> >Current temperature (MLB MAX6690 AMB:127.8 C) exceeds critical
> >temperature (80.0 C)! Shutting down!
> >
> >This is the first time I have seen such a message. After reboot,
> >that sensor shows a temperature near 30C, which seems appropriate.
> >The reading of 127.8C looks suspiciously like a max value. My only
> >guess is there was a bad read that resulted in the sensor value
> >going over the threshold. That raises a question in my mind as to
> >whether there is any filtering or sanity checking of the data. Could
> >a single bad read cause the threshold to be exceeded and trigger
> >shutdown immediately, or would the excessive value have to be
> >returned from that sensor multiple times for it to be believed an
> >acted upon?
> >
> >$ uname -a
> >FreeBSD powermac 9.1-RC1 FreeBSD 9.1-RC1 #0: Thu Aug 16 00:43:39 UTC
> >2012
> >root@anacreon.physics.wisc.edu:/usr/obj/usr/src/sys/GENERIC64
> >powerpc
> >
> >The build is a bit old, though I wouldn't expect too much change to
> >the code in question since then. I will update to 9.1-RELEASE or
> >-STABLE in the next few days, but as this is a problem that has
> >happened once in over a year, I wouldn't call it resolved just by a
> >quick failure to reproduce after updating.
> >
> >I was already planning to do an update after the box has completed
> >it's current task. I noticed a problem with excessive output causing
> >the console to hang. A couple days ago I found the machine
> >apparently hung in that the keyboard and mouse were not responsive,
> >but I found it was still alive on the network and I could ssh in to
> >reboot. The only clues were no buffer space for dmesg to output
> >anything before reboot, and a rather full /var/log/messages file
> >which had exhausted the drive. Under the same workload (and after
> >freeing some drive space), the problem reoccurred in a matter of
> >hours, but this time with me watching. While running ddrescue
> >against a drive with some bad sectors, read errors flood the console
> >in spurts. When some dozens of read errors are displayed at once,
> >the console scrolls whole pages by in a fraction of a second, and
> >then goes dead. Messages that should go to console are not shown on
> >screen but are in the log. Attempts to switch virtual console or to
> >reboot are not successful, but ssh access continues to work and the
> >box is clearly still processing other workloads. The only sign of
> >life from the console are the messages about flushing buffers just
> >before completion of the reboot commanded via ssh.
> >
> 
> Just a few hours later, it strikes again.
> Jan 17 23:06:11 powermac kernel: WARNING: Current temperature (MLB
> MAX6690 AMB: 127.0 C) exceeds critical temperature (80.0 C)! Shutting
> down!
> 
> I took a peek in smu.c and powermac_thermal.c. In the former,
> smu_sensor_read() has a check for an error returned from
> smu_run_cmd() but no checks on the returned data. In the later,
> pmac_therm_manage_fans() invokes smu_sensor_read() and considers the
> returned value as valid if greater than zero. No other sanity checks
> are performed.
> 
> Looking at the datasheet[1] for max6690, I see that 127C is the
> maximum readable temperature, which is represented as 01111111. The
> value 10000000 is documented as representing a diode fault. As there
> is no upper range check, the diode fault condition will be
> interpreted as slightly over 127C. I think it would be appropriate to
> treat as invalid any raw sensor value with the MSB set. Additionally,
> the check on line 105 of pmac_therm_manage_fans should really be "if
> (temp >= 0)" rather than just "if (temp > 0)" as a value of 0 is a
> valid value for zero degrees and all actual errors are represented as
> a value of -1.
> 
> I have not looked at the datasheets for other relevant sensors, but
> being that there are no range checks in any of the cases in
> smu_sensor_read(), I currently consider them all suspect pending
> review.
> 
> [1] http://datasheets.maximintegrated.com/en/ds/MAX6690.pdf (Page 11,
> Table 2)
> 

It has been some time since I first looked at this, but something in
the back of my mind said my first glance at it was flawed, so I
recently revisited the matter.

I realize I was looking in the wrong place previously. After reading
through powermac_thermal.c I had done a search for sensor_read() and
found a match in smu.c so I started from there. However, there is no
SMU on this early model G5. Searching more than just the same
directory, I found sensor_read() occurs in many places. The pertinent
function is really in sys/dev/iicbus/max6690.c as max6690_sensor_read().

Looking at the correct function, I see multiple problems. max6690_read()
is called twice, both times storing the return value in err. The value
of err is checked only once, after both calls, so if the first call
returns an error but the second succeeds, the error indicator is
overwritten and calculations will be done with bad data. The check for
err < 0 should be done after each call to max6690_read(). After that
basic check, which would only indicate an error if there was a problem
with the I2C bus transaction, computations are done without any further
data validation. As stated before, the sensor may signal an error
condition by returning the value 10000000. There is no check for that
reserved value, so when the sensor attempts to indicate an error, we
end up with a temperature of 127C being returned from
max6690_sensor_read(). That value causes havoc for obvious reasons.

This should be easy to clean up but I don't have time at the moment to
do the testing of the changes on the one machine to which this is
applicable. If nobody else deals with it, I'll get around to doing so
and post a patch when that time comes.




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130222220918.00005998>