Date:      Fri, 18 Jan 2013 05:38:30 +0100
From:      Matthew Rezny <mrezny@hexaneinc.com>
To:        <freebsd-ppc@freebsd.org>
Subject:   Re: PowerMac G5 spurious sensor readings
Message-ID:  <51169.1358483910@hexaneinc.com>

On Thu 13/01/17 21:59, Matthew Rezny wrote:

>I have a G5 of the first model (PowerMac7,2) on which I've been using
>FreeBSD/ppc64 for over a year. Today, it suddenly rebooted. Not the first
>time by any means, but this is the first time I found the following log
>message:
>Jan 17 17:32:19 powermac kernel: WARNING: Current temperature (MLB MAX6690 AMB:127.8 C) exceeds critical temperature (80.0 C)! Shutting down!
>
>This is the first time I have seen such a message. After reboot, that
>sensor shows a temperature near 30C, which seems appropriate. The reading
>of 127.8C looks suspiciously like a max value. My only guess is there was
>a bad read that resulted in the sensor value going over the threshold.
>That raises a question in my mind as to whether there is any filtering or
>sanity checking of the data. Could a single bad read cause the threshold
>to be exceeded and trigger shutdown immediately, or would the excessive
>value have to be returned from that sensor multiple times for it to be
>believed and acted upon?
>
>$ uname -a
>FreeBSD powermac 9.1-RC1 FreeBSD 9.1-RC1 #0: Thu Aug 16 00:43:39 UTC 2012     root@anacreon.physics.wisc.edu:/usr/obj/usr/src/sys/GENERIC64  powerpc
>
>The build is a bit old, though I wouldn't expect too much change to the
>code in question since then. I will update to 9.1-RELEASE or -STABLE in
>the next few days, but as this is a problem that has happened once in
>over a year, I wouldn't call it resolved just by a quick failure to
>reproduce after updating.
>
>I was already planning to do an update after the box has completed its
>current task. I noticed a problem with excessive output causing the
>console to hang. A couple days ago I found the machine apparently hung in
>that the keyboard and mouse were not responsive, but I found it was still
>alive on the network and I could ssh in to reboot. The only clues were no
>buffer space for dmesg to output anything before reboot, and a rather
>full /var/log/messages file which had exhausted the drive. Under the same
>workload (and after freeing some drive space), the problem reoccurred in
>a matter of hours, but this time with me watching. While running ddrescue
>against a drive with some bad sectors, read errors flood the console in
>spurts. When some dozens of read errors are displayed at once, the
>console scrolls whole pages by in a fraction of a second, and then goes
>dead. Messages that should go to console are not shown on screen but are
>in the log. Attempts to switch virtual console or to reboot are not
>successful, but ssh access continues to work and the box is clearly still
>processing other workloads. The only signs of life from the console are
>the messages about flushing buffers just before completion of the reboot
>commanded via ssh.
>

Just a few hours later, it strikes again.
Jan 17 23:06:11 powermac kernel: WARNING: Current temperature (MLB MAX6690 AMB: 127.0 C) exceeds critical temperature (80.0 C)! Shutting down!

I took a peek in smu.c and powermac_thermal.c. In the former,
smu_sensor_read() has a check for an error returned from smu_run_cmd() but
no checks on the returned data. In the latter, pmac_therm_manage_fans()
invokes smu_sensor_read() and considers the returned value as valid if
greater than zero. No other sanity checks are performed.
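
As a rough illustration of the kind of filtering asked about in the quoted
message, here is a minimal sketch of a check that only reports an
over-temperature condition after several consecutive readings exceed the
threshold. This is not code from powermac_thermal.c; the struct, the
function name, and the count of 3 are all hypothetical and chosen only for
illustration.

```c
#include <stdbool.h>

/* Assumed value for illustration; a real driver would need to tune this. */
#define CRITICAL_READS_REQUIRED	3

/* Hypothetical per-sensor filter state, not a FreeBSD structure. */
struct overtemp_filter {
	int critical_temp;	/* threshold, in the same units as the readings */
	int consecutive;	/* over-threshold readings seen in a row */
};

/*
 * Feed one reading into the filter. Returns true only once
 * CRITICAL_READS_REQUIRED readings in a row have exceeded the threshold.
 * A negative reading is treated as a read error: it never triggers and
 * leaves the running count unchanged.
 */
static bool
overtemp_filter_update(struct overtemp_filter *f, int temp)
{
	if (temp < 0)
		return (false);

	if (temp > f->critical_temp)
		f->consecutive++;
	else
		f->consecutive = 0;

	return (f->consecutive >= CRITICAL_READS_REQUIRED);
}
```

With a filter like this, a single spurious 127 C reading followed by a
normal 30 C reading resets the count and does not cause a shutdown.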

Looking at the datasheet[1] for the MAX6690, I see that 127C is the maximum
readable temperature, which is represented as 01111111. The value 10000000
is documented as representing a diode fault. As there is no upper range
check, the diode fault condition will be interpreted as slightly over 127C.
I think it would be appropriate to treat as invalid any raw sensor value
with the MSB set. Additionally, the check on line 105 of
pmac_therm_manage_fans() should really be "if (temp >= 0)" rather than just
"if (temp > 0)", as a value of 0 is a valid value for zero degrees and all
actual errors are represented as a value of -1.
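
The proposed check could look something like the following. This is only a
sketch under the assumptions stated in this message (an 8-bit raw value,
01111111 meaning 127C, 10000000 meaning diode fault, and -1 as the error
return). The function and macro names are hypothetical and do not appear in
the FreeBSD sources, and the sketch ignores any fractional bits or unit
conversion the real driver performs.

```c
#include <stdint.h>

/* Error return value, as described in this message. */
#define SENSOR_READ_ERROR	(-1)

/*
 * Hypothetical helper: convert a raw 8-bit temperature byte to whole
 * degrees Celsius. Any value with the MSB set (including the 0x80 diode
 * fault code) is rejected instead of being read as a large temperature.
 */
static int
max6690_raw_to_celsius(uint8_t raw)
{
	if (raw & 0x80)
		return (SENSOR_READ_ERROR);
	return ((int)raw);
}

/*
 * Hypothetical validity test: zero is a legitimate reading under the -1
 * error convention, so only negative values are treated as errors.
 */
static int
sensor_reading_is_valid(int temp)
{
	return (temp >= 0);
}
```

Here a raw value of 0x7f decodes to 127 and is accepted, 0x00 decodes to 0
and is also accepted, and 0x80 is reported as an error rather than as a
reading just above 127C.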

I have not looked at the datasheets for other relevant sensors, but given
that there are no range checks in any of the cases in smu_sensor_read(), I
currently consider them all suspect pending review.

[1] http://datasheets.maximintegrated.com/en/ds/MAX6690.pdf (Page 11, Table 2)
