Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 12 Oct 2014 16:51:18 -0700
From:      Justin Hibbits <chmeeedalf@gmail.com>
To:        Matthew Rezny <mrezny@hexaneinc.com>
Cc:        freebsd-ppc@freebsd.org
Subject:   Re: PowerMac G5 spurious sensor readings
Message-ID:  <20141012165118.0b44f322@zhabar.attlocal.net>
In-Reply-To: <20130222220918.00005998@unknown>
References:  <51169.1358483910@hexaneinc.com> <20130222220918.00005998@unknown>

next in thread | previous in thread | raw e-mail | index | archive | help
--Sig_/81.W0_NboPiiUUlKsNAW0Y8
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

On Fri, 22 Feb 2013 22:09:18 +0100
Matthew Rezny <mrezny@hexaneinc.com> wrote:

> On Fri, 18 Jan 2013 05:38:30 +0100
> Matthew Rezny <mrezny@hexaneinc.com> wrote:
>=20
> > On Thu 13/01/17 21:59 , Matthew Rezny  wrote::
> >=20
> > >I have a G5 of the first model (PowerMac7,2) on which I've been
> > >using FreeBSD/ppc64 for over a year. Today, it suddenly rebooted.
> > >Not the first time by any means, but this is the first time I found
> > >the following log message: Jan 17 17:32:19 powermac kernel:
> > >WARNING: Current temperature (MLB MAX6690 AMB:127.8 C) exceeds
> > >critical temperature (80.0 C)! Shutting down!
> > >
> > >This is the first time I have seen such a message. After reboot,
> > >that sensor shows a temperature near 30C, which seems appropriate.
> > >The reading of 127.8C looks suspiciously like a max value. My only
> > >guess is there was a bad read that resulted in the sensor value
> > >going over the threshold. That raises a question in my mind as to
> > >whether there is any filtering or sanity checking of the data.
> > >Could a single bad read cause the threshold to be exceeded and
> > >trigger shutdown immediately, or would the excessive value have to
> > >be returned from that sensor multiple times for it to be believed
> > >an acted upon?
> > >
> > >$ uname -a
> > >FreeBSD powermac 9.1-RC1 FreeBSD 9.1-RC1 #0: Thu Aug 16 00:43:39
> > >UTC 2012
> > >root@anacreon.physics.wisc.edu:/usr/obj/usr/src/sys/GENERIC64
> > >powerpc
> > >
> > >The build is a bit old, though I wouldn't expect too much change to
> > >the code in question since then. I will update to 9.1-RELEASE or
> > >-STABLE in the next few days, but as this is a problem that has
> > >happened once in over a year, I wouldn't call it resolved just by a
> > >quick failure to reproduce after updating.
> > >
> > >I was already planning to do an update after the box has completed
> > >it's current task. I noticed a problem with excessive output
> > >causing the console to hang. A couple days ago I found the machine
> > >apparently hung in that the keyboard and mouse were not responsive,
> > >but I found it was still alive on the network and I could ssh in to
> > >reboot. The only clues were no buffer space for dmesg to output
> > >anything before reboot, and a rather full /var/log/messages file
> > >which had exhausted the drive. Under the same workload (and after
> > >freeing some drive space), the problem reoccurred in a matter of
> > >hours, but this time with me watching. While running ddrescue
> > >against a drive with some bad sectors, read errors flood the
> > >console in spurts. When some dozens of read errors are displayed
> > >at once, the console scrolls whole pages by in a fraction of a
> > >second, and then goes dead. Messages that should go to console are
> > >not shown on screen but are in the log. Attempts to switch virtual
> > >console or to reboot are not successful, but ssh access continues
> > >to work and the box is clearly still processing other workloads.
> > >The only sign of life from the console are the messages about
> > >flushing buffers just before completion of the reboot commanded
> > >via ssh.
> > >
> >=20
> > Just a few hours later, it strikes again.
> > Jan 17 23:06:11 powermac kernel: WARNING: Current temperature (MLB
> > MAX6690 AMB: 127.0 C) exceeds critical temperature (80.0 C)!
> > Shutting down!
> >=20
> > I took a peek in smu.c and powermac_thermal.c. In the former,
> > smu_sensor_read() has a check for an error returned from
> > smu_run_cmd() but no checks on the returned data. In the later,
> > pmac_therm_manage_fans() invokes smu_sensor_read() and considers the
> > returned value as valid if greater than zero. No other sanity checks
> > are performed.
> >=20
> > Looking at the datasheet[1] for max6690, I see that 127C is the
> > maximum readable temperature, which is represented as 01111111. The
> > value 10000000 is documented as representing a diode fault. As there
> > is no upper range check, the diode fault condition will be
> > interpreted as slightly over 127C. I think it would be appropriate
> > to treat as invalid any raw sensor value with the MSB set.
> > Additionally, the check on line 105 of pmac_therm_manage_fans
> > should really be "if (temp >=3D 0)" rather than just "if (temp > 0)"
> > as a value of 0 is a valid value for zero degrees and all actual
> > errors are represented as a value of -1.
> >=20
> > I have not looked at the datasheets for other relevant sensors, but
> > being that there are no range checks in any of the cases in
> > smu_sensor_read(), I currently consider them all suspect pending
> > review.
> >=20
> > [1] http://datasheets.maximintegrated.com/en/ds/MAX6690.pdf (Page
> > 11, Table 2)
> >=20
>=20
> It has been some time since I first looked at this, but something in
> the back of my mind said my first glance at it was flawed, so I
> recently revisited the matter.
>=20
> I realize I was looking in the wrong place previously. After reading
> through powermac_thermal.c I had done a search for sensor_read() and
> found a match in smu.c so I started from there. However, there is no
> SMU on this early model G5. Searching more than just the same
> directory, I found sensor_read() occurs in many places. The pertinent
> function is really in sys/dev/iicbus/max6690.c as
> max6690_sensor_read().
>=20
> Looking at the correct function, I see multiple problems.
> max6690_read() is called twice, both times storing the return value
> in err. The value of err is checked only once, after both calls, so
> if the first call returns an error but the second succeeds, the error
> indicator is overwritten and calculations will be done with bad data.
> The check for err < 0 should be done after each call to
> max6690_read(). After that basic check, which would only indicate an
> error if there was a problem with the I2C bus transaction,
> computations are done without any further data validation. As stated
> before, the sensor may signal an error condition by returning the
> value 10000000. There is no check for that reserved value, so when
> the sensor attempts to indicate an error, we end up with a
> temperature of 127C being returned from max6690_sensor_read(). That
> value causes havoc for obvious reasons.
>=20
> This should be easy to clean up but I don't have time at the moment to
> do the testing of the changes on the one machine to which this is
> applicable. If nobody else deals with it, I'll get around to doing so
> and post a patch when that time comes.

Found this in my email archives.  I committed the error
checking for the integer part as r273016.

Thanks!

- Justin

--Sig_/81.W0_NboPiiUUlKsNAW0Y8
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBAgAGBQJUOxP4AAoJEDDHhY43vi25cjkH/joyaGFo6yKPQFwBjKhI/zrw
sAZfrJcclR61KM47/WwBsfhL8TDtHelNVkMyBIxhlAHelvzQGsSQooNGu5anJ2cw
J9LKoqStOGuxQx0KErbLsD9UN+7eMD/l0OlKIW99RLqzRkpNm5kwyAPfgxeRIzQa
p13rpDAMXMNFt8DYGfKVi/9zJz5ZrbHBqumVjRlhpIEL/sHfJDeg2cU2OE4YQHxG
W4g/PfleMh2iZNTC/OeQVcDu6ChhDbX9jNyOEanKgIQwckIuTQX8ZBQ1OgURBYTU
FyYG7ttH1sthwujSsp8DKrazxHhzIFfB59H9D/zN6+gUPLHFQkaUsZRJBDDn0jY=
=F88h
-----END PGP SIGNATURE-----

--Sig_/81.W0_NboPiiUUlKsNAW0Y8--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20141012165118.0b44f322>