Date: Sun, 12 Oct 2014 16:51:18 -0700 From: Justin Hibbits <chmeeedalf@gmail.com> To: Matthew Rezny <mrezny@hexaneinc.com> Cc: freebsd-ppc@freebsd.org Subject: Re: PowerMac G5 spurious sensor readings Message-ID: <20141012165118.0b44f322@zhabar.attlocal.net> In-Reply-To: <20130222220918.00005998@unknown> References: <51169.1358483910@hexaneinc.com> <20130222220918.00005998@unknown>
next in thread | previous in thread | raw e-mail | index | archive | help
--Sig_/81.W0_NboPiiUUlKsNAW0Y8 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable On Fri, 22 Feb 2013 22:09:18 +0100 Matthew Rezny <mrezny@hexaneinc.com> wrote: > On Fri, 18 Jan 2013 05:38:30 +0100 > Matthew Rezny <mrezny@hexaneinc.com> wrote: >=20 > > On Thu 13/01/17 21:59 , Matthew Rezny wrote:: > >=20 > > >I have a G5 of the first model (PowerMac7,2) on which I've been > > >using FreeBSD/ppc64 for over a year. Today, it suddenly rebooted. > > >Not the first time by any means, but this is the first time I found > > >the following log message: Jan 17 17:32:19 powermac kernel: > > >WARNING: Current temperature (MLB MAX6690 AMB:127.8 C) exceeds > > >critical temperature (80.0 C)! Shutting down! > > > > > >This is the first time I have seen such a message. After reboot, > > >that sensor shows a temperature near 30C, which seems appropriate. > > >The reading of 127.8C looks suspiciously like a max value. My only > > >guess is there was a bad read that resulted in the sensor value > > >going over the threshold. That raises a question in my mind as to > > >whether there is any filtering or sanity checking of the data. > > >Could a single bad read cause the threshold to be exceeded and > > >trigger shutdown immediately, or would the excessive value have to > > >be returned from that sensor multiple times for it to be believed > > >an acted upon? > > > > > >$ uname -a > > >FreeBSD powermac 9.1-RC1 FreeBSD 9.1-RC1 #0: Thu Aug 16 00:43:39 > > >UTC 2012 > > >root@anacreon.physics.wisc.edu:/usr/obj/usr/src/sys/GENERIC64 > > >powerpc > > > > > >The build is a bit old, though I wouldn't expect too much change to > > >the code in question since then. I will update to 9.1-RELEASE or > > >-STABLE in the next few days, but as this is a problem that has > > >happened once in over a year, I wouldn't call it resolved just by a > > >quick failure to reproduce after updating. > > > > > >I was already planning to do an update after the box has completed > > >it's current task. I noticed a problem with excessive output > > >causing the console to hang. A couple days ago I found the machine > > >apparently hung in that the keyboard and mouse were not responsive, > > >but I found it was still alive on the network and I could ssh in to > > >reboot. The only clues were no buffer space for dmesg to output > > >anything before reboot, and a rather full /var/log/messages file > > >which had exhausted the drive. Under the same workload (and after > > >freeing some drive space), the problem reoccurred in a matter of > > >hours, but this time with me watching. While running ddrescue > > >against a drive with some bad sectors, read errors flood the > > >console in spurts. When some dozens of read errors are displayed > > >at once, the console scrolls whole pages by in a fraction of a > > >second, and then goes dead. Messages that should go to console are > > >not shown on screen but are in the log. Attempts to switch virtual > > >console or to reboot are not successful, but ssh access continues > > >to work and the box is clearly still processing other workloads. > > >The only sign of life from the console are the messages about > > >flushing buffers just before completion of the reboot commanded > > >via ssh. > > > > >=20 > > Just a few hours later, it strikes again. > > Jan 17 23:06:11 powermac kernel: WARNING: Current temperature (MLB > > MAX6690 AMB: 127.0 C) exceeds critical temperature (80.0 C)! > > Shutting down! > >=20 > > I took a peek in smu.c and powermac_thermal.c. In the former, > > smu_sensor_read() has a check for an error returned from > > smu_run_cmd() but no checks on the returned data. In the later, > > pmac_therm_manage_fans() invokes smu_sensor_read() and considers the > > returned value as valid if greater than zero. No other sanity checks > > are performed. > >=20 > > Looking at the datasheet[1] for max6690, I see that 127C is the > > maximum readable temperature, which is represented as 01111111. The > > value 10000000 is documented as representing a diode fault. As there > > is no upper range check, the diode fault condition will be > > interpreted as slightly over 127C. I think it would be appropriate > > to treat as invalid any raw sensor value with the MSB set. > > Additionally, the check on line 105 of pmac_therm_manage_fans > > should really be "if (temp >=3D 0)" rather than just "if (temp > 0)" > > as a value of 0 is a valid value for zero degrees and all actual > > errors are represented as a value of -1. > >=20 > > I have not looked at the datasheets for other relevant sensors, but > > being that there are no range checks in any of the cases in > > smu_sensor_read(), I currently consider them all suspect pending > > review. > >=20 > > [1] http://datasheets.maximintegrated.com/en/ds/MAX6690.pdf (Page > > 11, Table 2) > >=20 >=20 > It has been some time since I first looked at this, but something in > the back of my mind said my first glance at it was flawed, so I > recently revisited the matter. >=20 > I realize I was looking in the wrong place previously. After reading > through powermac_thermal.c I had done a search for sensor_read() and > found a match in smu.c so I started from there. However, there is no > SMU on this early model G5. Searching more than just the same > directory, I found sensor_read() occurs in many places. The pertinent > function is really in sys/dev/iicbus/max6690.c as > max6690_sensor_read(). >=20 > Looking at the correct function, I see multiple problems. > max6690_read() is called twice, both times storing the return value > in err. The value of err is checked only once, after both calls, so > if the first call returns an error but the second succeeds, the error > indicator is overwritten and calculations will be done with bad data. > The check for err < 0 should be done after each call to > max6690_read(). After that basic check, which would only indicate an > error if there was a problem with the I2C bus transaction, > computations are done without any further data validation. As stated > before, the sensor may signal an error condition by returning the > value 10000000. There is no check for that reserved value, so when > the sensor attempts to indicate an error, we end up with a > temperature of 127C being returned from max6690_sensor_read(). That > value causes havoc for obvious reasons. >=20 > This should be easy to clean up but I don't have time at the moment to > do the testing of the changes on the one machine to which this is > applicable. If nobody else deals with it, I'll get around to doing so > and post a patch when that time comes. Found this in my email archives. I committed the error checking for the integer part as r273016. Thanks! - Justin --Sig_/81.W0_NboPiiUUlKsNAW0Y8 Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAEBAgAGBQJUOxP4AAoJEDDHhY43vi25cjkH/joyaGFo6yKPQFwBjKhI/zrw sAZfrJcclR61KM47/WwBsfhL8TDtHelNVkMyBIxhlAHelvzQGsSQooNGu5anJ2cw J9LKoqStOGuxQx0KErbLsD9UN+7eMD/l0OlKIW99RLqzRkpNm5kwyAPfgxeRIzQa p13rpDAMXMNFt8DYGfKVi/9zJz5ZrbHBqumVjRlhpIEL/sHfJDeg2cU2OE4YQHxG W4g/PfleMh2iZNTC/OeQVcDu6ChhDbX9jNyOEanKgIQwckIuTQX8ZBQ1OgURBYTU FyYG7ttH1sthwujSsp8DKrazxHhzIFfB59H9D/zN6+gUPLHFQkaUsZRJBDDn0jY= =F88h -----END PGP SIGNATURE----- --Sig_/81.W0_NboPiiUUlKsNAW0Y8--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20141012165118.0b44f322>