Date: Fri, 12 Aug 2016 11:54:38 -0700 From: John Baldwin <jhb@freebsd.org> To: freebsd-hackers@freebsd.org Cc: David Wolfskill <david@catwhisker.org> Subject: Re: "ipmi0: KCS..." whines Message-ID: <2855524.PakqtZoDR6@ralph.baldwin.cx> In-Reply-To: <20160811175409.GW1112@albert.catwhisker.org> References: <20160811175409.GW1112@albert.catwhisker.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Thursday, August 11, 2016 10:54:09 AM David Wolfskill wrote: > I'm trying to figure out how to cope with some messages: > > ... kernel: ipmi0: KCS: Failed to read completion code > ... kernel: ipmi0: KCS error: ff > > that recently started showing up on some machines at work (which run > somethinng based on FreeBSD). > > As the messages started showing up shortly after a colleague committed > some code that would issue "ipmitool dcmi power reading" periodically, I > tried issuing the command myself... to no (initial) effect. > > I then tried doing it in a loop: > > foreach n (`jot 10`) > ipmitool dcmi power reading > end > > Nothing. But then, 10 isn't a very big number, so I tried again with 32 > -- and got a few. > > That said, I have seen no evidence of data corruption or mis-reporting > that coincides with the messages being issued: at this point, the > messages seem to merely be annoying, and there's nothing I can do to > stop getting them (except to not issue "ipmitool dcmi power reading", > perhaps). > > Also, I have no real evidence that whatever is causing the messages to > be issued "in real life" is actually related to my "articial" recreation > of the symptoms. > > I poked around a bit, and found that the messages are issued from > src/sys/dev/ipmi/ipmi_kcs.c, where device_printf(9) is used to emit > them. (Eventually, that gets around to using printf(9), which is why I > was unable to merely use syslog.conf to stop spamming the console.) > > As a proof-of-concept, I cobbled up the attached patch (with significant > help from a colleague). It seems to do as intended: suppress the > messages for a "normal" boot, but emit them for a verbose boot. > > I'm not at all happy with that approach: it's way too far on the > "treating the symptom" side than "fixing the root cause." On the other > hand, it's not clear that emitting a message about which I can do > nothing useful is all that ... useful. :-} > > But ... none of the machines to which I have access and have the > requisite hardware are running stock FreeBSD. > > Is anyone able to reproduce this on a stock FreeBSD machine? > > Better: anyone have a better idea how to address thhe issue? So the issue is probably that the BMC controller on your box is sometimes slow in responding. The completion code is the third byte of the reply we wait to read after sending a request to the BMC via KCS. However, the first two bytes just echo back the request ID and command we asked for, so it may be that the BMC echoes those back right away without waiting for whatever work it needs to do to handle the request to complete, but doesn't send the completion code (the status of the request) until the request is fully processed. The driver is complaining that the BMC didn't respond with the completion code before it's timeout expired. The default timeout is MAX_TIMEOUT in sys/dev/ipmi/ipmivars.h which corresponds to 6 seconds. It may be that occasionally some "background" task runs in the BMC OS that delays responses to handling commands. It could also be that whatever work the BMC has to do to read this specific value is actually timing out or having issues in the hardware, etc. You could try increasing the timeout in MAX_TIMEOUT (just increase '6' to however many seconds you want to tolerate), but keep in mind that the CPU sits and spins polling for a reply (though the cure may be worse than the disease). You might also try polling this sensor less often. We could maybe use ppsratecheck() to rate limit the errors, but that's sort of papering over the problem that the BMC is timing out the request. A larger option is to modify the IPMI driver to support interrupt-driven operation (and not just polled) in which case a longer timeout might not hurt so much (you at least wouldn't be spinning on the CPU for N seconds). -- John Baldwin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2855524.PakqtZoDR6>