Date: Tue, 1 Aug 2023 14:47:55 +0000 From: Stephan de Wit <stephan.de.wit@deciso.com> To: "freebsd-hackers@FreeBSD.org" <freebsd-hackers@FreeBSD.org> Subject: lockmgr() in network driver context Message-ID: <AM6P193MB0311A026447E2B8523F4AE89BC0AA@AM6P193MB0311.EURP193.PROD.OUTLOOK.COM>
next in thread | raw e-mail | index | archive | help
[-- Attachment #1 --] Hi all, I’m encountering i2c bus failures that need more graceful error handling. The driver operates its management threads in polling mode within callout context. As a hardware constraint, communication via said i2c bus can only happen for a single network port at a time and thus needs to be mtx_lock()'d. With this lock held, i2c transfers are initiated by the driver and the done event is signaled via an ISR. Since we cannot sleep both in callout (not allowed according to [1]) and mutex context (will panic), the second thread waiting on the bus will spin if it does not live on the same CPU[1], and the i2c transfer itself is handled via a DELAY() busy loop. If the bus fails, we have 2 cores spinning at 100%, causing connections to drop and all sorts of other carnage to happen. Reducing the timeout is not an option, as the transfers might take a while. Sleeping under a MTX_DEF is not possible, instant panic. However, if I replace the first mtx_lock() with a lockmgr() call that was initialized with LK_TIMELOCK | LK_CANRECURSE and a given timeout, I'm able to put the thread waiting on the comms bus to be available to sleep. If I replace the hardcoded DELAY()s further down the line during the i2c transfers with mtx_sleep() (and wakeup_one() in the ISR), no CPU cycles are wasted, this is also the case for normal operation, improving the current situation by a mile. As to the nature of these bus failures, they seem to happen very infrequently and nigh on impossible to reproduce on a production system, but they always seem to recover after a certain amount of time. To reproduce the issue, I have wired some physical switches to the i2c bus to simulate the outages. From a design perspective, the i2c bus should be a singleton and should be treated as such via the queueing of transfer tasks including metadata to link such transfers back to specific port ids, however, a change like this would require significant effort with a large risk of regressions, and I'm not even sure if sleeping in such a scenario would be considered a valid thing to do in FreeBSD context, so I would prefer sticking to the battle-hardened code. The documentation for lockmgr() tells me to stay away from it, but I seem to have little options left. Also, the solution seems to violate the FreeBSD guidelines for sleeping in callout context. Any thoughts? Thanks in advance. [1] https://man.freebsd.org/cgi/man.cgi?locking(9) [-- Attachment #2 --] <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="Generator" content="Microsoft Word 15 (filtered medium)"> <style><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0cm; font-size:11.0pt; font-family:"Calibri",sans-serif; mso-ligatures:standardcontextual; mso-fareast-language:EN-US;} a:link, span.MsoHyperlink {mso-style-priority:99; color:#0563C1; text-decoration:underline;} p.MsoPlainText, li.MsoPlainText, div.MsoPlainText {mso-style-priority:99; mso-style-link:"Plain Text Char"; margin:0cm; font-size:11.0pt; font-family:"Calibri",sans-serif; mso-ligatures:standardcontextual; mso-fareast-language:EN-US;} span.EmailStyle17 {mso-style-type:personal-compose; font-family:"Calibri",sans-serif; color:windowtext;} span.PlainTextChar {mso-style-name:"Plain Text Char"; mso-style-priority:99; mso-style-link:"Plain Text"; font-family:"Calibri",sans-serif;} .MsoChpDefault {mso-style-type:export-only; font-family:"Calibri",sans-serif; mso-fareast-language:EN-US;} @page WordSection1 {size:612.0pt 792.0pt; margin:72.0pt 72.0pt 72.0pt 72.0pt;} div.WordSection1 {page:WordSection1;} --></style><!--[if gte mso 9]><xml> <o:shapedefaults v:ext="edit" spidmax="1026" /> </xml><![endif]--><!--[if gte mso 9]><xml> <o:shapelayout v:ext="edit"> <o:idmap v:ext="edit" data="1" /> </o:shapelayout></xml><![endif]--> </head> <body lang="en-NL" link="#0563C1" vlink="#954F72" style="word-wrap:break-word"> <div class="WordSection1"> <p class="MsoPlainText"><span lang="en-NL">Hi all,<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="EN-US">I’m encountering i2c bus failures that need more graceful error handling. <o:p></o:p></span></p> <p class="MsoPlainText"><span lang="EN-US">The driver operates its management threads in polling mode within callout context.<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">As a hardware constraint, communication via said i2c bus can only<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">happen for a single network port at a time and thus needs to be mtx_lock()'d.<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">With this lock held, i2c transfers are initiated by the driver and the done<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">event is signaled via an ISR. Since we cannot sleep both in callout (not<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">allowed according to [1]) and mutex context (will panic), the second thread<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">waiting on the bus will spin if it does not live on the same CPU[1], and the<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">i2c transfer itself is handled via a DELAY() busy loop. If the bus fails,<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">we have 2 cores spinning at 100%, causing connections to drop and all<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">sorts of other carnage to happen. Reducing the timeout is not an option, as<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">the transfers might take a while. <o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">Sleeping under a MTX_DEF is not</span><span lang="en-NL"> </span><span lang="en-NL">possible, instant panic. However, if I replace <o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">the first mtx_lock</span><span lang="EN-US">()</span><span lang="en-NL"> with a</span><span lang="en-NL"> </span><span lang="en-NL">lockmgr() call that was initialized with LK_TIMELOCK | <o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">LK_CANRECURSE and a given</span><span lang="en-NL"> </span><span lang="en-NL">timeout, I'm able to put the thread waiting on<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">the comms bus to be available to</span><span lang="en-NL"> </span><span lang="en-NL">sleep. If I replace the hardcoded DELAY()s</span><span lang="en-NL"> </span><span lang="EN-US"><o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">further down the line during the</span><span lang="en-NL"> </span><span lang="en-NL">i2c transfers with mtx_sleep() (and wakeup_one() in the ISR), <o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">no CPU</span><span lang="en-NL"> </span> <span lang="en-NL">cycles are wasted, this is also the case for normal operation</span><span lang="EN-US">,<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="EN-US">improving the current situation by a mile.<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">As to the</span><span lang="en-NL"> </span> <span lang="en-NL">nature of these bus failures, they seem to happen very infrequently and nigh<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">on impossible to reproduce on a production system, but they always seem<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">to recover after a certain amount of time. To reproduce the issue, I have<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">wired some physical switches to the i2c bus to simulate the outages. <o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">From a</span><span lang="en-NL"> </span> <span lang="en-NL">design perspective, the i2c bus should be a singleton and should be treated<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">as such via the queueing of transfer tasks including metadata to link<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">such transfers back to specific port ids, however, a change like this would<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">require significant effort with a large risk of regressions, and I'm not even<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">sure if sleeping in such a scenario would be considered a valid thing to do<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">in FreeBSD context, so I would prefer sticking to the battle-hardened<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">code.<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">The documentation for lockmgr() tells me to stay away from it, but I<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">seem to have little options left. </span> <span lang="EN-US">Also, the solution seems to violate<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="EN-US">the FreeBSD guidelines for sleeping in callout context. </span><span lang="en-NL">Any thoughts? <o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="EN-US">Thanks in advance.<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">[1]</span><span lang="en-NL"> </span><span lang="en-NL"><a href="https://man.freebsd.org/cgi/man.cgi?locking(9)">https://man.freebsd.org/cgi/man.cgi?locking(9)</a><o:p></o:p></span></p> <p class="MsoNormal"><span lang="en-NL"><o:p> </o:p></span></p> </div> </body> </html>
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?AM6P193MB0311A026447E2B8523F4AE89BC0AA>
