Date: Sat, 28 May 2022 00:13:52 +0200
From: Paul Floyd <paulf2718@gmail.com>
To: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Hang ast / pipelk / piperd
Message-ID: <84015bf9-8504-1c3c-0ba5-58d0d7824843@gmail.com>
Hi

I'm debugging two issues with Valgrind on FreeBSD 13.1 and 14, one on amd64 and one on i386.

The first testcase, on i386, creates 10 threads that each just call pause(). Then there is a fork(): the parent does a pause() and the child kills the parent. The error is reproducible.

The second testcase, on amd64, runs a loop of 7 tests, each one creating 2 threads. The thread function writes either to a global variable or to various types of TLS, using nanosleep as a way to yield between the threads. This hang is intermittent.

The above detail is probably not that relevant.

In both examples Valgrind hangs with 100% CPU use.

In ktrace, where things seem to go wrong, there is

  9340 none-amd64-freebsd GIO fd 28503 read 1 byte "X"
  9340 none-amd64-freebsd RET read 1
  9340 none-amd64-freebsd CSW stop user "ast"
  9340 none-amd64-freebsd CSW resume kernel "pipelk"
  9340 none-amd64-freebsd CSW stop kernel "piperd"
  9340 none-amd64-freebsd CSW resume kernel "pipelk"
  9340 none-amd64-freebsd CSW stop kernel "piperd"
  ... repeat until killed

That read is on a pipe used for the Valgrind scheduler lock. The scheduler runs single-threaded, and the read above means that one thread has acquired the lock and should be able to run. Instead, it looks like an AST (asynchronous system trap) gets the kernel stuck context-switching between the pipe lock ("pipelk") and pipe read ("piperd") wait states. kill -9 is the only way out.

This all worked OK from FreeBSD 11.3 to 13.0.

It's quite difficult to trace this within Valgrind. Both hangs seem quite sensitive to timing: in both cases, adding or changing the nanosleep times seems to make them no longer hang. Adding debug statements to Valgrind can also change the behaviour (and is also unsafe when not holding the scheduler lock).

Does this look like a kernel bug?
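For reference, the first (i386) testcase as described can be sketched in C roughly as below. This is a reconstruction from the description only, not the actual test source: the signal the child sends (SIGKILL here) is an assumption, and run_testcase() is a hypothetical wrapper that runs the test in its own process so the caller survives and can see which signal ended it. Run natively it should simply terminate; the hang is reported when the program runs under Valgrind.

```c
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NTHREADS 10

/* Each worker thread does nothing but wait for a signal. */
static void *sleeper(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

/* The testcase proper: 10 threads in pause(), then fork(). The child
   kills the parent while the parent sits in pause().
   Assumption: SIGKILL; the original description does not name the signal. */
static void testcase_main(void)
{
    pthread_t tids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        if (pthread_create(&tids[i], NULL, sleeper, NULL) != 0)
            _exit(2);

    pid_t pid = fork();
    if (pid < 0)
        _exit(3);
    if (pid == 0) {
        /* Child: only async-signal-safe calls after fork() in a threaded process. */
        kill(getppid(), SIGKILL);
        _exit(0);
    }
    pause();
    _exit(1);   /* only reached if pause() returns without the process dying */
}

/* Hypothetical harness: run the testcase in a separate process and
   return the signal that terminated it, or -1 if it was not killed. */
static int run_testcase(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        testcase_main();   /* never returns */

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Natively, run_testcase() should report that the test process died from SIGKILL.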
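To illustrate what "a pipe used for the scheduler lock" means, here is a minimal sketch of a pipe-based mutex, assuming a design where the pipe holds a single token byte and whoever reads it owns the lock. The struct and function names are hypothetical and this is not Valgrind's actual code; the "X" token byte is chosen only because it matches the byte read in the ktrace.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical pipe-based mutex: the pipe holds at most one token byte.
   Reading the token acquires the lock; writing it back releases it. */
struct pipe_lock {
    int fd[2];              /* fd[0] = read end, fd[1] = write end */
};

/* Create the pipe and put one token in it, so the lock starts out free. */
static int pipe_lock_init(struct pipe_lock *l)
{
    if (pipe(l->fd) != 0)
        return -1;
    return write(l->fd[1], "X", 1) == 1 ? 0 : -1;
}

/* Take the token. If another thread holds it, read() sleeps in the
   kernel; on FreeBSD that sleep shows up in ktrace as "piperd". */
static int pipe_lock_acquire(struct pipe_lock *l)
{
    char token;
    ssize_t n;

    do {
        n = read(l->fd[0], &token, 1);
    } while (n < 0 && errno == EINTR);   /* retry if interrupted by a signal */
    return n == 1 ? 0 : -1;
}

/* Put the token back so that a blocked reader can take it. */
static int pipe_lock_release(struct pipe_lock *l)
{
    ssize_t n;

    do {
        n = write(l->fd[1], "X", 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}
```

With this design every hand-off of the lock is a write()/read() pair on the pipe, so each hand-off passes through the kernel's pipe code, including its internal pipe lock.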
A+
Paul