Date: Sat, 28 May 2022 00:13:52 +0200
From: Paul Floyd <paulf2718@gmail.com>
To: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Hang ast / pipelk / piperd
Message-ID: <84015bf9-8504-1c3c-0ba5-58d0d7824843@gmail.com>
Hi
I'm debugging two issues with Valgrind on FreeBSD 13.1 and 14, one on
amd64 and one on i386.
The first testcase, on i386, creates 10 threads that each just call
pause(). The process then calls fork(); the parent calls pause() and the
child kills the parent. This hang is reproducible.
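
For readers who want to picture the first testcase, here is a minimal sketch of the scenario as described above. It is not the actual Valgrind regression test. The function names, the choice of SIGKILL, and the outer wrapper that runs the scenario in a subprocess (so a caller can observe how it ended) are all assumptions made for illustration.

```c
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_THREADS 10

/* Each thread does nothing but block in pause(). */
static void *pause_thread(void *arg)
{
    (void)arg;
    pause();
    return NULL;
}

/* The scenario from the post: 10 paused threads, then fork(); the parent
 * pauses and the child kills it. Does not return in the parent. */
static void paused_threads_then_fork(void)
{
    pthread_t tids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&tids[i], NULL, pause_thread, NULL) != 0)
            _exit(1);
    }

    pid_t child = fork();
    if (child < 0)
        _exit(1);
    if (child == 0) {
        /* Assumption: the post does not say which signal is used. */
        kill(getppid(), SIGKILL);
        _exit(0);
    }
    pause();    /* parent waits here until the child kills it */
}

/* Hypothetical wrapper (not part of the described testcase): run the
 * scenario in a subprocess and return the signal that terminated it,
 * or -1 if it did not die from a signal. */
int run_scenario_in_subprocess(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        paused_threads_then_fork();
        _exit(1);   /* not reached if the kill is delivered */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Run natively, the wrapper should simply report that the subprocess was killed; the hang described here only shows up when the program runs under Valgrind.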
The second testcase, on amd64, runs a loop of 7 tests, each one
creating 2 threads. The thread function writes either to a global
variable or to various types of TLS, using nanosleep() as a way to yield
between the threads. This hang is intermittent.
The above detail is probably not that relevant.
In both examples Valgrind is hanging with 100% CPU use.
In the ktrace output, at the point where things seem to go wrong, there is:

 9340 none-amd64-freebsd GIO fd 28503 read 1 byte
 "X"
 9340 none-amd64-freebsd RET read 1
 9340 none-amd64-freebsd CSW stop user "ast"
 9340 none-amd64-freebsd CSW resume kernel "pipelk"
 9340 none-amd64-freebsd CSW stop kernel "piperd"
 9340 none-amd64-freebsd CSW resume kernel "pipelk"
 9340 none-amd64-freebsd CSW stop kernel "piperd"
 ... repeat until killed

That read is a pipe used for the Valgrind scheduler lock. The scheduler runs
single threaded, and the read above means that one thread has acquired the
lock and should be able to run.
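
To illustrate what "a pipe used as a lock" means here, the following is a minimal sketch of the general technique. It is not Valgrind's actual source. The struct and function names are hypothetical; the only detail taken from the trace is that the token is a single byte, "X". The pipe holds one token: reading it acquires the lock, and writing it back releases the lock.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical names for illustration only. */
struct pipe_lock {
    int fds[2];     /* fds[0] is the read end, fds[1] is the write end */
};

/* Create the pipe and put one token in it, so the lock starts out free. */
int pipe_lock_init(struct pipe_lock *l)
{
    assert(l != NULL);
    if (pipe(l->fds) != 0)
        return -1;
    return write(l->fds[1], "X", 1) == 1 ? 0 : -1;
}

/* Acquire: block until the single token byte can be read from the pipe.
 * A thread that returns from here holds the lock. */
int pipe_lock_acquire(struct pipe_lock *l)
{
    char token;
    ssize_t n;

    do {
        n = read(l->fds[0], &token, 1);
    } while (n == -1 && errno == EINTR);    /* retry if interrupted */
    return n == 1 ? 0 : -1;
}

/* Release: write the token back so that one waiting reader can proceed. */
int pipe_lock_release(struct pipe_lock *l)
{
    ssize_t n;

    do {
        n = write(l->fds[1], "X", 1);
    } while (n == -1 && errno == EINTR);    /* retry if interrupted */
    return n == 1 ? 0 : -1;
}
```

In this scheme, a read() that returns 1, like the one in the trace, is the moment the calling thread becomes the lock holder.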
Instead it looks like there is an AST that gets the kernel stuck in
context switches between the pipe read ("piperd") and pipe lock ("pipelk")
states. kill -9 is the only way out.

This all worked OK from FreeBSD 11.3 to 13.0.

It's quite difficult to trace this within Valgrind. Both hangs seem quite
sensitive to timing: in both cases adding or changing nanosleep times seems
to make them no longer hang. Adding debug statements to Valgrind can also
change the behaviour (and is also unsafe when not holding the scheduler
lock).

Does this look like a kernel bug?

A+
Paul
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?84015bf9-8504-1c3c-0ba5-58d0d7824843>
