FreeBSD Mail Archives

Date:      Sat, 28 May 2022 00:13:52 +0200
From:      Paul Floyd <paulf2718@gmail.com>
To:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Hang ast / pipelk / piperd
Message-ID:  <84015bf9-8504-1c3c-0ba5-58d0d7824843@gmail.com>


Hi

I'm debugging two issues with Valgrind on FreeBSD 13.1 and 14, one on 
amd64 and one on i386.

The first testcase, on i386, creates 10 threads that each just call 
pause(). Then there is a fork(); the parent calls pause() and the child 
kills the parent. The error is reproducible.
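
For readers who want something concrete, here is a minimal sketch of the shape of that first testcase. It is not the actual Valgrind regression test. The signal (SIGTERM), the short delay before the kill, and the outer run_pause_fork_kill() wrapper (which runs the testcase in a subprocess so the caller survives and can see how it died) are all assumptions added for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NUM_THREADS 10 /* thread count taken from the description */

/* Each thread simply blocks in pause() until a signal arrives. */
static void *pauser(void *arg)
{
    (void)arg;
    pause();
    return NULL;
}

/* Testcase body: not expected to return, the forked child kills us. */
static void pause_fork_kill(void)
{
    pthread_t threads[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, pauser, NULL);
        assert(rc == 0);
        (void)rc;
    }

    pid_t parent = getpid();
    pid_t child = fork();
    if (child < 0)
        _exit(3);

    if (child == 0) {
        /* Child: give the parent a moment to reach pause(), then kill it.
         * The delay and the choice of SIGTERM are assumptions. */
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 100000000 };
        nanosleep(&ts, NULL);
        kill(parent, SIGTERM);
        _exit(0);
    }

    /* Parent: wait in pause() until the child's signal terminates us. */
    for (;;)
        pause();
}

/* Hypothetical wrapper: runs the testcase in a subprocess and returns
 * the signal that terminated it, or -1 on any other outcome. */
int run_pause_fork_kill(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        pause_fork_kill();
        _exit(1); /* not reached */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Built with -pthread, calling run_pause_fork_kill() from main() should return SIGTERM when run natively. To look for the hang, the same program would be run under Valgrind.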

The second testcase, on amd64, runs a loop of 7 tests, each one 
creating 2 threads. The thread function writes either to a global 
variable or to various types of TLS, using a nanosleep as a way to yield 
between the threads. This hang is intermittent.
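
A rough sketch of the shape of that second testcase follows. Again, this is not the real test. Which kinds of storage the 7 tests cover is not stated above, so the three targets used here (an atomic global, a _Thread_local variable, and a pthread key), the number of writes per thread, and the 1 ms sleep are all assumptions, and run_tls_tests() is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define NUM_TESTS 7         /* from the description */
#define THREADS_PER_TEST 2  /* from the description */
#define WRITES_PER_THREAD 5 /* assumption */

/* Assumed set of write targets; the real test may use others. */
enum target { TARGET_GLOBAL, TARGET_STATIC_TLS, TARGET_KEY_TLS, TARGET_COUNT };

static atomic_int global_var;      /* atomic to keep the sketch race-free */
static _Thread_local int tls_var;  /* compiler-managed TLS */
static pthread_key_t tls_key;      /* pthread key based TLS */
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    pthread_key_create(&tls_key, NULL);
}

/* Use a short nanosleep() as a way to yield to the other thread. */
static void yield_briefly(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };
    nanosleep(&ts, NULL);
}

/* Thread function: write to the selected target, yielding between writes. */
static void *writer(void *arg)
{
    enum target t = (enum target)(intptr_t)arg;

    for (int i = 0; i < WRITES_PER_THREAD; i++) {
        switch (t) {
        case TARGET_GLOBAL:
            atomic_store(&global_var, i);
            break;
        case TARGET_STATIC_TLS:
            tls_var = i;
            break;
        default:
            pthread_setspecific(tls_key, (void *)(intptr_t)i);
            break;
        }
        yield_briefly();
    }
    return NULL;
}

/* Runs the loop of tests and returns how many completed. */
int run_tls_tests(void)
{
    int completed = 0;

    pthread_once(&key_once, make_key);
    for (int test = 0; test < NUM_TESTS; test++) {
        pthread_t threads[THREADS_PER_TEST];
        intptr_t target = test % TARGET_COUNT;

        for (int i = 0; i < THREADS_PER_TEST; i++) {
            int rc = pthread_create(&threads[i], NULL, writer, (void *)target);
            assert(rc == 0);
            (void)rc;
        }
        for (int i = 0; i < THREADS_PER_TEST; i++)
            pthread_join(threads[i], NULL);
        completed++;
    }
    return completed;
}
```

Built with -pthread, run_tls_tests() should return 7 when run natively. Since the hang is timing sensitive, changing the sleep length in yield_briefly() may well affect whether it shows up under Valgrind.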

The above detail is probably not that relevant.

In both examples Valgrind is hanging with 100% CPU use.

In the ktrace output, the point where things seem to go wrong looks like 
this:


  9340 none-amd64-freebsd GIO   fd 28503 read 1 byte
       "X"
  9340 none-amd64-freebsd RET   read 1
  9340 none-amd64-freebsd CSW   stop user "ast"
  9340 none-amd64-freebsd CSW   resume kernel "pipelk"
  9340 none-amd64-freebsd CSW   stop kernel "piperd"
  9340 none-amd64-freebsd CSW   resume kernel "pipelk"
  9340 none-amd64-freebsd CSW   stop kernel "piperd"
  ... repeat until killed

That read is on the pipe used for the Valgrind scheduler lock. The 
scheduler runs single threaded, and the read above means that one thread 
has acquired the lock and should be able to run.

Instead it looks like there is an AST that gets the kernel stuck in 
context switches between the pipe read and pipe lock states. kill -9 is 
the only way out.

This all worked OK from FreeBSD 11.3 to 13.0.

It's quite difficult to trace this within Valgrind. Both hangs seem 
quite sensitive to timing: in both cases adding or changing nanosleep 
times seems to make them no longer hang. Adding debug statements to 
Valgrind can also change the behaviour (and is also unsafe when not 
holding the scheduler lock).

Does this look like a kernel bug?

A+
Paul


Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?84015bf9-8504-1c3c-0ba5-58d0d7824843>
