Date:      Mon, 30 May 2022 10:15:43 -0400
From:      Mark Johnston <markj@freebsd.org>
To:        Paul Floyd <paulf2718@gmail.com>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Hang ast / pipelk / piperd
Message-ID:  <YpTRj7jVE0jfbxPO@nuc>
In-Reply-To: <dca6a5b4-6f0c-98c0-2f2d-6e5da7405af4@gmail.com>
References:  <84015bf9-8504-1c3c-0ba5-58d0d7824843@gmail.com> <dca6a5b4-6f0c-98c0-2f2d-6e5da7405af4@gmail.com>

On Mon, May 30, 2022 at 12:19:15AM +0200, Paul Floyd wrote:
> 
> On 5/27/22 22:13, Paul Floyd wrote:
> >
> > Hi
> >
> > I'm debugging two issues with Valgrind on FreeBSD 13.1 and 14, one on 
> > amd64 and one on i386.
> >
> ...
> > Both hangs seem quite sensitive to timing - in both cases adding or 
> > changing nanosleep times seem to make them no longer hang.
> > Adding debug statements to Valgrind can also change the behaviour 
> > (and is also unsafe when not holding the scheduler lock). Does this 
> > look like a kernel bug?
> 
> [...]
>
> Under gdb I see (and this is quite variable)
> 
> (gdb) info thread
>    Id   Target Id                 Frame
> * 1    LWP 100073 of process 861 vgModuleLocal_do_syscall_for_client_WRK 
> () at m_syswrap/syscall-amd64-freebsd.S:135
>    2    LWP 100215 of process 861 
> vgModuleLocal_do_syscall_for_client_WRK () at 
> m_syswrap/syscall-amd64-freebsd.S:135
>    3    LWP 100216 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    4    LWP 100217 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    5    LWP 100218 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    6    LWP 100219 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    7    LWP 100220 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    8    LWP 100221 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    9    LWP 100222 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    10   LWP 100223 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    11   LWP 100224 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    12   LWP 100225 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    13   LWP 100226 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    14   LWP 100227 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    15   LWP 100228 of process 861 0x00000000380bffac in do_syscall_WRK ()
> 
> do_syscall_WRK is the syscall interface for the Valgrind host, so those
> will be the threads waiting for the lock.
> 
> Threads 1 and 2 are in do_syscall_for_client, the interface for guest
> syscalls. Thread 1 is doing a _umtx_op syscall, for the pthread_join. 
> Thread 2 is doing a nanosleep. These are blocking syscalls so they
> release the lock before making the syscall to allow other threads to
> execute.
> 
> I think that in the snapshot above, the lock is released and one
> of threads 3 to 15 should be obtaining the lock and running.
> 
> That's where the kernel context switch / AST seems to be going wrong.
> 
> I don't see anything particularly wrong on the Valgrind side.
> 
> Any ideas what I can do to see why the context switch is hanging?

"procstat -kk <valgrind PID>" might help to reveal what's going on,
since it sounds like the hang/livelock is happening somewhere in the
kernel.
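
As a rough sketch of how that could be run: the helper below is hypothetical
(the name dump_kstacks, the repeat count and the interval are my assumptions,
not anything from the tools themselves). It only wraps "procstat -kk" in a
loop, on the assumption that taking a few samples a second apart helps tell a
true hang (identical kernel stacks every time) from a livelock (stacks that
keep changing).

```shell
#!/bin/sh
# Hypothetical helper: sample the kernel stacks of every thread in a process.
# Usage: dump_kstacks <pid> [samples] [interval-seconds]
dump_kstacks() {
    pid=$1
    samples=${2:-3}      # assumed default: three samples
    interval=${3:-1}     # assumed default: one second apart

    if [ -z "$pid" ]; then
        echo "usage: dump_kstacks <pid> [samples] [interval]" >&2
        return 2
    fi
    # procstat is part of the FreeBSD base system; bail out elsewhere.
    if ! command -v procstat >/dev/null 2>&1; then
        echo "procstat not found (FreeBSD only)" >&2
        return 1
    fi

    i=0
    while [ "$i" -lt "$samples" ]; do
        date
        # -kk prints each thread's kernel stack with function offsets.
        procstat -kk "$pid"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

With the process from the gdb session above that would be something like
"dump_kstacks 861", probably run as root so the kernel stacks are readable.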


