Date: Mon, 30 May 2022 10:15:43 -0400
From: Mark Johnston <markj@freebsd.org>
To: Paul Floyd <paulf2718@gmail.com>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: Hang ast / pipelk / piperd
Message-ID: <YpTRj7jVE0jfbxPO@nuc>
In-Reply-To: <dca6a5b4-6f0c-98c0-2f2d-6e5da7405af4@gmail.com>
References: <84015bf9-8504-1c3c-0ba5-58d0d7824843@gmail.com> <dca6a5b4-6f0c-98c0-2f2d-6e5da7405af4@gmail.com>

On Mon, May 30, 2022 at 12:19:15AM +0200, Paul Floyd wrote:
>
> On 5/27/22 22:13, Paul Floyd wrote:
> >
> > Hi
> >
> > I'm debugging two issues with Valgrind on FreeBSD 13.1 and 14, one on
> > amd64 and one on i386.
> >
> ...
> > Both hangs seem quite sensitive to timing - in both cases adding or
> > changing nanosleep times seem to make them no longer hang.
> >
> > Adding debug statements to Valgrind can also change the behaviour
> > (and is also unsafe when not holding the scheduler lock). Does this
> > look like a kernel bug?
>
> [...]
>
> Under gdb I see (and this is quite variable)
>
> (gdb) info thread
>   Id   Target Id                  Frame
> * 1    LWP 100073 of process 861 vgModuleLocal_do_syscall_for_client_WRK
> () at m_syswrap/syscall-amd64-freebsd.S:135
>   2    LWP 100215 of process 861
> vgModuleLocal_do_syscall_for_client_WRK () at
> m_syswrap/syscall-amd64-freebsd.S:135
>   3    LWP 100216 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   4    LWP 100217 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   5    LWP 100218 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   6    LWP 100219 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   7    LWP 100220 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   8    LWP 100221 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   9    LWP 100222 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   10   LWP 100223 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   11   LWP 100224 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   12   LWP 100225 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   13   LWP 100226 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   14   LWP 100227 of process 861 0x00000000380bffac in do_syscall_WRK ()
>   15   LWP 100228 of process 861 0x00000000380bffac in do_syscall_WRK ()
>
> do_syscall_WRK is the syscall interface for the Valgrind host, and that
> will be the threads waiting for the lock.
>
> Thread 1 and 2 are in do_syscall_for_client, the interface for guest
> syscalls.
> Thread 1 is doing a _umtx_op syscall, for the pthread_join.
> Thread 2 is doing a nanosleep. These are blocking syscalls so they
> release the lock before making the syscall to allow other threads to
> execute.
>
> I think that in the snapshot above, the lock is released and one
> of threads 3 to 15 should be obtaining the lock and running.
>
> That's where the kernel context switch / AST seems to be going wrong.
>
> I don't see anything particularly wrong on the Valgrind side.
>
> Any ideas what I can do to see why the context switch is hanging?

"procstat -kk <valgrind PID>" might help to reveal what's going on,
since it sounds like the hang/livelock is happening somewhere in the
kernel.
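
Editor's illustration: the locking pattern described in the quoted text (a thread drops the scheduler lock before a blocking syscall so that other threads can run, then takes it again afterwards) can be sketched roughly as below. This is not Valgrind code. The names big_lock, blocking_call and run_demo are hypothetical, and a threading.Event wait stands in for the blocking nanosleep or _umtx_op call. It only shows the behaviour one would expect when the hand-off works.

```python
import threading

# Hypothetical stand-in for the single scheduler lock described in the thread.
big_lock = threading.Lock()


def blocking_call(wake, released):
    """Called with big_lock held; drops it around the blocking wait."""
    big_lock.release()          # let other threads run while we block
    released.set()              # tell the demo the lock is now free
    try:
        wake.wait()             # stand-in for a blocking syscall
    finally:
        big_lock.acquire()      # take the lock back before continuing


def run_demo(n_waiters=3):
    """Run one blocking thread and n_waiters threads contending for the lock."""
    log = []
    released = threading.Event()
    wake = threading.Event()

    def blocker():
        with big_lock:
            blocking_call(wake, released)
            log.append("blocker resumed")

    def waiter(i):
        with big_lock:          # succeeds only because the blocker let go
            log.append(f"waiter {i} ran")

    b = threading.Thread(target=blocker)
    b.start()
    released.wait()             # blocker is now blocked without the lock

    waiters = [threading.Thread(target=waiter, args=(i,))
               for i in range(n_waiters)]
    for t in waiters:
        t.start()
    for t in waiters:
        t.join()                # every waiter got the lock in turn

    wake.set()                  # the "syscall" returns
    b.join()
    return log


if __name__ == "__main__":
    print(run_demo())
```

In this sketch every waiter makes progress while the blocker is asleep. In the hang reported above, the equivalent of the waiters (threads 3 to 15) never seems to get that far, which is why looking at the kernel side is the suggested next step.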
