From owner-freebsd-hackers@freebsd.org Mon Nov 25 14:45:55 2019 Return-Path: Delivered-To: freebsd-hackers@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 90DBC1AF024 for ; Mon, 25 Nov 2019 14:45:55 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) server-signature RSA-PSS (4096 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 47M8wg1kgSz4VkS; Mon, 25 Nov 2019 14:45:54 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id xAPEjkdk002853 (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NO); Mon, 25 Nov 2019 16:45:50 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua xAPEjkdk002853 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id xAPEjkY8002852; Mon, 25 Nov 2019 16:45:46 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 25 Nov 2019 16:45:46 +0200 From: Konstantin Belousov To: Kyle Evans Cc: FreeBSD Hackers Subject: Re: ptrace(2) debugging Message-ID: <20191125144546.GD10580@kib.kiev.ua> References: <20191124113956.GY2707@kib.kiev.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.12.2 (2019-09-21) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FORGED_GMAIL_RCVD,FREEMAIL_FROM, NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on tom.home X-Rspamd-Queue-Id: 47M8wg1kgSz4VkS X-Spamd-Bar: ----- Authentication-Results: mx1.freebsd.org; none X-Spamd-Result: default: False [-5.99 / 15.00]; NEURAL_HAM_MEDIUM(-0.99)[-0.995,0]; NEURAL_HAM_LONG(-1.00)[-0.997,0]; REPLY(-4.00)[] X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Nov 2019 14:45:55 -0000 On Sun, Nov 24, 2019 at 09:26:40PM -0600, Kyle Evans wrote: > On Sun, Nov 24, 2019 at 8:25 PM Kyle Evans wrote: > > > > On Sun, Nov 24, 2019 at 9:17 AM Kyle Evans wrote: > > > > > > On Sun, Nov 24, 2019 at 5:40 AM Konstantin Belousov wrote: > > > > > > > > On Sun, Nov 24, 2019 at 12:01:04AM -0600, Kyle Evans wrote: > > > > > Hi, > > > > > > > > > > I'm working on implementing `reptyr -T` on FreeBSD because I'm pretty > > > > > bad about starting long-running jobs outside of tmux and often desire > > > > > to reparent these jobs into tmux. I've gotten to a point where it's > > > > > getting stuck in waitpid(2) when attempting to work over the session > > > > > leader to ignore SIGHUP. The chain of operations looks roughly like > > > > > this: > > > > > > > > > > PT_ATTACH -> waitpid -> kill(SIGCONT) -> PT_TO_SCE -> waitpid -> > > > > > PT_TO_SCE -> waitpid > > > > > > > > > > Each of the waitpids are paired with a PT_LWPINFO. The first waitpid > > > > > observes SIGSTOP. The second waitpid observes SIGCONT. I would expect > > > > > the third to observe PL_FLAG_SCE on ptrace_lwpinfo->pl_flags, but > > > > > instead it actually hangs as the target process is now sleep-inhibited > > > > > and stuck in "pause" wchan. > > > > > > > > > > I've uploaded a truss excerpt at [0] in case it's helpful -- pid=10204 > > > > > is the process I'm reparenting, initially just attached/detached to > > > > > make sure reptyr *can* do this. pid=10187 is the sshd that it's > > > > > running under, and pid=10188 is the shell running under that. > > > > > > > > > > Anyone have good advice on debugging this? It seems like it might be > > > > > some kind of kernel bug, as it's already done this same dance once > > > > > before when grabbing sshd and my attempts to distill it down to a > > > > > simple test case failed. The FreeBSD part of reptyr needed some love, > > > > > though, so that can't be discounted either. > > > > > > > > > > Thanks, > > > > > > > > > > Kyle Evans > > > > > > > > > > [0] https://people.freebsd.org/~kevans/truss.log > > > > How much work would be to provide a self-contained standalone test ? > > > > > > I'm still struggling to write a self-contained example... > > > unfortunately a basic attach and trace them all to syscall entry isn't > > > sufficient. I'm slowly removing surface area from reptyr to try and > > > narrow it down- its operations between attaching to sshd and the > > > misbehavior are quite extensive, as it mmaps a page into the target, > > > opens a socket established by reptyr and passes an fd back over it. > > > > I managed to narrow it down, kind of. The problem is specifically with > > trying to trace zsh as a session leader. Easiest reproducer is to > > change shell to zsh and run this: > > https://people.freebsd.org/~kevans/ptrace_test.c -> you'll hang and > > have to ^C that sucker. My experiments showed that running this on zsh > > spawned any other way is fine, and changing shell to /bin/sh is also > > fine. > > > > Follow up part three, zsh is in sigsuspend() while a child is > executing and this is the cause. More effective reproducer: > https://people.freebsd.org/~kevans/ptrace_test2.c -> the behavior > makes a little more sense to me, but that seems less than ideal. It is still not quite a reproducer. Can you modify the test to clearly indicate what you want vs. what you get ? I see the legitimate loop of the parent (debugger) doing 59326 ptrace_test2 CALL ptrace(PT_ATTACH,0xe7bf,0x1,0) 59326 ptrace_test2 RET ptrace 0 59326 ptrace_test2 CALL wait4(0xe7bf,0x7fffffffe3bc,0,0) 59326 ptrace_test2 RET wait4 59327/0xe7bf 59326 ptrace_test2 CALL ptrace(PT_LWPINFO,0xe7bf,0x7fffffffe3c0,0xa0) 59326 ptrace_test2 RET ptrace 0 59326 ptrace_test2 CALL ptrace(PT_FOLLOW_FORK,0xe7bf,0,0x1) 59326 ptrace_test2 RET ptrace 0 59326 ptrace_test2 CALL ptrace(PT_GETREGS,0xe7bf,0x7fffffffe300,0) 59326 ptrace_test2 RET ptrace 0 59326 ptrace_test2 CALL kill(0xe7bf,SIGCONT) 59326 ptrace_test2 RET kill 0 59326 ptrace_test2 CALL ptrace(PT_TO_SCE,0xe7bf,0x1,0) 59326 ptrace_test2 RET ptrace 0 59326 ptrace_test2 CALL wait4(0xe7bf,0x7fffffffe25c,0,0) 59326 ptrace_test2 RET wait4 59327/0xe7bf 59326 ptrace_test2 CALL ptrace(PT_LWPINFO,0xe7bf,0x7fffffffe260,0xa0) 59326 ptrace_test2 RET ptrace 0 59326 ptrace_test2 CALL ptrace(PT_DETACH,0xe7bf,0x1,0) 59326 ptrace_test2 RET ptrace 0 in a loop, and I do not see anything wrong with it.