From owner-freebsd-net@FreeBSD.ORG Fri Oct 31 18:25:59 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 0452A46C for ; Fri, 31 Oct 2014 18:25:59 +0000 (UTC) Received: from mail-wg0-x22e.google.com (mail-wg0-x22e.google.com [IPv6:2a00:1450:400c:c00::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 86E3EAA0 for ; Fri, 31 Oct 2014 18:25:58 +0000 (UTC) Received: by mail-wg0-f46.google.com with SMTP id x13so7154267wgg.33 for ; Fri, 31 Oct 2014 11:25:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=XBgRBhnFUqg3hWrMVUXSuVycxgeG4MC10bv24B39YGc=; b=pVoR21bc0fsCBNyxilsDHZzhNRMNu++nGyWXG3ICYk5J85pNFsB24b2+LE0n5EcxLv z1Nw09ZSf1cDSSstXTW2wYF2MLMuB2h9cCnvia3PKR+eJ2bv6Y5VVRux2t2ZBlLBJBSr Ag1MjCtT8JaP69QCUu2y08wTF80pv+R7Ykf9tDWrF1iNt1U8pWfZ4lHPvZIbLwPJjPl6 /2+izDw1D25HcRVCoJdDa0+66YX/aR07a3UO9etXsxTYoUQlvC6IoN7QdRe17qTX/mcf WTgq04S8TFR2O6pQ5Ib4lBFRLIaUOrVGZPn18cyM5u5jnYvLVXKHnF75H9bva4ctSIJZ KzlQ== X-Received: by 10.180.12.136 with SMTP id y8mr5702358wib.73.1414779956671; Fri, 31 Oct 2014 11:25:56 -0700 (PDT) Received: from dft-labs.eu (n1x0n-1-pt.tunnel.tserv5.lon1.ipv6.he.net. [2001:470:1f08:1f7::2]) by mx.google.com with ESMTPSA id s10sm7426930wix.14.2014.10.31.11.25.55 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Fri, 31 Oct 2014 11:25:55 -0700 (PDT) Date: Fri, 31 Oct 2014 19:25:53 +0100 From: Mateusz Guzik To: elof2@sentor.se Subject: Re: Unable to kill a non-zombie process with -9 Message-ID: <20141031182552.GA20591@dft-labs.eu> References: <20141009222926.GC1852@funkthat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-net , John-Mark Gurney , snort-devel mailinglist X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Oct 2014 18:25:59 -0000 On Wed, Oct 15, 2014 at 11:41:33AM +0200, elof2@sentor.se wrote: > > Hi! > > Today the problem reoccurred. > I've now debugged the problem a little furter. > > I'm starting snort (as root). > > <<>> > Oct 15 08:46:59 snort[22646]: Initializing daemon mode > Oct 15 08:46:59 snort[22648]: Daemon initialized, signaled parent > pid: 22646 > Oct 15 08:46:59 snort[22648]: Reload thread starting... > Oct 15 08:46:59 snort[22648]: Reload thread started, thread > 0x8146e8800 (22648) > End of log. > > Error! Nothing more happens with the snort process! > Normally it should continue and log these lines as well: > > > snort[nnn]: Decoding Ethernet > snort[nnn]: Checking PID path... > snort[nnn]: PID path stat checked out ok, PID path set to /var/run/ > snort[nnn]: Writing PID "7627" to file "/var/run//snort_mon0.pid" > snort[nnn]: Chroot directory = /usr/foobar/log > snort[nnn]: Set gid to 100 > snort[nnn]: Set uid to 100 > snort[nnn]: > snort[nnn]: --== Initialization Complete ==-- > snort[nnn]: Commencing packet processing (pid=nnn) > > > > > > When looking at this half-started snort process with 'ps', it looks > like this: > > ps faxulwwj 22648 > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND > UID PPID CPU PRI NI MWCHAN PGID SID JOBC > root 22648 51.8 1.1 488552 179344 - Rs 8:46AM 53:06.52 > /usr/local/bin/s 0 1 0 88 0 - 22648 22648 0 > > > The process is still owned by root, so just as the missing log lines > are saying, it has not yet performed any change of uid/gid. > > > > > So there seem to be two questions. > > Q1) > What happens between "Reload thread started, thread 0x8146e8800 > (22648)" and "Decoding Ethernet"? > Apparently something goes wrong here on FreeBSD 10.0. > (this problem does not always occur, sometimes snort start just fine) > Nobody can tell with this data. > Q2) > When the process has frozen in this half-started state, it can't be > killed even with a -9. Why? > Temporarily unkillable processes are standard, you can encounter them in BSDs, Solaris, Linux and I would not be surprised if Windows as well. In short, here and there the kernel does something blocking which is not supposed to fail. For instance in a lot of places FreeBSD kernel just waits for memory to be free when allocating. However, if stuff blocks indefinitely, we may be dealing with a bug. > > > > John-Mark asked me for some debugging info. Here it is: > > I now run 'kill 22648' on the above semi-started process: > > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME > COMMAND UID PPID CPU PRI NI MWCHAN PGID SID JOBC > old root 22648 51.8 1.1 488552 179344 - Rs 8:46AM 53:06.52 > /usr/local/bin/s 0 1 0 88 0 - 22648 22648 0 > new root 22648 52.3 1.1 488552 179344 - Rs 8:46AM 53:36.48 > /usr/local/bin/s 0 1 0 52 0 - 22648 22648 0 > > No change. > > > > kill -9 22648 > > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME > COMMAND UID PPID CPU PRI NI MWCHAN PGID SID JOBC > old root 22648 51.8 1.1 488552 179344 - Rs 8:46AM 53:06.52 > /usr/local/bin/s 0 1 0 88 0 - 22648 22648 0 > new root 22648 37.7 1.1 488552 179344 - Ts 8:46AM 53:50.87 > /usr/local/bin/s 0 1 0 52 0 - 22648 22648 0 > > Less CPU-usage and STAT changed to "Ts". > > > > > kill -CONT 22648 > > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME > COMMAND UID PPID CPU PRI NI MWCHAN PGID SID JOBC > old root 22648 51.8 1.1 488552 179344 - Rs 8:46AM 53:06.52 > /usr/local/bin/s 0 1 0 88 0 - 22648 22648 0 > new root 22648 0.0 1.1 488552 179344 - Ts 8:46AM 53:50.88 > /usr/local/bin/s 0 1 0 52 0 - 22648 22648 0 > > No change except cpu is down to 0. > > > I now start 'kgdb' > info threads > I found two threads for snort, doing a bt for both of them: > 372 Thread 100602 (PID=22648: snort) sched_switch > (td=0xfffff802c061f490, newtd=, flags= optimized out>) at /usr/src/sys/kern/sched_ule.c:1962 > 371 Thread 100598 (PID=22648: snort) sched_switch > (td=0xfffff80221857000, newtd=, flags= optimized out>) at /usr/src/sys/kern/sched_ule.c:1962 > thread 372 > [Switching to thread 372 (Thread 100602)]#0 sched_switch > (td=0xfffff802c061f490, newtd=, flags= optimized out>) at /usr/src/sys/kern/sched_ule.c:1962 > 1962 in /usr/src/sys/kern/sched_ule.c > bt > #0 sched_switch (td=0xfffff802c061f490, newtd=, > flags=) at /usr/src/sys/kern/sched_ule.c:1962 > #1 0xffffffff808b8c1e in mi_switch (flags=266, newtd=0x0) at > /usr/src/sys/kern/kern_synch.c:494 > #2 0xffffffff808c04b0 in thread_suspend_switch > (td=0xfffff802c061f490) at /usr/src/sys/kern/kern_thread.c:883 > #3 0xffffffff808c0276 in thread_single (mode=1) at > /usr/src/sys/kern/kern_thread.c:713 > #4 0xffffffff8087c1bb in exit1 (td=0xfffff802c061f490, rv=9) at > /usr/src/sys/kern/kern_exit.c:180 > #5 0xffffffff808b2faf in sigexit (td=, > sig=) at /usr/src/sys/kern/kern_sig.c:2935 > #6 0xffffffff808b3669 in postsig (sig=) at > /usr/src/sys/kern/kern_sig.c:2822 > #7 0xffffffff808f6f57 in ast (framep=) at > /usr/src/sys/kern/subr_trap.c:271 > #8 0xffffffff80c75870 in Xfast_syscall () at > /usr/src/sys/amd64/amd64/exception.S:416 > #9 0x0000000801d6f19a in ?? () > Previous frame inner to this frame (corrupt stack?) > > > thread 371 > [Switching to thread 371 (Thread 100598)]#0 sched_switch > (td=0xfffff80221857000, newtd=, flags= optimized out>) at /usr/src/sys/kern/sched_ule.c:1962 > 1962 in /usr/src/sys/kern/sched_ule.c > bt > #0 sched_switch (td=0xfffff80221857000, newtd=, > flags=) at /usr/src/sys/kern/sched_ule.c:1962 > #1 0xffffffff808b8c1e in mi_switch (flags=260, newtd=0x0) at > /usr/src/sys/kern/kern_synch.c:494 > #2 0xffffffff808f2e3a in sleepq_wait (wchan=0x0, pri=0) at > /usr/src/sys/kern/subr_sleepqueue.c:620 > #3 0xffffffff80864aad in _cv_wait (cvp=0xffffffff8147a500, > lock=0xffffffff8147a480) at /usr/src/sys/kern/kern_condvar.c:139 Here your thread got blocked trying to acquire a mutex. You can find the owner by inspecting vm (vm=0xffffffff8147a480) -> vm_lock -> mtx_lock. You may need to cast along the way. Roughly speaking (untested): f 4 p (struct vmem *)0xffffffff8147a480)->vm_lock->mtx_lock > #4 0xffffffff808fb05f in vmem_xalloc (vm=0xffffffff8147a480, > size0=, align=, phase=0, > nocross=, minaddr=0, > maxaddr=18446735286768857088, flags=8194, addrp= out>) at /usr/src/sys/kern/subr_vmem.c:1196 > #5 0xffffffff808fae6b in vmem_alloc (vm=0x0, size=0, flags= optimized out>, addrp=0xfffffe0466e1d6e8) at > /usr/src/sys/kern/subr_vmem.c:1082 > #6 0xffffffff80b0fa58 in kmem_malloc (vmem=0xffffffff8147a480, > size=2139729920, flags=2) at /usr/src/sys/vm/vm_kern.c:314 > #7 0xffffffff80b08dfb in uma_large_malloc (size= out>, wait=2) at /usr/src/sys/vm/uma_core.c:1006 > #8 0xffffffff80898cf3 in malloc (size=2139729920, > mtp=0xffffffff813a0450, flags=0) at > /usr/src/sys/kern/kern_malloc.c:520 > #9 0xffffffff8096307b in bpf_buffer_ioctl_sblen > (d=0xfffff80159ea9000, i=) at > /usr/src/sys/net/bpf_buffer.c:183 > #10 0xffffffff80960a3c in bpfioctl (dev=0x0, cmd= out>, addr=0xfffff801fbd06b40 "", flags=0, td=0xfffff80221857000) at > /usr/src/sys/net/bpf.c:408 > #11 0xffffffff807ac1df in devfs_ioctl_f (fp=0xfffff8002b3d9d20, > com=3221504614, data=0xfffff801fbd06b40, cred=, > td=0xfffff80221857000) at /usr/src/sys/fs/devfs/devfs_vnops.c:757 > #12 0xffffffff808fdfae in kern_ioctl (td=0xfffff80221857000, > fd=, com=0) at file.h:319 > #13 0xffffffff808fdd2f in sys_ioctl (td=0xfffff80221857000, > uap=0xfffffe0466e1da40) at /usr/src/sys/kern/sys_generic.c:702 > #14 0xffffffff80c8f117 in amd64_syscall (td=0xfffff80221857000, > traced=0) at subr_syscall.c:134 > #15 0xffffffff80c7580b in Xfast_syscall () at > /usr/src/sys/amd64/amd64/exception.S:391 > #16 0x0000000801d8f08a in ?? () > Previous frame inner to this frame (corrupt stack?) > > -- Mateusz Guzik