From: "Nagy, Attila" <bra@fsn.hu>
To: freebsd-stable@freebsd.org
Subject: Stuck processes in unkillable (STOP) state, listen queue overflow
Message-ID: <562F4D98.9060200@fsn.hu>
Date: Tue, 27 Oct 2015 11:10:32 +0100

Hi,

Recently I've started to see a lot of cases where the log is full of "listen queue overflow" messages and the process behind the network socket is unresponsive. When I open a TCP connection to it, the connection is established but nothing happens (for example, I get no SMTP banner from postfix, nor do I get a log entry about the new connection). I've seen this with Java programs, postfix and redis, basically everything that opens a TCP socket and listens on the machine.

For example, I have a redis process which listens on port 6381. When I telnet to it, the TCP connection opens, but the program doesn't respond. When I kill it, nothing happens.
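In case it's relevant, this is roughly how I look at the listen queue fill of the stuck listener (a sketch, flags from memory; 6381 is just the redis port from the example above):

# current listen queue sizes (qlen/incqlen/maxqlen) for the redis listener
netstat -Lan | grep 6381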
Even kill -9 only yields this state:

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAN
  776 redis         2  20    0 24112K  2256K STOP    3  16:56   0.00% redis-

When I tcpdrop the connections of the process, tcpdrop reports success the first time and failure ("No such process") the second time, but the connections remain:

# sockstat -4 | grep 776
redis    redis-serv 776   6  tcp4   *:6381             *:*
redis    redis-serv 776   9  tcp4   *:16381            *:*
redis    redis-serv 776   10 tcp4   127.0.0.1:16381    127.0.0.1:10460
redis    redis-serv 776   11 tcp4   127.0.0.1:16381    127.0.0.1:35795
redis    redis-serv 776   13 tcp4   127.0.0.1:30027    127.0.0.1:16379
redis    redis-serv 776   14 tcp4   127.0.0.1:58802    127.0.0.1:16384
redis    redis-serv 776   17 tcp4   127.0.0.1:16381    127.0.0.1:24354
redis    redis-serv 776   18 tcp4   127.0.0.1:16381    127.0.0.1:56999
redis    redis-serv 776   19 tcp4   127.0.0.1:16381    127.0.0.1:39488
redis    redis-serv 776   20 tcp4   127.0.0.1:6381     127.0.0.1:39491

# sockstat -4 | grep 776 | awk '{print "tcpdrop "$6" "$7}' | /bin/sh
tcpdrop: getaddrinfo: * port 6381: hostname nor servname provided, or not known
tcpdrop: getaddrinfo: * port 16381: hostname nor servname provided, or not known
tcpdrop: 127.0.0.1 16381 127.0.0.1 10460: No such process
tcpdrop: 127.0.0.1 16381 127.0.0.1 35795: No such process
tcpdrop: 127.0.0.1 30027 127.0.0.1 16379: No such process
tcpdrop: 127.0.0.1 58802 127.0.0.1 16384: No such process
tcpdrop: 127.0.0.1 16381 127.0.0.1 24354: No such process
tcpdrop: 127.0.0.1 16381 127.0.0.1 56999: No such process
tcpdrop: 127.0.0.1 16381 127.0.0.1 39488: No such process
tcpdrop: 127.0.0.1 6381 127.0.0.1 39491: No such process

# sockstat -4 | grep 776
redis    redis-serv 776   6  tcp4   *:6381             *:*
redis    redis-serv 776   9  tcp4   *:16381            *:*
redis    redis-serv 776   10 tcp4   127.0.0.1:16381    127.0.0.1:10460
redis    redis-serv 776   11 tcp4   127.0.0.1:16381    127.0.0.1:35795
redis    redis-serv 776   13 tcp4   127.0.0.1:30027    127.0.0.1:16379
redis    redis-serv 776   14 tcp4   127.0.0.1:58802    127.0.0.1:16384
redis    redis-serv 776   17 tcp4   127.0.0.1:16381    127.0.0.1:24354
redis    redis-serv 776   18 tcp4   127.0.0.1:16381    127.0.0.1:56999
redis    redis-serv 776   19 tcp4   127.0.0.1:16381    127.0.0.1:39488
redis    redis-serv 776   20 tcp4   127.0.0.1:6381     127.0.0.1:39491

$ procstat -k 776
  PID    TID COMM             TDNAME           KSTACK
  776 100725 redis-server     -                mi_switch sleepq_timedwait_sig _sleep kern_kevent sys_kevent amd64_syscall Xfast_syscall
  776 100744 redis-server     -                mi_switch thread_suspend_switch thread_single exit1 sigexit postsig ast doreti_ast

Nothing I do gets the process out of this state; only a reboot helps.

The OS is stable/10@r289313, but I could observe this behaviour with earlier releases too.

The dmesg is full of lines like these:

sonewconn: pcb 0xfffff8004dc54498: Listen queue overflow: 193 already in queue awaiting acceptance (3142 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3068 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3057 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3037 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3015 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3035 occurrences)

I guess this is an effect of the process freeze, not the cause (the listen queue fills up because the app can't handle the incoming connections).
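In case anyone wants more data, this is roughly what I intend to capture the next time it happens, before rebooting (a sketch; flags from memory and may need adjusting):

# sample the kernel stacks of the stuck process a few times, to see whether
# the exiting thread ever gets past thread_suspend_switch/thread_single
for i in 1 2 3; do procstat -k 776; sleep 5; done

# process state and wait channel as ps sees it
ps -o pid,state,wchan,command -p 776

# system-wide cap on listen() backlogs, for comparison with the
# "193 already in queue" figure above
sysctl kern.ipc.somaxconn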
I'm not sure it matters, but some of the machines (including the one above) run on an ESX hypervisor. As far as I can remember I've seen this on physical machines too, but I'm not certain about that. Also, so far I have only seen it on machines running somewhat "exotic" stuff, like Java or Erlang based servers (opendj, elasticsearch and rabbitmq). I'm also not sure what triggers it: I've never seen it after only a few hours of uptime; at least several days or a week must pass before something gets stuck like the above.

Any ideas about this?

Thanks,