Subject: Re: Stuck processes in unkillable (STOP) state, listen queue overflow
From: "Nagy, Attila" <bra@fsn.hu>
To: Zara Kanaeva, freebsd-stable@freebsd.org
Date: Tue, 27 Oct 2015 17:25:01 +0100

Hi,

(following topposting)

I have seen this with 16 and 32 GiB of RAM, but anyway, it shouldn't
matter.
Do you use ZFS? It doesn't seem to be stuck on I/O, though (see the
procstat sketch below)...

On 10/27/15 14:42, Zara Kanaeva wrote:
> Hello,
>
> I have the same experience with apache and mapserver. It happens on a
> physical machine and ends with a spontaneous reboot. The machine was
> updated from FreeBSD 9.0-RELEASE to FreeBSD 10.2-PRERELEASE. Perhaps
> it doesn't have enough RAM (only 8 GB), but I don't think that should
> be a reason for a spontaneous reboot.
>
> I saw no such behaviour with the same machine and FreeBSD 9.0-RELEASE
> on it (I am not 100% sure; I have had no opportunity to test this yet).
>
> Regards, Z. Kanaeva.
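(On the I/O question: the per-thread wait channels are probably the
quickest check. A minimal sketch, reusing PID 776 from my original mail
quoted below; procstat's -t flag prints each thread's state and wchan:

  # per-thread state and wait channel of the stuck process
  procstat -t 776
  # a thread parked on a disk or ZFS wait channel would point at I/O;
  # a STOP state with no wait channel points back at the kill/exit path

This only confirms the "not stuck on I/O" impression, it's not a fix.)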
>
> Quoting "Nagy, Attila":
>
>> Hi,
>>
>> Recently I've started to see a lot of cases where the log is full of
>> "listen queue overflow" messages and the process behind the network
>> socket is unresponsive.
>> When I open a TCP connection to it, it opens, but nothing happens (for
>> example, I get no SMTP banner from postfix, nor do I get a log entry
>> about the new connection).
>>
>> I've seen this with Java programs, postfix and redis, basically
>> everything that opens a TCP socket and listens on the machine.
>>
>> For example, I have a redis process which listens on port 6381. When I
>> telnet into it, the TCP connection opens, but the program doesn't
>> respond.
>> When I kill it, nothing happens. Even kill -9 yields only this state:
>>  PID USERNAME THR PRI NICE   SIZE   RES STATE C  TIME  WCPU COMMAN
>>  776 redis      2  20    0 24112K 2256K STOP  3 16:56 0.00% redis-
>>
>> When I tcpdrop the connections of the process, tcpdrop reports success
>> on the first run and "No such process" on the second, but the
>> connections remain:
>> # sockstat -4 | grep 776
>> redis  redis-serv 776  6  tcp4  *:6381           *:*
>> redis  redis-serv 776  9  tcp4  *:16381          *:*
>> redis  redis-serv 776  10 tcp4  127.0.0.1:16381  127.0.0.1:10460
>> redis  redis-serv 776  11 tcp4  127.0.0.1:16381  127.0.0.1:35795
>> redis  redis-serv 776  13 tcp4  127.0.0.1:30027  127.0.0.1:16379
>> redis  redis-serv 776  14 tcp4  127.0.0.1:58802  127.0.0.1:16384
>> redis  redis-serv 776  17 tcp4  127.0.0.1:16381  127.0.0.1:24354
>> redis  redis-serv 776  18 tcp4  127.0.0.1:16381  127.0.0.1:56999
>> redis  redis-serv 776  19 tcp4  127.0.0.1:16381  127.0.0.1:39488
>> redis  redis-serv 776  20 tcp4  127.0.0.1:6381   127.0.0.1:39491
>> # sockstat -4 | grep 776 | awk '{print "tcpdrop "$6" "$7}' | /bin/sh
>> tcpdrop: getaddrinfo: * port 6381: hostname nor servname provided, or
>> not known
>> tcpdrop: getaddrinfo: * port 16381: hostname nor servname provided,
>> or not known
>> tcpdrop: 127.0.0.1 16381 127.0.0.1 10460: No such process
>> tcpdrop: 127.0.0.1 16381 127.0.0.1 35795: No such process
>> tcpdrop: 127.0.0.1 30027 127.0.0.1 16379: No such process
>> tcpdrop: 127.0.0.1 58802 127.0.0.1 16384: No such process
>> tcpdrop: 127.0.0.1 16381 127.0.0.1 24354: No such process
>> tcpdrop: 127.0.0.1 16381 127.0.0.1 56999: No such process
>> tcpdrop: 127.0.0.1 16381 127.0.0.1 39488: No such process
>> tcpdrop: 127.0.0.1 6381 127.0.0.1 39491: No such process
>> # sockstat -4 | grep 776
>> redis  redis-serv 776  6  tcp4  *:6381           *:*
>> redis  redis-serv 776  9  tcp4  *:16381          *:*
>> redis  redis-serv 776  10 tcp4  127.0.0.1:16381  127.0.0.1:10460
>> redis  redis-serv 776  11 tcp4  127.0.0.1:16381  127.0.0.1:35795
>> redis  redis-serv 776  13 tcp4  127.0.0.1:30027  127.0.0.1:16379
>> redis  redis-serv 776  14 tcp4  127.0.0.1:58802  127.0.0.1:16384
>> redis  redis-serv 776  17 tcp4  127.0.0.1:16381  127.0.0.1:24354
>> redis  redis-serv 776  18 tcp4  127.0.0.1:16381  127.0.0.1:56999
>> redis  redis-serv 776  19 tcp4  127.0.0.1:16381  127.0.0.1:39488
>> redis  redis-serv 776  20 tcp4  127.0.0.1:6381   127.0.0.1:39491
>>
>> $ procstat -k 776
>>  PID    TID COMM         TDNAME  KSTACK
>>  776 100725 redis-server -       mi_switch sleepq_timedwait_sig _sleep
>>    kern_kevent sys_kevent amd64_syscall Xfast_syscall
>>  776 100744 redis-server -       mi_switch thread_suspend_switch
>>    thread_single exit1 sigexit postsig ast doreti_ast
>>
>> There is nothing I can do to get out of this state; only a reboot
>> helps.
>>
>> The OS is stable/10@r289313, but I could observe this behaviour with
>> earlier releases too.
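(A side note on the tcpdrop pipeline above: the two getaddrinfo
failures are just the listening sockets, whose "*" wildcard tcpdrop
cannot resolve as an address. A sketch that feeds it only the connected
pairs, assuming PID 776 as above and using sockstat's -c flag to skip
the listeners:

  # emit "tcpdrop laddr:lport faddr:fport" for connected sockets only
  sockstat -4c | awk -v pid=776 '$3 == pid { print "tcpdrop " $6 " " $7 }' | sh

That only silences the noise, of course; tcpdrop answering "No such
process" while the connections persist is the broken state itself.)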
>>
>> The dmesg is full of lines like these:
>> sonewconn: pcb 0xfffff8004dc54498: Listen queue overflow: 193 already
>> in queue awaiting acceptance (3142 occurrences)
>> sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already
>> in queue awaiting acceptance (3068 occurrences)
>> sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already
>> in queue awaiting acceptance (3057 occurrences)
>> sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already
>> in queue awaiting acceptance (3037 occurrences)
>> sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already
>> in queue awaiting acceptance (3015 occurrences)
>> sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already
>> in queue awaiting acceptance (3035 occurrences)
>>
>> I guess this is an effect of the process freeze, not its cause (the
>> listen queue fills up because the app can't handle the incoming
>> connections).
>>
>> I'm not sure it matters, but some of the machines (including the one
>> above) run on an ESX hypervisor (as far as I can remember I have seen
>> this on physical machines too, but I'm not sure about that).
>> Also, so far I have only seen this where some "exotic" stuff runs,
>> like a Java- or Erlang-based server (opendj, elasticsearch and
>> rabbitmq).
>>
>> I'm also not sure what triggers it. I've never seen this after mere
>> hours of uptime; at least some days or a week must have passed before
>> things got stuck like the above.
>>
>> Any ideas about this?
>>
>> Thanks,
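One more data point on the overflow messages: the per-listener queue
depths can be watched directly, and the 193 figure is presumably just
the default backlog ceiling (sonewconn() tolerates roughly 1.5x the
backlog, and 1.5 * 128, the stock limit, is 192). A sketch, reusing
redis' port 6381:

  # qlen / incqlen / maxqlen for each listening socket on that port
  netstat -Lan | grep 6381
  # the system-wide backlog cap (kern.ipc.somaxconn is the older alias)
  sysctl kern.ipc.soacceptqueue

But as I wrote in the quoted mail, the overflow is almost certainly the
symptom, not the cause: the queue fills because the frozen process
never accepts.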