From owner-freebsd-stable@freebsd.org Mon Sep 26 08:51:21 2016 Return-Path: Delivered-To: freebsd-stable@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D8F1ABEA761 for ; Mon, 26 Sep 2016 08:51:21 +0000 (UTC) (envelope-from julien.charbon@gmail.com) Received: from mail-wm0-f48.google.com (mail-wm0-f48.google.com [74.125.82.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 761AE772 for ; Mon, 26 Sep 2016 08:51:21 +0000 (UTC) (envelope-from julien.charbon@gmail.com) Received: by mail-wm0-f48.google.com with SMTP id b130so136490679wmc.0 for ; Mon, 26 Sep 2016 01:51:21 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:cc:from:message-id:date :user-agent:mime-version:in-reply-to; bh=odatWaQZk6NST92pPnF+yH8MUfaCuLH1tH0z6ppfWRo=; b=buqmpIfGfsz+18Qqa7K0Ql1wwX85Ld9cLEXcB6ZA05MACvJ/T854/W1eSCr4Cbc+k2 3ns4FHdF4olOhNWTMN0/HJmLFlwKo9JdGKyDoa8w3Ihzd5dh/TYuRtnCrN1+xMsZWJlb WmfxyogiOrR1G74b1uIUYzi1VroQXxfqQF5QFQhBnu8sGE8h7ExKbpsOmk38mXozPyTM HKku7QYlsUefnlPiTcjOBRAfMTONWb18q8KOSWvJqQU2cDjcqKCaxqyzr+DmZ0O71hly nyf2IuFTuoSBUMAZcLF0p1qW+bO62b2dvd2MPgcTe5cGwIfCHWiFMa9CoWha0F7zGwfd sSYw== X-Gm-Message-State: AE9vXwPfgflWL3c07UY2gvd4OakGH7uZEOPLDyn4IRFYGLlA537XvScuFEITkiydeBxr4A== X-Received: by 10.194.115.38 with SMTP id jl6mr16632342wjb.28.1474879874002; Mon, 26 Sep 2016 01:51:14 -0700 (PDT) Received: from [10.100.64.20] ([217.30.88.44]) by smtp.gmail.com with ESMTPSA id 123sm10061023wmj.5.2016.09.26.01.51.13 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 26 Sep 2016 01:51:13 -0700 (PDT) Subject: Re: 11.0 stuck on high network load To: Slawa Olhovchenkov References: <20160919204328.GN2840@zxy.spb.ru> <8ba75d6e-4f01-895e-0aed-53c6c6692cb9@freebsd.org> <20160920202633.GQ2840@zxy.spb.ru> <20160921195155.GW2840@zxy.spb.ru> <20160922095331.GB2840@zxy.spb.ru> <67862b33-63c0-2f23-d254-5ddc55dbb554@freebsd.org> <20160922102045.GC2840@zxy.spb.ru> <20160923191656.GF2840@zxy.spb.ru> <20160925185815.GA6177@zxy.spb.ru> Cc: Konstantin Belousov , freebsd-stable@FreeBSD.org, hiren panchasara From: Julien Charbon Message-ID: Date: Mon, 26 Sep 2016 10:51:07 +0200 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.3.0 MIME-Version: 1.0 In-Reply-To: <20160925185815.GA6177@zxy.spb.ru> Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="TtwNGnolL41w7XTiubjCTpMUQ4E3GcXkQ" X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2016 08:51:21 -0000 This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --TtwNGnolL41w7XTiubjCTpMUQ4E3GcXkQ Content-Type: multipart/mixed; boundary="LOEuSf0BIKjR7HVfK3G906jP9JLEU0oiC"; protected-headers="v1" From: Julien Charbon To: Slawa Olhovchenkov Cc: Konstantin Belousov , freebsd-stable@FreeBSD.org, hiren panchasara Message-ID: Subject: Re: 11.0 stuck on high network load References: <20160919204328.GN2840@zxy.spb.ru> <8ba75d6e-4f01-895e-0aed-53c6c6692cb9@freebsd.org> <20160920202633.GQ2840@zxy.spb.ru> <20160921195155.GW2840@zxy.spb.ru> <20160922095331.GB2840@zxy.spb.ru> <67862b33-63c0-2f23-d254-5ddc55dbb554@freebsd.org> <20160922102045.GC2840@zxy.spb.ru> <20160923191656.GF2840@zxy.spb.ru> <20160925185815.GA6177@zxy.spb.ru> In-Reply-To: <20160925185815.GA6177@zxy.spb.ru> --LOEuSf0BIKjR7HVfK3G906jP9JLEU0oiC Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable On 9/25/16 8:58 PM, Slawa Olhovchenkov wrote: > On Fri, Sep 23, 2016 at 10:16:56PM +0300, Slawa Olhovchenkov wrote: >=20 >> On Thu, Sep 22, 2016 at 01:20:45PM +0300, Slawa Olhovchenkov wrote: >> >>> On Thu, Sep 22, 2016 at 12:04:40PM +0200, Julien Charbon wrote: >>> >>>>>> These paths can indeed compete for the same INP lock, as both >>>>>> tcp_tw_2msl_scan() calls always start with the first inp found in >>>>>> twq_2msl list. But in both cases, this first inp should be quickl= y used >>>>>> and its lock released anyway, thus that could explain your situati= on it >>>>>> that the TCP stack is doing that all the time, for example: >>>>>> >>>>>> - Let say that you are running out completely and constantly of t= cptw, >>>>>> and then all connections transitioning to TIME_WAIT state are comp= eting >>>>>> with the TIME_WAIT timeout scan that tries to free all the expired= >>>>>> tcptw. If the stack is doing that all the time, it can appear lik= e >>>>>> "live" locked. >>>>>> >>>>>> This is just an hypothesis and as usual might be a red herring. >>>>>> Anyway, could you run: >>>>>> >>>>>> $ vmstat -z | head -2; vmstat -z | grep -E 'tcp|sock' >>>>> >>>>> ITEM SIZE LIMIT USED FREE REQ FAIL = SLEEP >>>>> >>>>> socket: 864, 4192664, 18604, 25348,49276158, = 0, 0 >>>>> tcp_inpcb: 464, 4192664, 34226, 18702,49250593, = 0, 0 >>>>> tcpcb: 1040, 4192665, 18424, 18953,49250593, = 0, 0 >>>>> tcptw: 88, 16425, 15802, 623,14526919, 8= , 0 >>>>> tcpreass: 40, 32800, 15, 2285, 632381, 0= , 0 >>>>> >>>>> In normal case tcptw is about 16425/600/900 >>>>> >>>>> And after `sysctl -a | grep tcp` system stuck on serial console and= I am reset it. >>>>> >>>>>> Ideally, once when everything is ok, and once when you have the i= ssue >>>>>> to see the differences (if any). >>>>>> >>>>>> If it appears your are quite low in tcptw, and if you have enough= >>>>>> memory, could you try increase the tcptw limit using sysctl >>>>> >>>>> I think this is not eliminate stuck, just may do it less frequency >>>> >>>> You are right, it would just be a big hint that the tcp_tw_2msl_sca= n() >>>> contention hypothesis is the right one. As I see you have plenty of= >>>> memory on your server, thus could you try with: >>>> >>>> net.inet.tcp.maxtcptw=3D4192665 >>>> >>>> And see what happen. Just to validate this hypothesis. >>> >>> This is bad way for validate, with maxtcptw=3D16384 happened is rando= m >>> and can be waited for month. After maxtcptw=3D4192665 I am don't know= >>> how long need to wait for verification this hypothesis. >>> >>> More frequency (may be 3-5 times per day) happening less traffic drop= s >>> (not to zero for minutes). May be this caused also by contention in >>> tcp_tw_2msl_scan, but fast resolved (stochastic process). By eating >>> CPU power nginx can't service connection and clients closed >>> connections and need more TIME_WAIT and can trigered >>> tcp_tw_2msl_scan(reuse=3D1). After this we can got live lock. >>> >>> May be after I learning to catch and dignostic this validation is mor= e >>> accurately. >> >> Some more bits: >> >> socket: 864, 4192664, 30806, 790,28524160, 0, = 0 >> ipq: 56, 32802, 0, 1278, 1022, 0, = 0 >> udp_inpcb: 464, 4192664, 44, 364, 14066, 0, = 0 >> udpcb: 32, 4192750, 44, 3081, 14066, 0, = 0 >> tcp_inpcb: 464, 4192664, 38558, 378,28476709, 0, = 0 >> tcpcb: 1040, 4192665, 30690, 738,28476709, 0, = 0 >> tcptw: 88, 32805, 7868, 772, 8412249, 0, = 0 >> >> last pid: 49575; load averages: 2.00, 2.05, 3.75 up 1+01:12:08 = 22:13:42 >> 853 processes: 15 running, 769 sleeping, 35 waiting, 34 lock >> CPU 0: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% i= dle >> CPU 1: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> CPU 2: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> CPU 3: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> CPU 4: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> CPU 5: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> CPU 6: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% i= dle >> CPU 7: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> CPU 8: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> CPU 9: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> CPU 10: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% i= dle >> CPU 11: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% i= dle >> Mem: 8659M Active, 8385M Inact, 107G Wired, 1325M Free >> ARC: 99G Total, 88G MFU, 10G MRU, 32K Anon, 167M Header, 529M Other >> Swap: 32G Total, 32G Free >=20 > I can collect some more info. > First: >=20 > 1046 - - R 125:10.79 nginx: worker process (nginx) > 1047 kqread - I 150:11.98 nginx: worker process (nginx) > 1049 kqread - I 145:58.35 nginx: worker process (nginx) > 1050 kqread - I 136:33.36 nginx: worker process (nginx) > 1051 kqread - I 140:59.73 nginx: worker process (nginx) > 1052 kqread - I 137:18.12 nginx: worker process (nginx) >=20 > pid 1046 is nginx running on CPU0 (affinity mask set). >=20 > # procstat -k -k 1046 > PID TID COMM TDNAME KSTACK > 1046 100686 nginx - mi_switch+0xd2 critical_= exit+0x7e lapic_handle_timer+0xb1 Xtimerint+0x8c __mtx_lock_sleep+0x168 z= one_fetch_slab+0x47 zone_import+0x52 zone_alloc_item+0x36 keg_alloc_slab+= 0x63 keg_fetch_slab+0x16e zone_fetch_slab+0x6e zone_import+0x52 uma_zallo= c_arg+0x36e m_getm2+0x14f m_uiotombuf+0x64 sosend_generic+0x356 soo_write= +0x42 dofilewrite+0x87 >=20 > Tracing command nginx pid 1046 tid 100686 td 0xfffff8014485f500 > sched_switch() at 0xffffffff804c956d =3D sched_switch+0x6ad/frame 0xfff= ffe20216992a0 /usr/src/sys/kern/sched_ule.c:1973 > mi_switch() at 0xffffffff804a8d92 =3D mi_switch+0xd2/frame 0xfffffe2021= 6992d0 /usr/src/sys/kern/kern_synch.c:465 > critical_exit() at 0xffffffff804a6bee =3D critical_exit+0x7e/frame 0xff= fffe20216992f0 /usr/src/sys/kern/kern_switch.c:219 > lapic_handle_timer() at 0xffffffff80771701 =3D lapic_handle_timer+0xb1/= frame 0xfffffe2021699330 /usr/src/sys/x86/x86/local_apic.c:1185 > Xtimerint() at 0xffffffff806cbbcc =3D Xtimerint+0x8c/frame 0xfffffe2021= 699330 /usr/src/sys/amd64/amd64/apic_vector.S:135 > --- interrupt, rip =3D 0xffffffff804de424, rsp =3D 0xfffffe2021699400, = rbp =3D 0xfffffe2021699420 --- > lock_delay() at 0xffffffff804de424 =3D lock_delay+0x54/frame 0xfffffe20= 21699420 /usr/src/sys/kern/subr_lock.c:127 > __mtx_lock_sleep() at 0xffffffff80484dc8 =3D __mtx_lock_sleep+0x168/fra= me 0xfffffe20216994a0 /usr/src/sys/kern/kern_mutex.c:512 > zone_fetch_slab() at 0xffffffff806a4257 =3D zone_fetch_slab+0x47/frame = 0xfffffe20216994e0 /usr/src/sys/vm/uma_core.c:2378 > zone_import() at 0xffffffff806a4312 =3D zone_import+0x52/frame 0xfffffe= 2021699530 /usr/src/sys/vm/uma_core.c:2501 > zone_alloc_item() at 0xffffffff806a0986 =3D zone_alloc_item+0x36/frame = 0xfffffe2021699570 /usr/src/sys/vm/uma_core.c:2591 > keg_alloc_slab() at 0xffffffff806a2463 =3D keg_alloc_slab+0x63/frame 0x= fffffe20216995d0 /usr/src/sys/vm/uma_core.c:965 > keg_fetch_slab() at 0xffffffff806a48ce =3D keg_fetch_slab+0x16e/frame 0= xfffffe2021699620 /usr/src/sys/vm/uma_core.c:2349 > zone_fetch_slab() at 0xffffffff806a427e =3D zone_fetch_slab+0x6e/frame = 0xfffffe2021699660 /usr/src/sys/vm/uma_core.c:2375 > zone_import() at 0xffffffff806a4312 =3D zone_import+0x52/frame 0xfffffe= 20216996b0 /usr/src/sys/vm/uma_core.c:2501=20 > uma_zalloc_arg() at 0xffffffff806a147e =3D uma_zalloc_arg+0x36e/frame 0= xfffffe2021699720 /usr/src/sys/vm/uma_core.c:2531 > m_getm2() at 0xffffffff8048231f =3D m_getm2+0x14f/frame 0xfffffe2021699= 790 /usr/src/sys/kern/kern_mbuf.c:830 > m_uiotombuf() at 0xffffffff80516044 =3D m_uiotombuf+0x64/frame 0xfffffe= 20216997e0 /usr/src/sys/kern/uipc_mbuf.c:1535 > sosend_generic() at 0xffffffff8051ce56 =3D sosend_generic+0x356/frame 0= xfffffe20216998a0 > soo_write() at 0xffffffff804fd872 =3D soo_write+0x42/frame 0xfffffe2021= 6998d0 > dofilewrite() at 0xffffffff804f5c97 =3D dofilewrite+0x87/frame 0xfffffe= 2021699920 > kern_writev() at 0xffffffff804f5978 =3D kern_writev+0x68/frame 0xfffffe= 2021699970 > sys_writev() at 0xffffffff804f5be6 =3D sys_writev+0x36/frame 0xfffffe20= 216999a0 > amd64_syscall() at 0xffffffff806e4051 =3D amd64_syscall+0x2c1/frame 0xf= ffffe2021699ab0 > Xfast_syscall() at 0xffffffff806cb2bb =3D Xfast_syscall+0xfb/frame 0xff= fffe2021699ab0 > --- syscall (121, FreeBSD ELF64, sys_writev), rip =3D 0x8019cc6ba, rsp = =3D 0x7fffffffd688, rbp =3D 0x7fffffffd6c0 --- This call stack is quite interesting: 1: A process is calling writev() 2: Kernel calls sosend_generic() that starts allocating memory 3: This allocation is then interrupted by the timer interrupt handler [that could actually trigger tcp_tw_2msl_scan(reuse=3D0)] 4: The timer interrupt handler seems to wait on sched_switch() And fun fact: When sosend_generic() calls m_uiotombuf() it does not hold INP_WLOCK yet... -- Julien --LOEuSf0BIKjR7HVfK3G906jP9JLEU0oiC-- --TtwNGnolL41w7XTiubjCTpMUQ4E3GcXkQ Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Comment: GPGTools - https://gpgtools.org iQEcBAEBCgAGBQJX6OF/AAoJEKVlQ5Je6dhxPtgIAOkdrwJWREyUwwRewrVTz/GR eohxzAM6NBuw7shSEvt/8qgaST8ClzxzuGQgrAf2+hevxNwa4Fo2cGqyfRdBp57v EeDKk2N/zJsiSCRr7HMbj/FWX+Nioa8IvxiABZx9SIhdTPtYD9lIUgSSrNEGWUrQ qw5VA+/i0ufm0pqLr8onZFSauZPRduHb2hpZFnCaxrNK3CeG2V6xYziP8f4anjyy F5P+exeiQ+iPZZ50+E49u2xOnnla3LwnoEO9MTZ/lSjLrVw+cc0L8HHfsFb+euA5 TgamvpQ9eSkx3yQsn1sc2ejc8+mvLQPz0eMLv1CrgCcbz0pHc11WAnVyrhJzc28= =i032 -----END PGP SIGNATURE----- --TtwNGnolL41w7XTiubjCTpMUQ4E3GcXkQ--