From owner-freebsd-net@freebsd.org Mon Jun 20 07:39:24 2016 Return-Path: Delivered-To: freebsd-net@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1BA30A7A93E for ; Mon, 20 Jun 2016 07:39:24 +0000 (UTC) (envelope-from glebius@FreeBSD.org) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 07486174F for ; Mon, 20 Jun 2016 07:39:24 +0000 (UTC) (envelope-from glebius@FreeBSD.org) Received: by mailman.ysv.freebsd.org (Postfix) id 0056AA7A93C; Mon, 20 Jun 2016 07:39:24 +0000 (UTC) Delivered-To: net@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id F3A80A7A93A; Mon, 20 Jun 2016 07:39:23 +0000 (UTC) (envelope-from glebius@FreeBSD.org) Received: from cell.glebi.us (glebi.us [96.95.210.25]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "cell.glebi.us", Issuer "cell.glebi.us" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id E0BC0174C; Mon, 20 Jun 2016 07:39:23 +0000 (UTC) (envelope-from glebius@FreeBSD.org) Received: from cell.glebi.us (localhost [127.0.0.1]) by cell.glebi.us (8.15.2/8.15.2) with ESMTPS id u5K7dHNO093644 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Mon, 20 Jun 2016 00:39:17 -0700 (PDT) (envelope-from glebius@FreeBSD.org) Received: (from glebius@localhost) by cell.glebi.us (8.15.2/8.15.2/Submit) id u5K7dHwj093643; Mon, 20 Jun 2016 00:39:17 -0700 (PDT) (envelope-from glebius@FreeBSD.org) X-Authentication-Warning: cell.glebi.us: glebius set sender to glebius@FreeBSD.org using -f Date: Mon, 20 Jun 2016 00:39:17 -0700 From: Gleb Smirnoff To: rrs@FreeBSD.org Cc: Julien Charbon , hselasky@FreeBSD.org, rrs@FreeBSD.org, net@FreeBSD.org, current@FreeBSD.org Subject: Re: panic with tcp timers Message-ID: <20160620073917.GI1076@FreeBSD.org> References: <20160617045319.GE1076@FreeBSD.org> <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 20 Jun 2016 07:39:24 -0000 Hi! On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: J> > Comparing stable/10 and head, I see two changes that could J> > affect that: J> > J> > - callout_async_drain J> > - switch to READ lock for inp info in tcp timers J> > J> > That's why you are in To, Julien and Hans :) J> > J> > We continue investigating, and I will keep you updated. J> > However, any help is welcome. I can share cores. Now, spending some time with cores and adding a bunch of extra CTRs, I have a sequence of events that lead to the panic. In short, the bug is in the callout system. It seems to be not relevant to the callout_async_drain, at least for now. The transition to READ lock unmasked the problem, that's why NetflixBSD 10 doesn't panic. The panic requires heavy contention on the TCP info lock. [CPU 1] the callout fires, tcp_timer_keep entered [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo); [CPU 2] schedules the callout [CPU 2] tcp_discardcb called [CPU 2] callout successfully canceled [CPU 2] tcpcb freed [CPU 1] unblocks... panic When the lock was WLOCK, all contenders were resumed in a sequence they came to the lock. Now, that they are readers, once the lock is released, readers are resumed in a "random" order, and this allows tcp_discardcb to go before the old running callout, and this unmasks the panic. Exact quote from ktrdump: 0xfffff82089867418 <-- faulty tcpcb 0xfffff82089867728 <-- its tt_keep glebius@piston:~/cores:|>grep 0xfffff82089867418 ktr 19012192 11 scheduled 0xfffff820898677a8 func 0xffffffff80628740 arg 0xfffff82089867418 in 20553.1e3ff819 19012190 11 rescheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20613.04a6193d 19009551 11 rescheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20613.042306cf 19009549 11 scheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20563.0423409f 19009545 11 tcp_newtcpcb: 0xfffff82089867418 18962842 11 tcp_discardcb: free 0xfffff82089867418 18962830 11 tcp_discardcb: 0xfffff82089867418 draincnt 0 18962826 11 failed to stop 0xfffff820898677a8 func 0xffffffff80628740 arg 0xfffff82089867418 18962822 11 cancelled3 0xfffff82089867768 func 0xffffffff806288e0 arg 0xfffff82089867418 18962808 11 cancelled3 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 18962804 11 failed to stop 0xfffff820898676e8 func 0xffffffff80628fa0 arg 0xfffff82089867418 18962792 11 failed to stop 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418 18962755 11 tcp_discardcb: 0xfffff82089867418 18962742 11 scheduled 0xfffff82089867768 func 0xffffffff806288e0 arg 0xfffff82089867418 in 20632.ffc8d308 18962703 11 cancelled3 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418 18962695 11 scheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20612.ffc8ea28 18923275 6 tcp_timer_keep: 0xfffff82089867418 18923274 6 callout 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 17850361 9 scheduled 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418 in 20553.5aec0004 -- Totus tuus, Glebius.