Date: Mon, 20 Jun 2016 13:48:19 +0300
From: Konstantin Belousov
To: Julien Charbon
Cc: Gleb Smirnoff, rrs@FreeBSD.org, current@FreeBSD.org, hselasky@FreeBSD.org,
 net@FreeBSD.org
Subject: Re: panic with tcp timers
Message-ID: <20160620104819.GV38613@kib.kiev.ua>
In-Reply-To: <1d18d0e2-3e42-cb26-928c-2989d0751884@freebsd.org>

On Mon, Jun 20, 2016 at 11:55:55AM +0200, Julien Charbon wrote:
>
>  Hi,
>
> On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
> > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
> > J> > Comparing stable/10 and head, I see two changes that could
> > J> > affect that:
> > J> >
> > J> > - callout_async_drain
> > J> > - switch to READ lock for inp info in tcp timers
> > J> >
> > J> > That's why you are in To, Julien and Hans :)
> > J> >
> > J> > We continue investigating, and I will keep you updated.
> > J> > However, any help is welcome. I can share cores.
> >
> > Now, spending some time with cores and adding a bunch of
> > extra CTRs, I have a sequence of events that lead to the
> > panic. In short, the bug is in the callout system. It seems
> > to be not relevant to the callout_async_drain, at least for
> > now. The transition to READ lock unmasked the problem, that's
> > why NetflixBSD 10 doesn't panic.
> >
> > The panic requires heavy contention on the TCP info lock.
> >
> > [CPU 1] the callout fires, tcp_timer_keep entered
> > [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
> > [CPU 2] schedules the callout
> > [CPU 2] tcp_discardcb called
> > [CPU 2] callout successfully canceled
> > [CPU 2] tcpcb freed
> > [CPU 1] unblocks... panic
> >
> > When the lock was WLOCK, all contenders were resumed in a
> > sequence they came to the lock. Now, that they are readers,
> > once the lock is released, readers are resumed in a "random"
> > order, and this allows tcp_discardcb to go before the old
> > running callout, and this unmasks the panic.
>
>  Highly interesting. I should be able to reproduce that (will be useful
> for testing the corresponding fix).
>
>  Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1
> (success) here) when the callout cancellation is a success _but_ the
> callout is current running, that should fix it.
>
>  For the history: It comes back to my old callout question:
>
>  Does _callout_stop_safe() is allowed to return 1 (success) even if the
> callout is still currently running; a.k.a. it is not because you
> successfully cancelled a callout that the callout is not currently running.
>
>  We did propose a patch to make _callout_stop_safe() returns 0 (fail)
> when the callout is currently running:
>
> callout_stop() should return 0 when the callout is currently being
> serviced and indeed unstoppable
> https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most
>
>  But this change impacted too many old code paths and was interesting
> only for TCP timers and thus was abandoned.

Look at the CS_MIGRBLOCK flag of callout_stop and at the fix in
sleepq_check_timeout(). Or, at least, do not allow this use of
callout_stop() to rot again, after the previous dozen regressions and
fixes there.
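
For illustration only, the event sequence quoted above can be replayed in
a small single-threaded userland model. This is a sketch under stated
assumptions, not kernel code: every toy_* name is hypothetical, the two
CPUs are simulated by calling the steps in the quoted order, and the
discard path is assumed to free the tcpcb only when the stop routine
reports success. toy_stop_reported models the return value described in
the thread; toy_stop_proposed models the fix proposal (return 0 while
the handler is still running).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; none of these names exist in the FreeBSD kernel. */
struct toy_callout {
	bool pending;	/* armed, handler not yet started */
	bool running;	/* handler entered, e.g. blocked on a lock */
};

struct toy_tcpcb {
	struct toy_callout keep;	/* stands in for the keepalive timer */
	bool freed;
};

/* "CPU 1": the callout fires and the handler starts running. */
static void toy_fire(struct toy_callout *c)
{
	assert(c->pending);
	c->pending = false;
	c->running = true;
}

/* "CPU 2": the callout is scheduled again while the handler runs. */
static void toy_reset(struct toy_callout *c)
{
	c->pending = true;
}

/* Reported behaviour: 1 whenever a pending instance was cancelled. */
static int toy_stop_reported(struct toy_callout *c)
{
	int cancelled = c->pending ? 1 : 0;

	c->pending = false;
	return cancelled;
}

/* Proposed behaviour: 0 if the handler is still running. */
static int toy_stop_proposed(struct toy_callout *c)
{
	bool was_pending = c->pending;

	c->pending = false;
	return (was_pending && !c->running) ? 1 : 0;
}

/* Assumed discard path: free only when the stop reports success. */
static void toy_discard(struct toy_tcpcb *tp,
    int (*stop)(struct toy_callout *))
{
	if (stop(&tp->keep) == 1)
		tp->freed = true;
}

/* Replays the quoted sequence; true means the handler would resume
 * on a tcpcb that has already been freed. */
static bool toy_replay(int (*stop)(struct toy_callout *))
{
	struct toy_tcpcb tp = { .keep = { true, false }, .freed = false };

	toy_fire(&tp.keep);	/* CPU 1: fires, blocks on the lock */
	toy_reset(&tp.keep);	/* CPU 2: schedules the callout */
	toy_discard(&tp, stop);	/* CPU 2: discard, stop, maybe free */
	return tp.keep.running && tp.freed;	/* CPU 1: unblocks */
}
```

In this model toy_replay(toy_stop_reported) returns true (the handler
resumes on a freed tcpcb), while toy_replay(toy_stop_proposed) returns
false because the free is not performed while the handler is running.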