Subject: Re: panic with tcp timers
From: Julien Charbon
To: Gleb Smirnoff, rrs@FreeBSD.org
Cc: hselasky@FreeBSD.org, net@FreeBSD.org, current@FreeBSD.org
Date: Mon, 20 Jun 2016 11:55:55 +0200
Message-ID: <1d18d0e2-3e42-cb26-928c-2989d0751884@freebsd.org>
In-Reply-To: <20160620073917.GI1076@FreeBSD.org>
References: <20160617045319.GE1076@FreeBSD.org> <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org> <20160620073917.GI1076@FreeBSD.org>
List-Id: Networking and TCP/IP with FreeBSD

 Hi,

On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
> J> > Comparing stable/10 and head, I see two changes that could
> J> > affect that:
> J> >
> J> > - callout_async_drain
> J> > - switch to READ lock for inp info in tcp timers
> J> >
> J> > That's why you are in To, Julien and Hans :)
> J> >
> J> > We continue investigating, and I will keep you updated.
> J> > However, any help is welcome. I can share cores.
>
> Now, spending some time with cores and adding a bunch of
> extra CTRs, I have a sequence of events that lead to the
> panic. In short, the bug is in the callout system. It seems
> to be not relevant to the callout_async_drain, at least for
> now. The transition to READ lock unmasked the problem, that's
> why NetflixBSD 10 doesn't panic.
>
> The panic requires heavy contention on the TCP info lock.
>
> [CPU 1] the callout fires, tcp_timer_keep entered
> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
> [CPU 2] schedules the callout
> [CPU 2] tcp_discardcb called
> [CPU 2] callout successfully canceled
> [CPU 2] tcpcb freed
> [CPU 1] unblocks... panic
>
> When the lock was WLOCK, all contenders were resumed in a
> sequence they came to the lock. Now, that they are readers,
> once the lock is released, readers are resumed in a "random"
> order, and this allows tcp_discardcb to go before the old
> running callout, and this unmasks the panic.

 Highly interesting. I should be able to reproduce that (which will be
useful for testing the corresponding fix).

 Fix proposal: if callout_async_drain() returns 0 (fail) instead of
1 (success) when the callout cancellation succeeds _but_ the callout is
currently running, that should fix it.

 For the record, this comes back to my old callout question:

 Is _callout_stop_safe() allowed to return 1 (success) even if the
callout is still currently running? In other words,
successfully cancelling a callout does not guarantee that the callout is
not currently running.

 We did propose a patch to make _callout_stop_safe() return 0 (fail)
when the callout is currently running:

callout_stop() should return 0 when the callout is currently being
serviced and indeed unstoppable
https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most

 But this change impacted too many old code paths and was of interest
only for TCP timers, so it was abandoned.

 My 2 cents.

--
Julien
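
 P.S. To make the event sequence and the fix proposal above concrete,
here is a minimal user-space sketch in C. It is not kernel code and does
not use the real callout(9) API: struct model_callout,
model_stop_current(), model_stop_proposed() and replay() are hypothetical
names invented for this illustration, and the rule "free the tcpcb only
when the stop call returns 1" is a simplifying assumption about the
caller. The replay is single-threaded and only mirrors the order of the
[CPU 1] / [CPU 2] steps quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a callout's state; NOT the kernel's struct callout. */
struct model_callout {
    bool pending;   /* scheduled, handler not started yet */
    bool running;   /* handler entered (possibly blocked on a lock) */
};

/*
 * Behavior described in the mail: report 1 (success) whenever a pending
 * instance was cancelled, even if an older instance is still running.
 */
static int model_stop_current(struct model_callout *c)
{
    int cancelled = c->pending ? 1 : 0;

    c->pending = false;
    return cancelled;
}

/*
 * Proposed behavior: still cancel the pending instance, but report
 * 0 (fail) if the handler is currently running.
 */
static int model_stop_proposed(struct model_callout *c)
{
    int cancelled = c->pending ? 1 : 0;

    c->pending = false;
    if (c->running)
        return 0;
    return cancelled;
}

/*
 * Replay the quoted sequence with the given stop function. Returns true
 * if the tcpcb ends up freed while the old handler is still running,
 * i.e. the use-after-free that leads to the panic.
 */
static bool replay(int (*stop)(struct model_callout *))
{
    struct model_callout c = { false, false };
    bool tcpcb_freed = false;

    assert(stop != NULL);

    c.running = true;           /* [CPU 1] callout fires, handler blocks on the lock */
    c.pending = true;           /* [CPU 2] schedules the callout again */
    if (stop(&c) == 1)          /* [CPU 2] tcp_discardcb stops the timer */
        tcpcb_freed = true;     /* [CPU 2] assumed: free only on success */

    /* [CPU 1] unblocks here and would touch the tcpcb. */
    return tcpcb_freed && c.running;
}
```

 In this model, replay(model_stop_current) returns true (the tcpcb is
freed under the running handler), while replay(model_stop_proposed)
returns false, because the 0 return makes the caller defer the free.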