Subject: Re: panic with tcp timers
To: Julien Charbon, Larry Rosenman
References: <20160617045319.GE1076@FreeBSD.org>
 <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org>
 <20160620073917.GI1076@FreeBSD.org>
 <1d18d0e2-3e42-cb26-928c-2989d0751884@freebsd.org>
 <548bf673-580d-350a-9f91-88553f3c82f1@freebsd.org>
Cc: Gleb Smirnoff, rrs@freebsd.org, net@freebsd.org, current@freebsd.org,
 owner-freebsd-current@freebsd.org
From: Hans Petter Selasky
Message-ID: <3c38de84-1e69-4321-f8b8-3c259d689b6d@selasky.org>
Date: Thu, 21 Jul 2016 10:05:48 +0200
In-Reply-To: <548bf673-580d-350a-9f91-88553f3c82f1@freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD

On 07/21/16 09:54, Julien Charbon wrote:
>
> Hi,
>
> On 7/14/16 11:02 PM, Larry Rosenman wrote:
>> On 2016-07-14 12:01, Julien Charbon wrote:
>>> On 6/20/16 11:55 AM, Julien Charbon wrote:
>>>> On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
>>>>> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
>>>>> J> > Comparing stable/10 and head, I see two changes that could
>>>>> J> > affect that:
>>>>> J> >
>>>>> J> > - callout_async_drain
>>>>> J> > - switch to READ lock for inp info in tcp timers
>>>>> J> >
>>>>> J> > That's why you are in To, Julien and Hans :)
>>>>> J> >
>>>>> J> > We continue investigating, and I will keep you updated.
>>>>> J> > However, any help is welcome. I can share cores.
>>>>>
>>>>> Now, spending some time with cores and adding a bunch of
>>>>> extra CTRs, I have a sequence of events that lead to the
>>>>> panic. In short, the bug is in the callout system. It seems
>>>>> to be not relevant to the callout_async_drain, at least for
>>>>> now. The transition to READ lock unmasked the problem, that's
>>>>> why NetflixBSD 10 doesn't panic.
>>>>>
>>>>> The panic requires heavy contention on the TCP info lock.
>>>>>
>>>>> [CPU 1] the callout fires, tcp_timer_keep entered
>>>>> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
>>>>> [CPU 2] schedules the callout
>>>>> [CPU 2] tcp_discardcb called
>>>>> [CPU 2] callout successfully canceled
>>>>> [CPU 2] tcpcb freed
>>>>> [CPU 1] unblocks... panic
>>>>>
>>>>> When the lock was WLOCK, all contenders were resumed in a
>>>>> sequence they came to the lock. Now, that they are readers,
>>>>> once the lock is released, readers are resumed in a "random"
>>>>> order, and this allows tcp_discardcb to go before the old
>>>>> running callout, and this unmasks the panic.
>>>>
>>>> Highly interesting. I should be able to reproduce that (will be useful
>>>> for testing the corresponding fix).
>>>
>>> Finally, I was able to reproduce it (without glebius fix). The trick
>>> was to really lower TCP keep timer expiration:
>>>
>>> $ sysctl -a | grep tcp.keep
>>> net.inet.tcp.keepidle: 7200000
>>> net.inet.tcp.keepintvl: 75000
>>> net.inet.tcp.keepinit: 75000
>>> net.inet.tcp.keepcnt: 8
>>> $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl
>>> net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
>>> Password:
>>> net.inet.tcp.keepidle: 7200000 -> 10
>>> net.inet.tcp.keepintvl: 75000 -> 50
>>> net.inet.tcp.keepinit: 75000 -> 10
>>>
>>> Note: It will certainly close all your ssh connections to the tested
>>> server.
>>>
>>> Now I will test in order:
>>>
>>> #1. glebius fix
>>> https://svnweb.freebsd.org/base?view=revision&revision=302350
>>>
>>> #2. rss extra fix
>>> https://reviews.freebsd.org/D7135
>>>
>>> #3. rrs TCP Timer cleanup
>>> https://reviews.freebsd.org/D7136
>>
>> please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
>
> My tests result so far:
>
> #1. r302350: First glebius TCP timer fix: No more TCP timer kernel
> panic during 48h under 200k TCP query per second load.
>
> Sadly I was unable to reproduce the issue described here:
>
> panic: bogus refcnt 0 on lle 0xfffff80004608c00
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
>
> #2. r303098: Got all kernel callout changes since r302350, (updates on
> callout code are indeed always full of surprises):
> https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log&pathrev=303098
>
> No kernel panic either.
>
> Still to test:
>
> #3. rss extra fix (if still relevant now)
> https://reviews.freebsd.org/D7135
>
> #4. rrs TCP Timer cleanup:
> https://reviews.freebsd.org/D7136
>
> My 2 cents.
>

Hi,

You should also check for memory leaks using "vmstat -m".

--HPS
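
One possible way to do the "vmstat -m" check is to save a snapshot before and after the load test and compare the per-type InUse counts. The helper below is only a sketch: the name leak_report and the file paths are hypothetical, and the parsing assumes the usual FreeBSD column order in which InUse is immediately followed by a MemUse value ending in "K". A growing count is a hint to look closer, not proof of a leak.

```shell
# leak_report BEFORE AFTER   (hypothetical helper, not part of FreeBSD)
# Print every malloc type whose InUse count is higher in AFTER than in BEFORE.
# Both arguments are files holding saved "vmstat -m" output.
#
# Usage on the test machine:
#   vmstat -m > /tmp/before.txt
#   ... run the load test ...
#   vmstat -m > /tmp/after.txt
#   leak_report /tmp/before.txt /tmp/after.txt
leak_report() {
    awk '
    # Locate the InUse column: the first all-digit field that is directly
    # followed by a MemUse field such as "12K" (assumed layout). Everything
    # to its left is the type name, which may contain spaces.
    function parse(    i, j) {
        for (i = 2; i < NF; i++) {
            if ($i ~ /^[0-9]+$/ && $(i + 1) ~ /^[0-9]+K$/) {
                name = $1
                for (j = 2; j < i; j++)
                    name = name " " $j
                count = $i + 0
                return 1
            }
        }
        return 0
    }
    # Skip the header and any line that does not match the assumed layout.
    !parse() { next }
    # First file: remember the baseline count for each type.
    FILENAME == ARGV[1] { before[name] = count; next }
    # Second file: report types that grew (types absent before count as 0).
    count > before[name] { printf "%s: %d -> %d\n", name, before[name], count }
    ' "$1" "$2"
}
```

Types that only grow while connections are open should drop back once the load stops, so it is worth taking the second snapshot after the test traffic has fully drained.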