Date:      Thu, 21 Jul 2016 10:05:48 +0200
From:      Hans Petter Selasky <hps@selasky.org>
To:        Julien Charbon <jch@freebsd.org>, Larry Rosenman <ler@lerctr.org>
Cc:        Gleb Smirnoff <glebius@freebsd.org>, rrs@freebsd.org, net@freebsd.org, current@freebsd.org, owner-freebsd-current@freebsd.org
Subject:   Re: panic with tcp timers
Message-ID:  <3c38de84-1e69-4321-f8b8-3c259d689b6d@selasky.org>
In-Reply-To: <548bf673-580d-350a-9f91-88553f3c82f1@freebsd.org>
References:  <20160617045319.GE1076@FreeBSD.org> <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org> <20160620073917.GI1076@FreeBSD.org> <1d18d0e2-3e42-cb26-928c-2989d0751884@freebsd.org> <dbb33989-538a-69e8-7243-26c554da266c@freebsd.org> <eb862d55795687387e22f0dd83e9f3d2@thebighonker.lerctr.org> <548bf673-580d-350a-9f91-88553f3c82f1@freebsd.org>

On 07/21/16 09:54, Julien Charbon wrote:
>
>  Hi,
>
> On 7/14/16 11:02 PM, Larry Rosenman wrote:
>> On 2016-07-14 12:01, Julien Charbon wrote:
>>> On 6/20/16 11:55 AM, Julien Charbon wrote:
>>>> On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
>>>>> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
>>>>> J> > Comparing stable/10 and head, I see two changes that could
>>>>> J> > affect that:
>>>>> J> >
>>>>> J> > - callout_async_drain
>>>>> J> > - switch to READ lock for inp info in tcp timers
>>>>> J> >
>>>>> J> > That's why you are in To, Julien and Hans :)
>>>>> J> >
>>>>> J> > We continue investigating, and I will keep you updated.
>>>>> J> > However, any help is welcome. I can share cores.
>>>>>
>>>>> Now, after spending some time with cores and adding a bunch of
>>>>> extra CTRs, I have a sequence of events that leads to the
>>>>> panic. In short, the bug is in the callout system. It does not
>>>>> seem to be related to callout_async_drain, at least for
>>>>> now. The transition to READ lock unmasked the problem, which is
>>>>> why NetflixBSD 10 doesn't panic.
>>>>>
>>>>> The panic requires heavy contention on the TCP info lock.
>>>>>
>>>>> [CPU 1] the callout fires, tcp_timer_keep entered
>>>>> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
>>>>> [CPU 2] schedules the callout
>>>>> [CPU 2] tcp_discardcb called
>>>>> [CPU 2] callout successfully canceled
>>>>> [CPU 2] tcpcb freed
>>>>> [CPU 1] unblocks... panic
>>>>>
>>>>> When the lock was a WLOCK, all contenders were resumed in the
>>>>> order they arrived at the lock. Now that they are readers,
>>>>> once the lock is released, readers are resumed in a "random"
>>>>> order, and this allows tcp_discardcb to go before the old
>>>>> running callout, and this unmasks the panic.
>>>>
>>>>  Highly interesting.  I should be able to reproduce that (will be useful
>>>> for testing the corresponding fix).
>>>
>>>  Finally, I was able to reproduce it (without glebius's fix).  The trick
>>> was to drastically lower the TCP keepalive timer expiration:
>>>
>>> $ sysctl -a | grep tcp.keep
>>> net.inet.tcp.keepidle: 7200000
>>> net.inet.tcp.keepintvl: 75000
>>> net.inet.tcp.keepinit: 75000
>>> net.inet.tcp.keepcnt: 8
>>> $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
>>> Password:
>>> net.inet.tcp.keepidle: 7200000 -> 10
>>> net.inet.tcp.keepintvl: 75000 -> 50
>>> net.inet.tcp.keepinit: 75000 -> 10
>>>
>>>  Note: It will certainly close all your ssh connections to the tested
>>> server.
>>>
>>>  Now I will test in order:
>>>
>>> #1. glebius fix
>>> https://svnweb.freebsd.org/base?view=revision&revision=302350
>>>
>>> #2. rrs extra fix
>>> https://reviews.freebsd.org/D7135
>>>
>>> #3. rrs TCP Timer cleanup
>>> https://reviews.freebsd.org/D7136
>>
>> please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
>
>  My test results so far:
>
> #1. r302350:  First glebius TCP timer fix:  No more TCP timer kernel
> panic during 48h under a load of 200k TCP queries per second.
>
>  Sadly I was unable to reproduce the issue described here:
>
> panic: bogus refcnt 0 on lle 0xfffff80004608c00
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
>
> #2. r303098:  Picked up all kernel callout changes since r302350 (updates
> to the callout code are indeed always full of surprises):
> https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log&pathrev=303098
>
>  No kernel panic either.
>
>  Still to test:
>
> #3. rrs extra fix (if still relevant now)
> https://reviews.freebsd.org/D7135
>
> #4. rrs TCP Timer cleanup:
> https://reviews.freebsd.org/D7136
>
>  My 2 cents.
>

Hi,

You should also check for memory leaks using "vmstat -m".
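
One way to do that is to save a "vmstat -m" snapshot before and after the load test and compare the InUse counts. Below is a minimal sketch. The helper name malloc_growth and the snapshot file names are hypothetical, and it assumes the usual column layout where the type name is followed by InUse and then MemUse (a number with a "K" suffix). Adjust the parsing if your release prints something different.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): list malloc types whose InUse
# count grew between two saved "vmstat -m" snapshots.
# Assumed line layout: <type name, may contain spaces> <InUse> <MemUse>K ...
malloc_growth() {
    awk '
    {
        type = ""; inuse = -1
        # InUse is taken as the first all-digit field that is directly
        # followed by a "<digits>K" field (MemUse); everything before it
        # is the type name.
        for (i = 2; i < NF; i++) {
            if ($i ~ /^[0-9]+$/ && $(i + 1) ~ /^[0-9]+K$/) {
                inuse = $i
                type = $1
                for (j = 2; j < i; j++)
                    type = type " " $j
                break
            }
        }
        if (inuse < 0)
            next                        # header or unparsable line
        if (FNR == NR) {                # first file: remember the baseline
            before[type] = inuse
            next
        }
        # Second file: report types that grew (new types count from 0).
        if (inuse + 0 > before[type] + 0)
            printf "%s %d -> %d\n", type, before[type], inuse
    }' "$1" "$2"
}
```

Typical use would be "vmstat -m > before.txt", run the load, "vmstat -m > after.txt", then "malloc_growth before.txt after.txt". A type that keeps growing across repeated runs is a leak candidate; a single increase may just be caching.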

--HPS



