Date:      Thu, 21 Jul 2016 09:54:20 +0200
From:      Julien Charbon <jch@freebsd.org>
To:        Larry Rosenman <ler@lerctr.org>
Cc:        Gleb Smirnoff <glebius@freebsd.org>, rrs@freebsd.org, hselasky@freebsd.org, net@freebsd.org, current@freebsd.org, owner-freebsd-current@freebsd.org
Subject:   Re: panic with tcp timers
Message-ID:  <548bf673-580d-350a-9f91-88553f3c82f1@freebsd.org>
In-Reply-To: <eb862d55795687387e22f0dd83e9f3d2@thebighonker.lerctr.org>
References:  <20160617045319.GE1076@FreeBSD.org> <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org> <20160620073917.GI1076@FreeBSD.org> <1d18d0e2-3e42-cb26-928c-2989d0751884@freebsd.org> <dbb33989-538a-69e8-7243-26c554da266c@freebsd.org> <eb862d55795687387e22f0dd83e9f3d2@thebighonker.lerctr.org>


 Hi,

On 7/14/16 11:02 PM, Larry Rosenman wrote:
> On 2016-07-14 12:01, Julien Charbon wrote:
>> On 6/20/16 11:55 AM, Julien Charbon wrote:
>>> On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
>>>> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
>>>> J> > Comparing stable/10 and head, I see two changes that could
>>>> J> > affect that:
>>>> J> >
>>>> J> > - callout_async_drain
>>>> J> > - switch to READ lock for inp info in tcp timers
>>>> J> >
>>>> J> > That's why you are in To, Julien and Hans :)
>>>> J> >
>>>> J> > We continue investigating, and I will keep you updated.
>>>> J> > However, any help is welcome. I can share cores.
>>>>
>>>> Now, spending some time with cores and adding a bunch of
>>>> extra CTRs, I have a sequence of events that lead to the
>>>> panic. In short, the bug is in the callout system. It seems
>>>> to be not relevant to the callout_async_drain, at least for
>>>> now. The transition to READ lock unmasked the problem, that's
>>>> why NetflixBSD 10 doesn't panic.
>>>>
>>>> The panic requires heavy contention on the TCP info lock.
>>>>
>>>> [CPU 1] the callout fires, tcp_timer_keep entered
>>>> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
>>>> [CPU 2] schedules the callout
>>>> [CPU 2] tcp_discardcb called
>>>> [CPU 2] callout successfully canceled
>>>> [CPU 2] tcpcb freed
>>>> [CPU 1] unblocks... panic
>>>>
>>>> When the lock was WLOCK, all contenders were resumed in a
>>>> sequence they came to the lock. Now, that they are readers,
>>>> once the lock is released, readers are resumed in a "random"
>>>> order, and this allows tcp_discardcb to go before the old
>>>> running callout, and this unmasks the panic.
>>>
>>>  Highly interesting.  I should be able to reproduce that (will be useful
>>> for testing the corresponding fix).
>>
>>  Finally, I was able to reproduce it (without glebius fix).   The trick
>> was to really lower TCP keep timer expiration:
>>
>> $ sysctl -a | grep tcp.keep
>> net.inet.tcp.keepidle: 7200000
>> net.inet.tcp.keepintvl: 75000
>> net.inet.tcp.keepinit: 75000
>> net.inet.tcp.keepcnt: 8
>> $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl
>> net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
>> Password:
>> net.inet.tcp.keepidle: 7200000 -> 10
>> net.inet.tcp.keepintvl: 75000 -> 50
>> net.inet.tcp.keepinit: 75000 -> 10
>>
>>  Note: It will certainly close all your ssh connections to the tested
>> server.
>>
>>  Now I will test in order:
>>
>> #1. glebius fix
>> https://svnweb.freebsd.org/base?view=revision&revision=302350
>>
>> #2. rrs extra fix
>> https://reviews.freebsd.org/D7135
>>
>> #3. rrs TCP Timer cleanup
>> https://reviews.freebsd.org/D7136
>
> please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
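
 As an aside, the interleaving Gleb describes above can be replayed in a
small user-space model.  This is only a sketch in Python, not kernel code:
the Callout class, stop_buggy(), stop_checked() and replay() are
hypothetical names, and the assumption that the stop call reports success
while an earlier invocation is still blocked on the lock is taken from the
sequence above, not from the callout(9) sources.  It does not claim to
mirror what r302350 changes.

```python
class Callout:
    """Toy stand-in for a kernel callout; NOT the real callout(9) API."""

    def __init__(self):
        self.pending = False  # scheduled, handler not started yet
        self.running = False  # handler entered, has not returned yet

    def schedule(self):
        self.pending = True

    def fire(self):
        # Timer expires: the handler starts and then blocks on the lock.
        self.pending = False
        self.running = True

    def stop_buggy(self):
        # Assumed faulty behaviour: report success because a pending
        # instance was removed, ignoring the handler still in flight.
        was_pending = self.pending
        self.pending = False
        return was_pending

    def stop_checked(self):
        # Report success only when no handler invocation is in flight.
        was_pending = self.pending
        self.pending = False
        return was_pending and not self.running


def replay(stop):
    """Replay the CPU 1 / CPU 2 sequence; return True on use-after-free."""
    callout = Callout()
    tcpcb_freed = False

    callout.schedule()
    callout.fire()           # [CPU 1] tcp_timer_keep entered, blocks on lock
    callout.schedule()       # [CPU 2] schedules the callout
    if stop(callout):        # [CPU 2] tcp_discardcb: callout "canceled"
        tcpcb_freed = True   # [CPU 2] tcpcb freed
    # [CPU 1] unblocks and dereferences the tcpcb
    return callout.running and tcpcb_freed


print(replay(Callout.stop_buggy))    # True: handler touches a freed tcpcb
print(replay(Callout.stop_checked))  # False: the free is not performed
```

 The point of the sketch: once the handler has started, a successful stop
result alone does not prove that the handler will no longer touch the
tcpcb, so the caller must not free it on that result alone.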

 My test results so far:

#1. r302350:  First glebius TCP timer fix:  No TCP timer kernel panic
during 48h under a load of 200k TCP queries per second.

 Sadly I was unable to reproduce the issue described here:

panic: bogus refcnt 0 on lle 0xfffff80004608c00
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884

#2. r303098:  Includes all kernel callout changes since r302350 (updates
to the callout code are indeed always full of surprises):
https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log&pathrev=303098

 No kernel panic either.

 Still to test:

#3. rrs extra fix (if still relevant now)
https://reviews.freebsd.org/D7135

#4. rrs TCP Timer cleanup:
https://reviews.freebsd.org/D7136

 My 2 cents.

--
Julien




