Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 20 Jun 2016 00:39:17 -0700
From:      Gleb Smirnoff <glebius@FreeBSD.org>
To:        rrs@FreeBSD.org
Cc:        Julien Charbon <jch@freebsd.org>, hselasky@FreeBSD.org, rrs@FreeBSD.org, net@FreeBSD.org, current@FreeBSD.org
Subject:   Re: panic with tcp timers
Message-ID:  <20160620073917.GI1076@FreeBSD.org>
In-Reply-To: <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org>
References:  <20160617045319.GE1076@FreeBSD.org> <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
  Hi!

On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
J> > Comparing stable/10 and head, I see two changes that could
J> > affect that:
J> > 
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> > 
J> > That's why you are in To, Julien and Hans :)
J> > 
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.

Now, spending some time with cores and adding a bunch of
extra CTRs, I have a sequence of events that lead to the
panic. In short, the bug is in the callout system. It seems
to be not relevant to the callout_async_drain, at least for
now. The transition to READ lock unmasked the problem, that's
why NetflixBSD 10 doesn't panic.

The panic requires heavy contention on the TCP info lock.

[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic

When the lock was WLOCK, all contenders were resumed in a
sequence they came to the lock. Now, that they are readers,
once the lock is released, readers are resumed in a "random"
order, and this allows tcp_discardcb to go before the old
running callout, and this unmasks the panic.

Exact quote from ktrdump:

0xfffff82089867418 <-- faulty tcpcb
0xfffff82089867728 <-- its tt_keep

glebius@piston:~/cores:|>grep 0xfffff82089867418 ktr
19012192  11 scheduled 0xfffff820898677a8 func 0xffffffff80628740 arg 0xfffff82089867418 in 20553.1e3ff819
19012190  11 rescheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20613.04a6193d
19009551  11 rescheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20613.042306cf
19009549  11 scheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20563.0423409f
19009545  11 tcp_newtcpcb: 0xfffff82089867418
18962842  11 tcp_discardcb: free 0xfffff82089867418
18962830  11 tcp_discardcb: 0xfffff82089867418 draincnt 0
18962826  11 failed to stop 0xfffff820898677a8 func 0xffffffff80628740 arg 0xfffff82089867418
18962822  11 cancelled3 0xfffff82089867768 func 0xffffffff806288e0 arg 0xfffff82089867418
18962808  11 cancelled3 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418
18962804  11 failed to stop 0xfffff820898676e8 func 0xffffffff80628fa0 arg 0xfffff82089867418
18962792  11 failed to stop 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418
18962755  11 tcp_discardcb: 0xfffff82089867418
18962742  11 scheduled 0xfffff82089867768 func 0xffffffff806288e0 arg 0xfffff82089867418 in 20632.ffc8d308
18962703  11 cancelled3 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418
18962695  11 scheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20612.ffc8ea28
18923275   6 tcp_timer_keep: 0xfffff82089867418
18923274   6 callout 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418
17850361   9 scheduled 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418 in 20553.5aec0004


-- 
Totus tuus, Glebius.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20160620073917.GI1076>