Date:      Tue, 22 Sep 2015 18:46:55 +0200
From:      Palle Girgensohn <girgen@FreeBSD.org>
To:        Julien Charbon <jch@FreeBSD.org>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, freebsd-net@freebsd.org, Hans Petter Selasky <hps@selasky.org>
Subject:   Re: Kernel panics in tcp_twclose
Message-ID:  <73856F2B-3E70-483C-9988-C84E798CEB44@FreeBSD.org>
In-Reply-To: <3721F099-F45D-4DCD-8AB3-84D1ABC44145@FreeBSD.org>
References:  <26B0FF93-8AE3-4514-BDA1-B966230AAB65@FreeBSD.org> <55FC1809.3070903@freebsd.org> <20150918160605.GN67105@kib.kiev.ua> <55FFBE01.6060706@freebsd.org> <3721F099-F45D-4DCD-8AB3-84D1ABC44145@FreeBSD.org>

Hi all,


> On 21 Sep 2015, at 15:53, Palle Girgensohn <girgen@FreeBSD.org> wrote:
>
>>
>> On 21 Sep 2015, at 10:21, Julien Charbon <jch@FreeBSD.org> wrote:
>>
>>
>> Hi Konstantin, Hi Palle,
>>
>> On 18/09/15 18:06, Konstantin Belousov wrote:
>>> On Fri, Sep 18, 2015 at 03:56:25PM +0200, Julien Charbon wrote:
>>>> Hi Palle,
>>>>
>>>> On 18/09/15 11:12, Palle Girgensohn wrote:
>>>>> We see daily panics on our production systems (web server, apache
>>>>> running MPM event, openjdk8. Kernel with VIMAGE. Jails using netgraph
>>>>> interfaces [not epair]).
>>>>>
>>>>> The problem started after the summer. Normal port upgrades seem to
>>>>> be the only difference. The problem occurs with 10.2-p2 kernel as
>>>>> well as 10.1-p4 and 10.1-p15.
>>>>>
>>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=203175
>>>>>
>>>>> Any ideas?
>>>>
>>>> Thanks for your detailed report.  I am not aware of any tcp_twclose()
>>>> related issues (without VIMAGE) since FreeBSD 10.0 (does not mean there
>>>> are none).  A few interesting facts (at least for me):
>>>>
>>>> - Your crash happens when unlocking an inp exclusive lock with INP_WUNLOCK()
>>>>
>>>> - Something is already wrong before calling turnstile_broadcast() as it
>>>> is called with ts = NULL:
>>> In a kernel without witness this is a 99%-sure indication of an attempt to
>>> unlock a lock that is not owned.
>>
>> Thanks, this is useful.  So far I did not find any path where
>> tcp_twclose() can call INP_WUNLOCK without having the exclusive lock
>> held, that makes this issue interesting.
>>
>>>> I won't go too far here as I am not expert enough in VIMAGE, but one
>>>> question anyway:
>>>>
>>>> - Can you correlate this kernel panic to a particular event?  Like for
>>>> example a VIMAGE/VNET jail destruction.
>>>>
>>>> I will test that on my side on a 10.2 machine.
>>
>> I did not find any issues while testing 10.2 + VIMAGE on my side. Thus,
>> Palle, what I would suggest:
>>
>> - First, test with stable/10 to see if by chance this issue has already
>> been fixed in the stable branch.
>>
>> - Second, if the issue is still in stable/10, compile a 10.2 kernel with
>> these options:
>>
>> options        DDB
>> options        DEADLKRES
>> options        INVARIANTS
>> options        INVARIANT_SUPPORT
>> options        WITNESS
>> options        WITNESS_SKIPSPIN
>>
>> To see where the original fault is coming from.
>
> Hi,
>
> We just had two crashes within 15 minutes using 10.2 with these two changesets added:
>
> https://svnweb.freebsd.org/changeset/base/287261
>
> https://svnweb.freebsd.org/changeset/base/287780
>
> We don't always get a core dump, but the second time, we did.
>
> A very similar stack trace, but not identical:
>
> (kgdb) #0  doadump (textdump=<value optimized out>) at pcpu.h:219
> #1  0xffffffff80949a82 in kern_reboot (howto=260)
>    at /usr/src/sys/kern/kern_shutdown.c:451
> #2  0xffffffff80949e65 in vpanic (fmt=<value optimized out>,
>    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:758
> #3  0xffffffff80949cf3 in panic (fmt=0x0)
>    at /usr/src/sys/kern/kern_shutdown.c:687
> #4  0xffffffff80d5d0bb in trap_fatal (frame=<value optimized out>,
>    eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:851
> #5  0xffffffff80d5d3bd in trap_pfault (frame=0xfffffe1760bc1840,
>    usermode=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:674
> #6  0xffffffff80d5ca5a in trap (frame=0xfffffe1760bc1840)
>    at /usr/src/sys/amd64/amd64/trap.c:440
> #7  0xffffffff80d42dd2 in calltrap ()
>    at /usr/src/sys/amd64/amd64/exception.S:236
> #8  0xffffffff8099861c in turnstile_broadcast (ts=0x0, queue=1)
>    at /usr/src/sys/kern/subr_turnstile.c:838
> #9  0xffffffff80948100 in __rw_wunlock_hard (c=0xfffff811c43487a0, tid=1,
>    file=0x1 <Address 0x1 out of bounds>, line=1)
>    at /usr/src/sys/kern/kern_rwlock.c:988
> #10 0xffffffff80b067c4 in tcp_twclose (tw=<value optimized out>,
>    reuse=<value optimized out>) at /usr/src/sys/netinet/tcp_timewait.c:540
> #11 0xffffffff80b06e0b in tcp_tw_2msl_scan (reuse=0)
>    at /usr/src/sys/netinet/tcp_timewait.c:748
> #12 0xffffffff80b04b0e in tcp_slowtimo ()
>    at /usr/src/sys/netinet/tcp_timer.c:198
> #13 0xffffffff809b7a04 in pfslowtimo (arg=0x0)
>    at /usr/src/sys/kern/uipc_domain.c:508
> #14 0xffffffff8095f91b in softclock_call_cc (c=0xffffffff81620bf0,
>    cc=0xffffffff8169dc00, direct=0) at /usr/src/sys/kern/kern_timeout.c:685
> #15 0xffffffff8095fd44 in softclock (arg=0xffffffff8169dc00)
>    at /usr/src/sys/kern/kern_timeout.c:814
> #16 0xffffffff8091592b in intr_event_execute_handlers (
>    p=<value optimized out>, ie=0xfffff801102e0d00)
>    at /usr/src/sys/kern/kern_intr.c:1264
> #17 0xffffffff80915d76 in ithread_loop (arg=0xfffff801102adee0)
>    at /usr/src/sys/kern/kern_intr.c:1277
> #18 0xffffffff8091347a in fork_exit (
>    callout=0xffffffff80915ce0 <ithread_loop>, arg=0xfffff801102adee0,
>    frame=0xfffffe1760bc1c00) at /usr/src/sys/kern/kern_fork.c:1018
> #19 0xffffffff80d4330e in fork_trampoline ()
>    at /usr/src/sys/amd64/amd64/exception.S:611
> #20 0x0000000000000000 in ?? ()
>
> I'll try stable/10 now. Would you suggest a "clean" stable/10, or could 287621 and 287780 help?
>
> I'll add the suggested debugging options right away.
>
> Palle


I have a new core dump from ^/stable/10 with:


options        DDB
options        DEADLKRES
options        INVARIANTS
options        INVARIANT_SUPPORT
options        WITNESS
options        WITNESS_SKIPSPIN


What can I do with the core dump? kgdb ends the backtrace with "corrupt stack?":

(kgdb) #0  doadump (textdump=1) at pcpu.h:219
#1  0xffffffff8094b337 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:451
#2  0xffffffff8094b845 in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:758
#3  0xffffffff8094b6d9 in kassert_panic (fmt=<value optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:646
#4  0xffffffff80b1ee59 in tcp_usr_detach (so=<value optimized out>)
    at /usr/src/sys/netinet/tcp_usrreq.c:202
#5  0xffffffff809cd291 in sofree (so=0xfffff801dd302000)
    at /usr/src/sys/kern/uipc_socket.c:747
#6  0xffffffff809cdb00 in soclose (so=<value optimized out>)
    at /usr/src/sys/kern/uipc_socket.c:849
#7  0xffffffff808fe659 in _fdrop (fp=0xfffff802a593db40, td=0x0) at file.h:343
#8  0xffffffff80901092 in closef (fp=0xfffff802a593db40,
    td=0xfffff80eebc894a0) at /usr/src/sys/kern/kern_descrip.c:2338
#9  0xffffffff808feb5d in closefp (fdp=0xfffff80b20cce000,
    fd=<value optimized out>, fp=0xfffff802a593db40, td=0xfffff80eebc894a0,
    holdleaders=<value optimized out>)
    at /usr/src/sys/kern/kern_descrip.c:1194
#10 0xffffffff80d7bc3a in amd64_syscall (td=0xfffff80eebc894a0, traced=0)
    at subr_syscall.c:134
#11 0xffffffff80d5f1db in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:396
#12 0x0000000801c8d94a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb)
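
One observation on this trace: frame #3 is kassert_panic, so this time the
kernel stopped on a failed assertion in tcp_usr_detach (tcp_usrreq.c:202)
rather than on a page fault as in the earlier trace. The sketch below shows
that general pattern in user space. Everything in it is hypothetical (the
CHECK macro, struct fake_pcb, fake_detach); it is not the kernel's assertion
code or the real tcp_usr_detach, and it counts failures instead of panicking
so that it can be run safely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for kernel state; not the real struct inpcb. */
struct fake_pcb {
	int freed;	/* set once the object has been released */
	int detached;	/* set by a successful detach */
};

/* Count failed checks instead of panicking, so the sketch stays testable. */
static int check_failures;

/* Hypothetical assertion-style macro that reports file and line. */
#define CHECK(exp, msg) do {						\
	if (!(exp)) {							\
		check_failures++;					\
		fprintf(stderr, "%s:%d: check failed: %s\n",		\
		    __FILE__, __LINE__, (msg));				\
	}								\
} while (0)

/* Returns 0 on success, -1 if the precondition is violated. */
int
fake_detach(struct fake_pcb *p)
{
	assert(p != NULL);
	CHECK(!p->freed, "detach called on an already freed pcb");
	if (p->freed)
		return (-1);
	p->detached = 1;
	return (0);
}
```

In the real kernel a failed assertion panics immediately, so the backtrace
points at the function where the check failed rather than at a later victim
of the bad state.
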


Thanks,
Palle



