Date: Mon, 21 Sep 2015 15:53:41 +0200
From: Palle Girgensohn <girgen@FreeBSD.org>
To: Julien Charbon <jch@FreeBSD.org>
Cc: Konstantin Belousov <kostikbel@gmail.com>, freebsd-net@freebsd.org, Hans Petter Selasky <hps@selasky.org>
Subject: Re: Kernel panics in tcp_twclose
Message-ID: <3721F099-F45D-4DCD-8AB3-84D1ABC44145@FreeBSD.org>
In-Reply-To: <55FFBE01.6060706@freebsd.org>
References: <26B0FF93-8AE3-4514-BDA1-B966230AAB65@FreeBSD.org> <55FC1809.3070903@freebsd.org> <20150918160605.GN67105@kib.kiev.ua> <55FFBE01.6060706@freebsd.org>

> On 21 Sep 2015, at 10:21, Julien Charbon <jch@FreeBSD.org> wrote:
>
>
> Hi Konstantin, Hi Palle,
>
> On 18/09/15 18:06, Konstantin Belousov wrote:
>> On Fri, Sep 18, 2015 at 03:56:25PM +0200, Julien Charbon wrote:
>>> Hi Palle,
>>>
>>> On 18/09/15 11:12, Palle Girgensohn wrote:
>>>> We see daily panics on our production systems (web server, apache
>>>> running MPM event, openjdk8. Kernel with VIMAGE. Jails using netgraph
>>>> interfaces [not epair]).
>>>>
>>>> The problem started after the summer. Normal port upgrades seem to
>>>> be the only difference. The problem occurs with 10.2-p2 kernel as
>>>> well as 10.1-p4 and 10.1-p15.
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=203175
>>>>
>>>> Any ideas?
>>>
>>> Thanks for your detailed report. I am not aware of any tcp_twclose()
>>> related issues (without VIMAGE) since FreeBSD 10.0 (does not mean there
>>> are none). A few interesting facts (at least for me):
>>>
>>> - Your crash happens when unlocking an inp exclusive lock with INP_WUNLOCK()
>>>
>>> - Something is already wrong before calling turnstile_broadcast() as it
>>> is called with ts = NULL:
>> In a kernel without witness this is a 99%-sure indication of an attempt to
>> unlock a lock that is not owned.
>
> Thanks, this is useful. So far I did not find any path where
> tcp_twclose() can call INP_WUNLOCK without having the exclusive lock
> held, which makes this issue interesting.
>
>>> I won't go too far here as I am not expert enough in VIMAGE, but one
>>> question anyway:
>>>
>>> - Can you correlate this kernel panic to a particular event? Like for
>>> example a VIMAGE/VNET jail destruction.
>>>
>>> I will test that on my side on a 10.2 machine.
>
> I did not find any issues while testing 10.2 + VIMAGE on my side. Thus,
> Palle, what I would suggest:
>
> - First, test with stable/10 to see if by chance this issue has already
> been fixed in the stable branch.
>
> - Second, if the issue is still in stable/10, compile a 10.2 kernel with
> these options:
>
> options DDB
> options DEADLKRES
> options INVARIANTS
> options INVARIANT_SUPPORT
> options WITNESS
> options WITNESS_SKIPSPIN
>
> To see where the original fault is coming from.

Hi,

We just had two crashes within 15 minutes using 10.2 with these two added:

https://svnweb.freebsd.org/changeset/base/287261
https://svnweb.freebsd.org/changeset/base/287780

We don't always get a core dump, but the second time we did. The stack trace is very similar, but not identical:

(kgdb) #0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff80949a82 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:451
#2  0xffffffff80949e65 in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:758
#3  0xffffffff80949cf3 in panic (fmt=0x0)
    at /usr/src/sys/kern/kern_shutdown.c:687
#4  0xffffffff80d5d0bb in trap_fatal (frame=<value optimized out>,
    eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:851
#5  0xffffffff80d5d3bd in trap_pfault (frame=0xfffffe1760bc1840,
    usermode=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:674
#6  0xffffffff80d5ca5a in trap (frame=0xfffffe1760bc1840)
    at /usr/src/sys/amd64/amd64/trap.c:440
#7  0xffffffff80d42dd2 in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff8099861c in turnstile_broadcast (ts=0x0, queue=1)
    at /usr/src/sys/kern/subr_turnstile.c:838
#9  0xffffffff80948100 in __rw_wunlock_hard (c=0xfffff811c43487a0, tid=1,
    file=0x1 <Address 0x1 out of bounds>, line=1)
    at /usr/src/sys/kern/kern_rwlock.c:988
#10 0xffffffff80b067c4 in tcp_twclose (tw=<value optimized out>,
    reuse=<value optimized out>) at /usr/src/sys/netinet/tcp_timewait.c:540
#11 0xffffffff80b06e0b in tcp_tw_2msl_scan (reuse=0)
    at /usr/src/sys/netinet/tcp_timewait.c:748
#12 0xffffffff80b04b0e in tcp_slowtimo ()
    at /usr/src/sys/netinet/tcp_timer.c:198
#13 0xffffffff809b7a04 in pfslowtimo (arg=0x0)
    at /usr/src/sys/kern/uipc_domain.c:508
#14 0xffffffff8095f91b in softclock_call_cc (c=0xffffffff81620bf0,
    cc=0xffffffff8169dc00, direct=0) at /usr/src/sys/kern/kern_timeout.c:685
#15 0xffffffff8095fd44 in softclock (arg=0xffffffff8169dc00)
    at /usr/src/sys/kern/kern_timeout.c:814
#16 0xffffffff8091592b in intr_event_execute_handlers (
    p=<value optimized out>, ie=0xfffff801102e0d00)
    at /usr/src/sys/kern/kern_intr.c:1264
#17 0xffffffff80915d76 in ithread_loop (arg=0xfffff801102adee0)
    at /usr/src/sys/kern/kern_intr.c:1277
#18 0xffffffff8091347a in fork_exit (
    callout=0xffffffff80915ce0 <ithread_loop>, arg=0xfffff801102adee0,
    frame=0xfffffe1760bc1c00) at /usr/src/sys/kern/kern_fork.c:1018
#19 0xffffffff80d4330e in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:611
#20 0x0000000000000000 in ?? ()

I'll try stable/10 now. Would you suggest a "clean" stable/10, or could 287261 and 287780 help?

I'll add the suggested debugging options right away.

Palle
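
For reference, a minimal sketch of a kernel configuration carrying the debugging options quoted above. It assumes an amd64 source tree under /usr/src and a config file named DEBUG10; the file name and ident are hypothetical, and the `include GENERIC` line assumes the stock GENERIC config is the base rather than the VIMAGE-enabled config actually used on the affected systems.

```
# /usr/src/sys/amd64/conf/DEBUG10  (hypothetical file name)
include GENERIC                    # assumption: start from the stock config
ident   DEBUG10                    # hypothetical kernel ident

options DDB                        # in-kernel debugger
options DEADLKRES                  # deadlock resolver
options INVARIANTS                 # extra run-time consistency checks
options INVARIANT_SUPPORT          # support code required by INVARIANTS
options WITNESS                    # lock order and ownership verification
options WITNESS_SKIPSPIN           # do not run WITNESS on spin mutexes
```

Such a config would typically be built and installed from /usr/src with `make buildkernel KERNCONF=DEBUG10` followed by `make installkernel KERNCONF=DEBUG10`. Any site-specific options (VIMAGE, for example) would still need to be added to it.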