Date: Thu, 24 Sep 2015 00:05:07 +0200 From: Palle Girgensohn <girgen@pingpong.net> To: Julien Charbon <jch@freebsd.org> Cc: George Neville-Neil <gnn@FreeBSD.org>, freebsd-net@freebsd.org Subject: Re: Kernel panics in tcp_twclose Message-ID: <54767991-9D3B-4ECB-A07E-CFA21A54BBDD@pingpong.net> In-Reply-To: <5602F044.5010606@freebsd.org> References: <26B0FF93-8AE3-4514-BDA1-B966230AAB65@FreeBSD.org> <55FC1809.3070903@freebsd.org> <20150918160605.GN67105@kib.kiev.ua> <55FFBE01.6060706@freebsd.org> <3721F099-F45D-4DCD-8AB3-84D1ABC44145@FreeBSD.org> <73856F2B-3E70-483C-9988-C84E798CEB44@FreeBSD.org> <44EBAC98-4761-4E47-8E47-5032430A1C8A@FreeBSD.org> <56019AF8.8000705@freebsd.org> <F9D29C16-502B-43A1-BE2C-D2AD30F0B9EF@FreeBSD.org> <5601CF2D.9030307@freebsd.org> <E09DF89D-AAC5-48FD-8B75-EEAB937A5C32@FreeBSD.org> <5602E90A.9050504@freebsd.org> <0931591A-23EC-40CB-A109-72E9308B1A2D@pingpong.net> <5602F044.5010606@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
Hi, > 23 sep 2015 kl. 20:32 skrev Julien Charbon <jch@freebsd.org>: >=20 >=20 > Hi, >=20 > On 23/09/15 20:26, Palle Girgensohn wrote: >>> 23 sep. 2015 kl. 20:01 skrev Julien Charbon <jch@freebsd.org>: >>> On 23/09/15 16:36, Palle Girgensohn wrote: >>>>> 22 sep 2015 kl. 23:59 skrev Julien Charbon <jch@freebsd.org>: On >>>>> 22/09/15 22:58, Palle Girgensohn wrote: >>>>>>> 22 sep 2015 kl. 20:16 skrev Julien Charbon <jch@freebsd.org>:=20 >>>>>>> On 22/09/15 18:49, Palle Girgensohn wrote: >>>>>>>>> 22 sep 2015 kl. 18:46 skrev Palle Girgensohn=20 >>>>>>>>> <girgen@FreeBSD.org>: >>>>>>>>>> 21 sep 2015 kl. 15:53 skrev Palle Girgensohn=20 >>>>>>>>>> <girgen@FreeBSD.org>: >>>>>>>>>>> 21 sep 2015 kl. 10:21 skrev Julien Charbon=20 >>>>>>>>>>> <jch@FreeBSD.org>: On 18/09/15 18:06, Konstantin >>>>>>>>>>> Belousov wrote: >>>>>>>>>>>> On Fri, Sep 18, 2015 at 03:56:25PM +0200, Julien >>>>>>>>>>>> Charbon wrote: >>>>>>>>>>>>> [...] >>>>>>>>>>> - Second, if issue is still in stable/10, compile 10.2=20 >>>>>>>>>>> kernel with these options: >>>>>>>>>>>=20 >>>>>>>>>>> options DDB options DEADLKRES options=20 >>>>>>>>>>> INVARIANTS options INVARIANT_SUPPORT options >>>>>>>>>>> WITNESS options WITNESS_SKIPSPIN >>>>>>>>>>>=20 >>>>>>>>>>> To see where the original fault is coming from. >>>>>>>>>> [...] >>>>>>>>>>=20 >>>>>>>>>> I'll try stable/10 now. Would you suggest a "clean"=20 >>>>>>>>>> stable/10, or could 287621 and 287780 help? >>>>>>>>>>=20 >>>>>>>>>> I'll add the debugging suggested options right away. >>>>>>>>>>=20 >>>>>>>>>> Palle >>>>>>>>>=20 >>>>>>>>> I have a new core dump from ^/stable/10 with: >>>>>>>>>=20 >>>>>>>>> options DDB options DEADLKRES options >>>>>>>>> INVARIANTS options INVARIANT_SUPPORT options WITNESS >>>>>>>>> options WITNESS_SKIPSPIN >>>>>>>>=20 >>>>>>>> # kgdb kernel /var/crash/vmcore.2 GNU gdb 6.1.1 [FreeBSD]=20 >>>>>>>> Copyright 2004 Free Software Foundation, Inc. GDB is free=20 >>>>>>>> software, covered by the GNU General Public License, and you >>>>>>>> are welcome to change it and/or distribute copies of it under >>>>>>>> certain conditions. Type "show copying" to see the >>>>>>>> conditions. There is absolutely no warranty for GDB. Type >>>>>>>> "show warranty" for details. This GDB was configured as >>>>>>>> "amd64-marcel-freebsd"... >>>>>>>>=20 >>>>>>>> Unread portion of the kernel message buffer: panic: >>>>>>>> tcp_detach: INP_TIMEWAIT && INP_DROPPED && tp !=3D NULL cpuid =3D= >>>>>>>> 16 KDB: stack backtrace: db_trace_self_wrapper() at=20 >>>>>>>> db_trace_self_wrapper+0x2b/frame 0xfffffe183d9e97e0=20 >>>>>>>> kdb_backtrace() at kdb_backtrace+0x39/frame >>>>>>>> 0xfffffe183d9e9890 vpanic() at vpanic+0x126/frame >>>>>>>> 0xfffffe183d9e98d0 kassert_panic() at >>>>>>>> kassert_panic+0x139/frame 0xfffffe183d9e9940 >>>>>>>> tcp_usr_detach() at tcp_usr_detach+0xf9/frame >>>>>>>> 0xfffffe183d9e9970 sofree() at sofree+0x1f1/frame >>>>>>>> 0xfffffe183d9e99a0 soclose() at soclose+0x3a0/frame >>>>>>>> 0xfffffe183d9e99f0 _fdrop() at _fdrop+0x29/frame >>>>>>>> 0xfffffe183d9e9a10 closef() at closef+0x1e2/frame >>>>>>>> 0xfffffe183d9e9aa0 closefp() at closefp+0x9d/frame >>>>>>>> 0xfffffe183d9e9ae0 amd64_syscall() at=20 >>>>>>>> amd64_syscall+0x25a/frame 0xfffffe183d9e9bf0 Xfast_syscall() >>>>>>>> at Xfast_syscall+0xfb/frame 0xfffffe183d9e9bf0 --- syscall >>>>>>>> (6, FreeBSD ELF64, sys_close), rip =3D 0x801c8d94a, rsp =3D=20 >>>>>>>> 0x7ffff91c8668, rbp =3D 0x7ffff91c8680 --- KDB: enter: panic=20 >>>>>>>> Uptime: 18h57m59s Dumping 23085 out of 98263=20 >>>>>>>> MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% >>>>>>>>=20 >>>>>>>> Reading symbols from /boot/kernel/nullfs.ko.symbols...done.=20 >>>>>>>> Loaded symbols for /boot/kernel/nullfs.ko.symbols Reading >>>>>>>> symbols from /boot/kernel/zfs.ko.symbols...done. Loaded >>>>>>>> symbols for /boot/kernel/zfs.ko.symbols Reading symbols from >>>>>>>> /boot/kernel/opensolaris.ko.symbols...done. Loaded symbols >>>>>>>> for /boot/kernel/opensolaris.ko.symbols Reading symbols from >>>>>>>> /boot/kernel/ng_bridge.ko.symbols...done. Loaded symbols for >>>>>>>> /boot/kernel/ng_bridge.ko.symbols Reading symbols from=20 >>>>>>>> /boot/kernel/netgraph.ko.symbols...done. Loaded symbols for=20 >>>>>>>> /boot/kernel/netgraph.ko.symbols Reading symbols from=20 >>>>>>>> /boot/kernel/ng_eiface.ko.symbols...done. Loaded symbols for >>>>>>>> /boot/kernel/ng_eiface.ko.symbols Reading symbols from=20 >>>>>>>> /boot/kernel/ng_ether.ko.symbols...done. Loaded symbols for=20 >>>>>>>> /boot/kernel/ng_ether.ko.symbols Reading symbols from=20 >>>>>>>> /boot/kernel/accf_data.ko.symbols...done. Loaded symbols for >>>>>>>> /boot/kernel/accf_data.ko.symbols Reading symbols from=20 >>>>>>>> /boot/kernel/accf_http.ko.symbols...done. Loaded symbols for >>>>>>>> /boot/kernel/accf_http.ko.symbols Reading symbols from=20 >>>>>>>> /boot/kernel/ums.ko.symbols...done. Loaded symbols for=20 >>>>>>>> /boot/kernel/ums.ko.symbols Reading symbols from=20 >>>>>>>> /boot/kernel/ng_socket.ko.symbols...done. Loaded symbols for >>>>>>>> /boot/kernel/ng_socket.ko.symbols Reading symbols from=20 >>>>>>>> /boot/kernel/fdescfs.ko.symbols...done. Loaded symbols for=20 >>>>>>>> /boot/kernel/fdescfs.ko.symbols #0 doadump (textdump=3D1) at=20= >>>>>>>> pcpu.h:219 219 __asm("movq %%gs:%1,%0" : "=3Dr" (td) = (kgdb) bt >>>>>>>> #0 doadump (textdump=3D1) at pcpu.h:219 #1 0xffffffff8094b337 >>>>>>>> in kern_reboot (howto=3D260) at >>>>>>>> /usr/src/sys/kern/kern_shutdown.c:451 #2 0xffffffff8094b845 >>>>>>>> in vpanic (fmt=3D<value optimized out>, ap=3D<value optimized >>>>>>>> out>) at /usr/src/sys/kern/kern_shutdown.c:758 #3 >>>>>>>> 0xffffffff8094b6d9 in kassert_panic (fmt=3D<value optimized >>>>>>>> out>) at /usr/src/sys/kern/kern_shutdown.c:646 #4 >>>>>>>> 0xffffffff80b1ee59 in tcp_usr_detach (so=3D<value optimized >>>>>>>> out>) at /usr/src/sys/netinet/tcp_usrreq.c:202 #5 >>>>>>>> 0xffffffff809cd291 in sofree (so=3D0xfffff801dd302000) at=20 >>>>>>>> /usr/src/sys/kern/uipc_socket.c:747 #6 0xffffffff809cdb00 in >>>>>>>> soclose (so=3D<value optimized out>) at=20 >>>>>>>> /usr/src/sys/kern/uipc_socket.c:849 #7 0xffffffff808fe659 in >>>>>>>> _fdrop (fp=3D0xfffff802a593db40, td=3D0x0) at file.h:343 #8=20 >>>>>>>> 0xffffffff80901092 in closef (fp=3D0xfffff802a593db40,=20 >>>>>>>> td=3D0xfffff80eebc894a0) at >>>>>>>> /usr/src/sys/kern/kern_descrip.c:2338 #9 0xffffffff808feb5d >>>>>>>> in closefp (fdp=3D0xfffff80b20cce000, fd=3D<value optimized = out>, >>>>>>>> fp=3D0xfffff802a593db40, td=3D0xfffff80eebc894a0, >>>>>>>> holdleaders=3D<value optimized out>) at=20 >>>>>>>> /usr/src/sys/kern/kern_descrip.c:1194 #10 0xffffffff80d7bc3a >>>>>>>> in amd64_syscall (td=3D0xfffff80eebc894a0, traced=3D0) at=20 >>>>>>>> subr_syscall.c:134 #11 0xffffffff80d5f1db in Xfast_syscall () >>>>>>>> at /usr/src/sys/amd64/amd64/exception.S:396 #12 >>>>>>>> 0x0000000801c8d94a in ?? () Previous frame inner to this >>>>>>>> frame (corrupt stack?) Current language: auto; currently >>>>>>>> minimal >>>>>>>=20 >>>>>>> Thanks for the information. As I suspected the initial error >>>>>>> was elsewhere than tcp_twclose(), never got this assertion >>>>>>> before: >>>>>>>=20 >>>>>>> tcp_detach: INP_TIMEWAIT && INP_DROPPED && tp !=3D NULL >>>>>>>=20 >>>>>>> from here: >>>>>>>=20 >>>>>>> static void tcp_detach(struct socket *so, struct inpcb *inp) { >>>>>>> struct tcpcb *tp; >>>>>>>=20 >>>>>>> INP_INFO_WLOCK_ASSERT(&V_tcbinfo); INP_WLOCK_ASSERT(inp); >>>>>>>=20 >>>>>>> KASSERT(so->so_pcb =3D=3D inp, ("tcp_detach: so_pcb !=3D inp"));=20= >>>>>>> KASSERT(inp->inp_socket =3D=3D so, ("tcp_detach: inp_socket !=3D >>>>>>> so")); >>>>>>>=20 >>>>>>> tp =3D intotcpcb(inp); >>>>>>>=20 >>>>>>> if (inp->inp_flags & INP_TIMEWAIT) { if (inp->inp_flags &=20 >>>>>>> INP_DROPPED) { KASSERT(tp =3D=3D NULL, ("tcp_detach: = INP_TIMEWAIT >>>>>>> && " "INP_DROPPED && tp !=3D NULL")); >>>>>>>=20 >>>>>>> Let me check if I could find a path that could lead to this=20 >>>>>>> unexpected case. Unexpected because: INP_DROPPED and=20 >>>>>>> inp->inp_ppcb is set to NULL are set at same time here: >>>>>>>=20 >>>>>>> void tcp_twclose(struct tcptw *tw, int reuse) { struct socket >>>>>>> *so; struct inpcb *inp; >>>>>>>=20 >>>>>>> inp =3D tw->tw_inpcb; KASSERT((inp->inp_flags & INP_TIMEWAIT),=20= >>>>>>> ("tcp_twclose: !timewait")); KASSERT(intotw(inp) =3D=3D tw,=20 >>>>>>> ("tcp_twclose: inp_ppcb !=3D tw"));=20 >>>>>>> INP_INFO_WLOCK_ASSERT(&V_tcbinfo); /* in_pcbfree() */=20 >>>>>>> INP_WLOCK_ASSERT(inp); >>>>>>>=20 >>>>>>> tcp_tw_2msl_stop(tw, reuse); inp->inp_ppcb =3D NULL;=20 >>>>>>> in_pcbdrop(inp); ... >>>>>>>=20 >>>>>>> Interesting [...] >>>>=20 >>>> Just a quick update. Julien is pursuing this off list with a core >>>> dump and we are now waiting for a new core dump with the first >>>> KASSERT removed. this is on a stable/10 kernel. >>>=20 >>> By the way Palle could you also run below Dtrace script to see where >>> this tcp_close() in INP_TIMEWAIT comes from: >>>=20 >>> $ cat tcp-close-tw.d >>> fbt::tcp_close:entry >>> /args[0]->t_inpcb->inp_flags & 0x01000000/ >>> { >>> @s1[stack()] =3D count() >>> } >>>=20 >>> tick-1sec { >>> printa(@s1); >>> } >>> $ sudo dtrace -s tcp-close-tw.d >>>=20 >>> And share any backtraces reported in this dtrace script output. >>>=20 >>> George, could you check if this dtrace script makes sense for you, = and >>> if you have any improvements to add, I am quite a rookie in Dtrace = scripts. >>=20 >> Shall I let the dtrace script run continuously until the machine = crashes? Or just run it once? >=20 > Continuously until the machine crashes. You can report any backstrace > outputs like: >=20 > kernel`tcp_usr_close+0x86 > kernel`soclose+0xe4 > kernel`_fdrop+0x29 > kernel`closef+0x237 > kernel`closefp+0x95 > kernel`amd64_syscall+0x357 > kernel`0xffffffff80c83c4b > 1 >=20 > before the machine crashes. But I expect the problematic case > detection with Dtrace to be quickly followed by the crash. Will see. >=20 > -- > Julien >=20 Kernels and userland are updated to 10.2-p3 with the patch removing the = suspicous KASSERT.=20 dtrace running continously redirecting to a log file. now we're just waiting... :) Palle
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?54767991-9D3B-4ECB-A07E-CFA21A54BBDD>