Date: Mon, 08 Feb 2016 15:25:21 -0800
From: Chris Torek <torek@ixsystems.com>
To: freebsd-net@freebsd.org
Subject: inp_gcmoptions / imo_gc_task refers to free()d memory
Message-ID: <201602082325.u18NPLd6040890@elf.torek.net>
Looking for clues from multicast gurus :-)

I turned on all the memory debug options while looking for a ZFS issue,
and promptly stumbled over an IP multicast bug that occurs when creating
and deleting bridge and other virtual interfaces quickly.

The actual crash is here (note that clang optimized out several stack
frames due to tail calls), along with the relevant info:

in_leavegroup_locked() at in_leavegroup_locked+0x6b/frame 0xfffffe0babd1aae0
inp_gcmoptions() at inp_gcmoptions+0x1e2/frame 0xfffffe0babd1ab20
taskqueue_run_locked() at taskqueue_run_locked+0xf0/frame 0xfffffe0babd1ab80
taskqueue_thread_loop() at taskqueue_thread_loop+0x9b/frame 0xfffffe0babd1abb0
fork_exit() at ...

in_leavegroup_locked+0x6b:  movq 0x10(%rax),%rax
rax                         0xdeadc0dedeadc0de

That particular movq instruction is part of the
CURVNET_SET(inm->inm_ifp->if_vnet) macro call.  Here's the gdb
disassembly in the area (gdb points to that line, netinet/in_mcast.c
line 1243 or so):

0x11c7 <in_leavegroup_locked+103>:  mov  0x18(%r14),%rax
0x11cb <in_leavegroup_locked+107>:  mov  0x10(%rax),%rax
0x11cf <in_leavegroup_locked+111>:  test %rax,%rax

Here %r14 holds "inm", so it's the load to get if_vnet into %rax that
fails, not a load of ifp->foo.  (CURVNET_SET expands to an assert that
its argument is not NULL and carries the right vnet magic number, and
that check starts with the "test %rax" instruction.)

The conclusion is that inm itself is a valid pointer, but inm->inm_ifp
is 0xdeadc0dedeadc0de, i.e., we've done a free(inm, M_IPMADDR) already.
But "struct in_multi" has a ref count, so I'm a bit at a loss as to
where the reference count failed us...  (I did find a thread in the
middle of deleting a bridge interface, but at this point I think that's
a red herring.)

BTW, this is the wrong place for this fix, but I'll attach it anyway,
since turning on memory debugging found this bug too: if you turn on
memory debugging, you might hang if you have a ZFS root :-)  (I need to
make a bit of progress with my commit bit, and just commit the fix
below...)

Chris

commit cc2a7f115fe69a5d91e54654ba1f210b7db19df6
Author: Chris Torek <torek@ixsystems.com>
Date:   Sun Feb 7 13:36:34 2016 -0800

    taskqueue_drain_all: reload after sleeping

    In taskqueue_drain_all(), after we use TQ_SLEEP to wait for pending
    tasks to run, we must reload the tail of the task queue, since the
    task there may be done (and freed) by now.  The symptom of this bug
    was a thread hanging in TQ_SLEEP if a task lived in malloc()ed
    memory that was reused or (more likely) filled with 0xdeadc0de on
    free() due to memory checking being turned on.  The latter would
    make the pending value become 49374 (0xc0de) and we would wait
    forever.

diff --git a/sys/kern/subr_taskqueue.c b/sys/kern/subr_taskqueue.c
index f104bb5..55a1ca1 100644
--- a/sys/kern/subr_taskqueue.c
+++ b/sys/kern/subr_taskqueue.c
@@ -440,10 +440,9 @@ taskqueue_drain_all(struct taskqueue *queue)
 	WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, __func__);
 
 	TQ_LOCK(queue);
-	task = STAILQ_LAST(&queue->tq_queue, task, ta_link);
-	if (task != NULL)
-		while (task->ta_pending != 0)
-			TQ_SLEEP(queue, task, &queue->tq_mutex, PWAIT, "-", 0);
+	while ((task = STAILQ_LAST(&queue->tq_queue, task, ta_link)) != NULL &&
+	    task->ta_pending != 0)
+		TQ_SLEEP(queue, task, &queue->tq_mutex, PWAIT, "-", 0);
 	taskqueue_drain_running(queue);
 	KASSERT(STAILQ_EMPTY(&queue->tq_queue),
 	    ("taskqueue queue is not empty after draining"));
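
P.S.  To make the 0xdeadc0dedeadc0de diagnosis concrete, here's a
little userland sketch.  The struct and the trash() helper are invented
stand-ins for illustration (this is not the real struct in_multi); the
only faithful part is the fill pattern, which matches what the kernel's
memory debugging writes over freed memory.  On a little-endian 64-bit
machine, a stale pointer field read out of trashed memory comes back as
exactly the register value in the crash above:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Invented stand-in, not the real netinet structure. */
struct fake_in_multi {
	void		*inm_ifp;	/* an 8-byte-aligned pointer field */
	unsigned int	 inm_refcount;
};

/*
 * Fill memory with the 0xdeadc0de trash pattern, the way the kernel's
 * malloc/UMA debugging does on free().  The demo trashes in place
 * instead of really freeing, so the read below stays well-defined.
 */
static void
trash(void *p, size_t len)
{
	static const uint8_t pat[4] = { 0xde, 0xc0, 0xad, 0xde };
	uint8_t *b = p;
	size_t i;

	for (i = 0; i < len; i++)
		b[i] = pat[i % 4];
}

int
main(void)
{
	struct fake_in_multi *inm = calloc(1, sizeof(*inm));

	inm->inm_refcount = 1;
	trash(inm, sizeof(*inm));	/* the premature "free(inm, M_IPMADDR)" */

	/* The stale load in_leavegroup_locked() does on the way into
	   CURVNET_SET(inm->inm_ifp->if_vnet): */
	printf("inm->inm_ifp = %#" PRIxPTR "\n", (uintptr_t)inm->inm_ifp);
	/* prints: inm->inm_ifp = 0xdeadc0dedeadc0de */
	return (0);
}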
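
P.P.S.  The same trash pattern explains the hang that the taskqueue
commit above fixes.  Here's a matching sketch, again with invented
stand-in types rather than the real struct task (the 16-bit ta_pending
is inferred from the 49374 == 0xc0de in the commit message).  Caching
the tail pointer across the sleep means that once the task completes
and its memory is trashed, the cached ta_pending reads back as 0xc0de
and the old loop can never exit, while the fixed loop reloads the tail
each iteration and sees the empty queue:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Invented stand-ins for struct task and the queue tail. */
struct fake_task {
	struct fake_task *ta_link;	/* plays STAILQ_ENTRY(task) */
	uint16_t	  ta_pending;	/* lands at a 4-byte-aligned offset */
};

static struct fake_task *tq_tail;	/* plays STAILQ_LAST(&tq_queue, ...) */

/* Same 0xdeadc0de fill as in the previous sketch. */
static void
trash(void *p, size_t len)
{
	static const uint8_t pat[4] = { 0xde, 0xc0, 0xad, 0xde };
	uint8_t *b = p;
	size_t i;

	for (i = 0; i < len; i++)
		b[i] = pat[i % 4];
}

int
main(void)
{
	struct fake_task *task = calloc(1, sizeof(*task));
	struct fake_task *cached;

	task->ta_pending = 1;
	tq_tail = task;

	/* Old code: load the tail once, then sleep waiting on it. */
	cached = tq_tail;

	/* While "asleep", the task runs, is unlinked, and is freed
	   (trashed in place here to keep the demo well-defined): */
	tq_tail = NULL;
	trash(task, sizeof(*task));

	/* On wakeup, the old loop re-tests the stale task: */
	printf("stale ta_pending = %u (0x%x): nonzero forever\n",
	    cached->ta_pending, cached->ta_pending);

	/* The fixed loop reloads the tail after every sleep: */
	if (tq_tail == NULL)
		printf("reloaded tail is NULL: drain terminates\n");
	return (0);
}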