Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 20 Jan 2015 10:32:12 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Ryan Stone <rysto32@gmail.com>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Sleeping thread held mutex in vm_pageout_oom()
Message-ID:  <20150120083212.GC42409@kib.kiev.ua>
In-Reply-To: <CAFMmRNxz252HMWWBmRf=Z69zh2_w9cD5e1AZGeizyagKezm2Hw@mail.gmail.com>
References:  <CAFMmRNxz252HMWWBmRf=Z69zh2_w9cD5e1AZGeizyagKezm2Hw@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, Jan 19, 2015 at 05:56:24PM -0500, Ryan Stone wrote:
> I recently had a system where a DIMM failed and the OOM killer was
> constantly kicking in due to a memory-hungry daemon being constantly
> restarted.  This ended up triggering a race condition in the OOM
> killer leading to this panic:
> 
> Sleeping thread (tid 100075, pid 8) owns a non-sleepable lock
> sched_switch() at 0xffffffff8048386d = sched_switch+0x16d
> mi_switch() at 0xffffffff80469dd6 = mi_switch+0x186
> sleepq_wait() at 0xffffffff80499204 = sleepq_wait+0x44
> __lockmgr_args() at 0xffffffff8044b88b = __lockmgr_args+0x67b
> vop_stdlock() at 0xffffffff804d3689 = vop_stdlock+0x39
> ---Type <return> to continue, or q <return> to quit---
> VOP_LOCK1_APV() at 0xffffffff8069da42 = VOP_LOCK1_APV+0x52
> _vn_lock() at 0xffffffff804ed627 = _vn_lock+0x47
> vm_object_deallocate() at 0xffffffff8061eef3 = vm_object_deallocate+0x203
> vm_map_entry_deallocate() at 0xffffffff80616d2c = vm_map_entry_deallocate+0x4c
> vm_map_process_deferred() at 0xffffffff80616d62 = vm_map_process_deferred+0x32
> vm_map_remove() at 0xffffffff806183ff = vm_map_remove+0x6f
> vmspace_free() at 0xffffffff80619206 = vmspace_free+0x56
> vm_pageout_oom() at 0xffffffff806230d1 = vm_pageout_oom+0x181
> vm_pageout() at 0xffffffff8062410b = vm_pageout+0x90b
> fork_exit() at 0xffffffff8043a382 = fork_exit+0x112
> fork_trampoline() at 0xffffffff8063385e = fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffff80c3be1d00, rbp = 0 ---
> panic: sleeping thread
> cpuid = 5
> curthread = grep/grep (82989/100544)
> cpu_ticks = 1848294656444
> KDB: stack backtrace:
> db_trace_self_wrapper() at 0xffffffff801e52ba = db_trace_self_wrapper+0x2a
> panic() at 0xffffffff80461608 = panic+0x228
> propagate_priority() at 0xffffffff8049cbde = propagate_priority+0x15e
> turnstile_wait() at 0xffffffff8049d278 = turnstile_wait+0x1b8
> _mtx_lock_sleep() at 0xffffffff80451af1 = _mtx_lock_sleep+0xf1
> ---Type <return> to continue, or q <return> to quit---
> _mtx_lock_flags() at 0xffffffff80451c75 = _mtx_lock_flags+0x75
> exit1() at 0xffffffff804367de = exit1+0x36e
> sys_exit() at 0xffffffff8043731e = sys_exit+0xe
> syscallenter() at 0xffffffff8049b324 = syscallenter+0x104
> syscall() at 0xffffffff80649bfc = syscall+0x4c
> Xfast_syscall() at 0xffffffff806335f2 = Xfast_syscall+0xe2
> --- syscall (1, FreeBSD ELF64, sys_exit), rip = 0x300a2df9c, rsp =
> 0x7ffffffd40c8, rbp = 0x7ffffffd40e0 ---
> Uptime: 7m19s
> 
> 
> The root cause is that vm_pageout_oom() acquires a reference on a
> process's vmspace while holding its PROC_LOCK(), then the process
> exited.  This left vm_pageout_oom() holding the only reference on the
> vmspace, so when it dropped the reference it called into
> vm_map_remove() and wound up sleeping while still holding the
> PROC_LOCK().  This was under FreeBSD 8 but the code in head does not
> seem to have changed here.
Well, the root cause is that the vmspace reference is dropped while owning
the process lock (a mutex).

> 
> I'm not quite familiar with the lock mechanisms here so I'm not sure
> how to fix it.  Does vm_pageout_oom() need to _PHOLD() the process
> while holding the PROC_LOCK(), then drop the lock, then acquire the
> vmspace reference?  It appears that's how other places that call
> vmspace_acquire_ref() work.

Yes, I think it is enough to keep a hold ref on the big process instead
of keeping it locked.  This should also allow to change trylock for
the next iteration process to plain lock.  On the other hand, it seems
reasonable to keep trylock for vm_map locking, since in OOM situation
usually some processes are stuck waiting for page while maps are locked.

Holding the process lock for bigproc prevents not only exit, but also
execve from executing while oom loop selected the victim.  This makes
it possible for a race where large process, selected for oom kill,
performs exec meantime and becoming small process, and then being killed
at the end of oom loop.  I think it is acceptable.

diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ca9d7f9..d9f28c3 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1516,8 +1516,8 @@ vm_pageout_oom(int shortage)
 	FOREACH_PROC_IN_SYSTEM(p) {
 		int breakout;
 
-		if (PROC_TRYLOCK(p) == 0)
-			continue;
+		PROC_LOCK(p);
+
 		/*
 		 * If this is a system, protected or killed process, skip it.
 		 */
@@ -1557,11 +1557,14 @@ vm_pageout_oom(int shortage)
 			PROC_UNLOCK(p);
 			continue;
 		}
+		_PHOLD(p);
 		if (!vm_map_trylock_read(&vm->vm_map)) {
-			vmspace_free(vm);
+			_PRELE(p);
 			PROC_UNLOCK(p);
+			vmspace_free(vm);
 			continue;
 		}
+		PROC_UNLOCK(p);
 		size = vmspace_swap_count(vm);
 		vm_map_unlock_read(&vm->vm_map);
 		if (shortage == VM_OOM_MEM)
@@ -1573,16 +1576,19 @@ vm_pageout_oom(int shortage)
 		 */
 		if (size > bigsize) {
 			if (bigproc != NULL)
-				PROC_UNLOCK(bigproc);
+				PRELE(bigproc);
 			bigproc = p;
 			bigsize = size;
-		} else
-			PROC_UNLOCK(p);
+		} else {
+			PRELE(p);
+		}
 	}
 	sx_sunlock(&allproc_lock);
 	if (bigproc != NULL) {
+		PROC_LOCK(bigproc);
 		killproc(bigproc, "out of swap space");
 		sched_nice(bigproc, PRIO_MIN);
+		_PRELE(bigproc);
 		PROC_UNLOCK(bigproc);
 		wakeup(&vm_cnt.v_free_count);
 	}



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20150120083212.GC42409>