Date: Tue, 20 Jan 2015 10:32:12 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Ryan Stone <rysto32@gmail.com>
Subject: Re: Sleeping thread held mutex in vm_pageout_oom()
Message-ID: <20150120083212.GC42409@kib.kiev.ua>
References: <CAFMmRNxz252HMWWBmRf=Z69zh2_w9cD5e1AZGeizyagKezm2Hw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAFMmRNxz252HMWWBmRf=Z69zh2_w9cD5e1AZGeizyagKezm2Hw@mail.gmail.com>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>

On Mon, Jan 19, 2015 at 05:56:24PM -0500, Ryan Stone wrote:
> I recently had a system where a DIMM failed and the OOM killer was
> constantly kicking in due to a memory-hungry daemon being constantly
> restarted.  This ended up triggering a race condition in the OOM
> killer leading to this panic:
> 
> Sleeping thread (tid 100075, pid 8) owns a non-sleepable lock
> sched_switch() at 0xffffffff8048386d = sched_switch+0x16d
> mi_switch() at 0xffffffff80469dd6 = mi_switch+0x186
> sleepq_wait() at 0xffffffff80499204 = sleepq_wait+0x44
> __lockmgr_args() at 0xffffffff8044b88b = __lockmgr_args+0x67b
> vop_stdlock() at 0xffffffff804d3689 = vop_stdlock+0x39
> VOP_LOCK1_APV() at 0xffffffff8069da42 = VOP_LOCK1_APV+0x52
> _vn_lock() at 0xffffffff804ed627 = _vn_lock+0x47
> vm_object_deallocate() at 0xffffffff8061eef3 = vm_object_deallocate+0x203
> vm_map_entry_deallocate() at 0xffffffff80616d2c = vm_map_entry_deallocate+0x4c
> vm_map_process_deferred() at 0xffffffff80616d62 = vm_map_process_deferred+0x32
> vm_map_remove() at 0xffffffff806183ff = vm_map_remove+0x6f
> vmspace_free() at 0xffffffff80619206 = vmspace_free+0x56
> vm_pageout_oom() at 0xffffffff806230d1 = vm_pageout_oom+0x181
> vm_pageout() at 0xffffffff8062410b = vm_pageout+0x90b
> fork_exit() at 0xffffffff8043a382 = fork_exit+0x112
> fork_trampoline() at 0xffffffff8063385e = fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffff80c3be1d00, rbp = 0 ---
> panic: sleeping thread
> cpuid = 5
> curthread = grep/grep (82989/100544)
> cpu_ticks = 1848294656444
> KDB: stack backtrace:
> db_trace_self_wrapper() at 0xffffffff801e52ba = db_trace_self_wrapper+0x2a
> panic() at 0xffffffff80461608 = panic+0x228
> propagate_priority() at 0xffffffff8049cbde = propagate_priority+0x15e
> turnstile_wait() at 0xffffffff8049d278 = turnstile_wait+0x1b8
> _mtx_lock_sleep() at 0xffffffff80451af1 = _mtx_lock_sleep+0xf1
> _mtx_lock_flags() at 0xffffffff80451c75 = _mtx_lock_flags+0x75
> exit1() at 0xffffffff804367de = exit1+0x36e
> sys_exit() at 0xffffffff8043731e = sys_exit+0xe
> syscallenter() at 0xffffffff8049b324 = syscallenter+0x104
> syscall() at 0xffffffff80649bfc = syscall+0x4c
> Xfast_syscall() at 0xffffffff806335f2 = Xfast_syscall+0xe2
> --- syscall (1, FreeBSD ELF64, sys_exit), rip = 0x300a2df9c, rsp = 0x7ffffffd40c8, rbp = 0x7ffffffd40e0 ---
> Uptime: 7m19s
> 
> 
> The root cause is that vm_pageout_oom() acquires a reference on a
> process's vmspace while holding its PROC_LOCK(), then the process
> exited.  This left vm_pageout_oom() holding the only reference on the
> vmspace, so when it dropped the reference it called into
> vm_map_remove() and wound up sleeping while still holding the
> PROC_LOCK().  This was under FreeBSD 8 but the code in head does not
> seem to have changed here.
Well, the root cause is that the vmspace reference is dropped while owning
the process lock (a mutex).

> 
> I'm not quite familiar with the lock mechanisms here so I'm not sure
> how to fix it.  Does vm_pageout_oom() need to _PHOLD() the process
> while holding the PROC_LOCK(), then drop the lock, then acquire the
> vmspace reference?  It appears that's how other places that call
> vmspace_acquire_ref() work.

Yes, I think it is enough to keep a hold reference on the biggest process
instead of keeping it locked.  This should also allow changing the trylock
on the process in the next iteration to a plain lock.  On the other hand,
it seems reasonable to keep the trylock for vm_map locking, since in an OOM
situation some processes are usually stuck waiting for a page while their
maps are locked.
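
To make the idea concrete, here is a rough single-threaded userland model
of the hold-instead-of-lock pattern.  It is only a sketch: struct model_proc
and every model_* function are hypothetical stand-ins for struct proc,
PROC_LOCK()/PROC_UNLOCK(), _PHOLD(), _PRELE() and PRELE(), and none of it is
kernel code.  It shows that the sleepable step runs with the lock dropped
and that only the chosen victim keeps its hold when the loop ends.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct proc; not the kernel structure. */
struct model_proc {
	bool locked;	/* models ownership of the process mutex */
	int hold;	/* models the hold count taken by _PHOLD() */
	long size;	/* models the memory usage computed by the loop */
};

static void model_lock(struct model_proc *p)
{
	assert(!p->locked);
	p->locked = true;
}

static void model_unlock(struct model_proc *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Like _PHOLD(): the caller must already own the lock. */
static void model_hold_locked(struct model_proc *p)
{
	assert(p->locked);
	p->hold++;
}

/* Like _PRELE(): the caller must already own the lock. */
static void model_rele_locked(struct model_proc *p)
{
	assert(p->locked && p->hold > 0);
	p->hold--;
}

/* Like PRELE(): takes and drops the lock around the release. */
static void model_rele(struct model_proc *p)
{
	model_lock(p);
	model_rele_locked(p);
	model_unlock(p);
}

/* Stand-in for work that may sleep; it must never run with the lock owned. */
static void model_sleepable_work(struct model_proc *p)
{
	assert(!p->locked);
}

/*
 * Return the index of the largest process, left held but unlocked,
 * or -1 if no process has a size above zero.
 */
static int model_pick_victim(struct model_proc *procs, int n)
{
	int big = -1;
	long bigsize = 0;

	for (int i = 0; i < n; i++) {
		struct model_proc *p = &procs[i];

		model_lock(p);
		model_hold_locked(p);	/* pin the process ... */
		model_unlock(p);	/* ... then drop the mutex */
		model_sleepable_work(p);
		if (p->size > bigsize) {
			if (big != -1)
				model_rele(&procs[big]);
			big = i;
			bigsize = p->size;
		} else {
			model_rele(p);
		}
	}
	return (big);
}
```

A caller would then lock the returned entry, act on it, and drop the hold
with model_rele_locked() before unlocking.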

Holding the process lock for bigproc prevented not only exit but also
execve from running while the oom loop selected the victim.  With only a
hold reference, a race becomes possible where a large process, selected
for the oom kill, performs an exec in the meantime, becomes a small
process, and is then killed at the end of the oom loop.  I think this is
acceptable.

diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ca9d7f9..d9f28c3 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1516,8 +1516,8 @@ vm_pageout_oom(int shortage)
 	FOREACH_PROC_IN_SYSTEM(p) {
 		int breakout;
 
-		if (PROC_TRYLOCK(p) == 0)
-			continue;
+		PROC_LOCK(p);
+
 		/*
 		 * If this is a system, protected or killed process, skip it.
 		 */
@@ -1557,11 +1557,14 @@ vm_pageout_oom(int shortage)
 			PROC_UNLOCK(p);
 			continue;
 		}
+		_PHOLD(p);
 		if (!vm_map_trylock_read(&vm->vm_map)) {
-			vmspace_free(vm);
+			_PRELE(p);
 			PROC_UNLOCK(p);
+			vmspace_free(vm);
 			continue;
 		}
+		PROC_UNLOCK(p);
 		size = vmspace_swap_count(vm);
 		vm_map_unlock_read(&vm->vm_map);
 		if (shortage == VM_OOM_MEM)
@@ -1573,16 +1576,19 @@ vm_pageout_oom(int shortage)
 		 */
 		if (size > bigsize) {
 			if (bigproc != NULL)
-				PROC_UNLOCK(bigproc);
+				PRELE(bigproc);
 			bigproc = p;
 			bigsize = size;
-		} else
-			PROC_UNLOCK(p);
+		} else {
+			PRELE(p);
+		}
 	}
 	sx_sunlock(&allproc_lock);
 	if (bigproc != NULL) {
+		PROC_LOCK(bigproc);
 		killproc(bigproc, "out of swap space");
 		sched_nice(bigproc, PRIO_MIN);
+		_PRELE(bigproc);
 		PROC_UNLOCK(bigproc);
 		wakeup(&vm_cnt.v_free_count);
 	}