From: Konstantin Belousov
To: Ryan Stone
Cc: freebsd-hackers@freebsd.org
Date: Tue, 20 Jan 2015 10:32:12 +0200
Subject: Re: Sleeping thread held mutex in vm_pageout_oom()
Message-ID: <20150120083212.GC42409@kib.kiev.ua>
List-Id: Technical Discussions relating to FreeBSD

On Mon, Jan 19, 2015 at 05:56:24PM -0500, Ryan Stone wrote:
> I recently had a system where a DIMM failed and the OOM killer was
> constantly kicking in due to a memory-hungry daemon being constantly
> restarted. This ended up triggering a race condition in the OOM
> killer leading to this panic:
>
> Sleeping thread (tid 100075, pid 8) owns a non-sleepable lock
> sched_switch() at 0xffffffff8048386d = sched_switch+0x16d
> mi_switch() at 0xffffffff80469dd6 = mi_switch+0x186
> sleepq_wait() at 0xffffffff80499204 = sleepq_wait+0x44
> __lockmgr_args() at 0xffffffff8044b88b = __lockmgr_args+0x67b
> vop_stdlock() at 0xffffffff804d3689 = vop_stdlock+0x39
> ---Type to continue, or q to quit---
> VOP_LOCK1_APV() at 0xffffffff8069da42 = VOP_LOCK1_APV+0x52
> _vn_lock() at 0xffffffff804ed627 = _vn_lock+0x47
> vm_object_deallocate() at 0xffffffff8061eef3 = vm_object_deallocate+0x203
> vm_map_entry_deallocate() at 0xffffffff80616d2c = vm_map_entry_deallocate+0x4c
> vm_map_process_deferred() at 0xffffffff80616d62 = vm_map_process_deferred+0x32
> vm_map_remove() at 0xffffffff806183ff = vm_map_remove+0x6f
> vmspace_free() at 0xffffffff80619206 = vmspace_free+0x56
> vm_pageout_oom() at 0xffffffff806230d1 = vm_pageout_oom+0x181
> vm_pageout() at 0xffffffff8062410b = vm_pageout+0x90b
> fork_exit() at 0xffffffff8043a382 = fork_exit+0x112
> fork_trampoline() at 0xffffffff8063385e = fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffff80c3be1d00, rbp = 0 ---
> panic: sleeping thread
> cpuid = 5
> curthread = grep/grep (82989/100544)
> cpu_ticks = 1848294656444
> KDB: stack backtrace:
> db_trace_self_wrapper() at 0xffffffff801e52ba = db_trace_self_wrapper+0x2a
> panic() at 0xffffffff80461608 = panic+0x228
> propagate_priority() at 0xffffffff8049cbde = propagate_priority+0x15e
> turnstile_wait() at 0xffffffff8049d278 = turnstile_wait+0x1b8
> _mtx_lock_sleep() at 0xffffffff80451af1 = _mtx_lock_sleep+0xf1
> ---Type to continue, or q to quit---
> _mtx_lock_flags() at 0xffffffff80451c75 = _mtx_lock_flags+0x75
> exit1() at 0xffffffff804367de = exit1+0x36e
> sys_exit() at 0xffffffff8043731e = sys_exit+0xe
> syscallenter() at 0xffffffff8049b324 = syscallenter+0x104
> syscall() at 0xffffffff80649bfc = syscall+0x4c
> Xfast_syscall() at 0xffffffff806335f2 = Xfast_syscall+0xe2
> --- syscall (1, FreeBSD ELF64, sys_exit), rip = 0x300a2df9c, rsp =
> 0x7ffffffd40c8, rbp = 0x7ffffffd40e0 ---
> Uptime: 7m19s
>
>
> The root cause is that vm_pageout_oom() acquires a reference on a
> process's vmspace while holding its PROC_LOCK(), then the process
> exited. This left vm_pageout_oom() holding the only reference on the
> vmspace, so when it dropped the reference it called into
> vm_map_remove() and wound up sleeping while still holding the
> PROC_LOCK(). This was under FreeBSD 8 but the code in head does not
> seem to have changed here.

Well, the root cause is that the vmspace reference is dropped while
owning the process lock (a mutex).

>
> I'm not quite familiar with the lock mechanisms here so I'm not sure
> how to fix it. Does vm_pageout_oom() need to _PHOLD() the process
> while holding the PROC_LOCK(), then drop the lock, then acquire the
> vmspace reference? It appears that's how other places that call
> vmspace_acquire_ref() work.

Yes, I think it is enough to keep a hold reference on the big process
instead of keeping it locked. This should also allow changing the
trylock on the process examined in the next iteration to a plain lock.
On the other hand, it seems reasonable to keep the trylock for vm_map
locking, since in an OOM situation some processes are usually stuck
waiting for a page while their maps are locked.

Holding the process lock for bigproc prevents not only exit, but also
execve, from running while the OOM loop is selecting the victim.
Keeping only a hold reference makes a race possible: a large process
selected for the OOM kill may exec in the meantime, become a small
process, and still be killed at the end of the OOM loop.
I think it is acceptable.

diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ca9d7f9..d9f28c3 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1516,8 +1516,8 @@ vm_pageout_oom(int shortage)
 	FOREACH_PROC_IN_SYSTEM(p) {
 		int breakout;
 
-		if (PROC_TRYLOCK(p) == 0)
-			continue;
+		PROC_LOCK(p);
+
 		/*
 		 * If this is a system, protected or killed process, skip it.
 		 */
@@ -1557,11 +1557,14 @@ vm_pageout_oom(int shortage)
 			PROC_UNLOCK(p);
 			continue;
 		}
+		_PHOLD(p);
 		if (!vm_map_trylock_read(&vm->vm_map)) {
-			vmspace_free(vm);
+			_PRELE(p);
 			PROC_UNLOCK(p);
+			vmspace_free(vm);
 			continue;
 		}
+		PROC_UNLOCK(p);
 		size = vmspace_swap_count(vm);
 		vm_map_unlock_read(&vm->vm_map);
 		if (shortage == VM_OOM_MEM)
@@ -1573,16 +1576,19 @@ vm_pageout_oom(int shortage)
 		 */
 		if (size > bigsize) {
 			if (bigproc != NULL)
-				PROC_UNLOCK(bigproc);
+				PRELE(bigproc);
 			bigproc = p;
 			bigsize = size;
-		} else
-			PROC_UNLOCK(p);
+		} else {
+			PRELE(p);
+		}
 	}
 	sx_sunlock(&allproc_lock);
 	if (bigproc != NULL) {
+		PROC_LOCK(bigproc);
 		killproc(bigproc, "out of swap space");
 		sched_nice(bigproc, PRIO_MIN);
+		_PRELE(bigproc);
 		PROC_UNLOCK(bigproc);
 		wakeup(&vm_cnt.v_free_count);
 	}
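
For readers less familiar with the hold-instead-of-lock idea, here is a
rough user-space sketch of the same shape. It is not kernel code and not
part of the patch: struct obj, obj_init(), obj_hold_locked(),
obj_rele_locked(), obj_rele(), obj_measure() and pick_biggest() are all
hypothetical names, a pthread mutex stands in for the process lock, and a
plain counter stands in for the hold count. The point it tries to show is
only the ordering: take the hold under the mutex, drop the mutex, then do
the work that might sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct proc. */
struct obj {
	pthread_mutex_t lock;	/* plays the role of the process mutex */
	int holds;		/* plays the role of the hold count */
	int size;		/* the quantity the scan measures */
};

static void
obj_init(struct obj *o, int size)
{
	pthread_mutex_init(&o->lock, NULL);
	o->holds = 0;
	o->size = size;
}

/* Take a hold; the caller owns o->lock (loosely analogous to _PHOLD). */
static void
obj_hold_locked(struct obj *o)
{
	o->holds++;
}

/* Drop a hold; the caller owns o->lock (loosely analogous to _PRELE). */
static void
obj_rele_locked(struct obj *o)
{
	assert(o->holds > 0);
	o->holds--;
}

/* Drop a hold, taking the lock internally (loosely analogous to PRELE). */
static void
obj_rele(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	obj_rele_locked(o);
	pthread_mutex_unlock(&o->lock);
}

/* Stand-in for work that may sleep; it must run with no mutex held. */
static int
obj_measure(struct obj *o)
{
	return (o->size);
}

/* Return the largest object, still held; the caller drops the hold. */
static struct obj *
pick_biggest(struct obj *v, size_t n)
{
	struct obj *big = NULL;
	int bigsize = 0;

	for (size_t i = 0; i < n; i++) {
		struct obj *o = &v[i];

		pthread_mutex_lock(&o->lock);
		obj_hold_locked(o);		/* pin the object ... */
		pthread_mutex_unlock(&o->lock);	/* ... then drop the mutex */

		int size = obj_measure(o);	/* no mutex is held here */

		if (size > bigsize) {
			if (big != NULL)
				obj_rele(big);	/* old candidate is released */
			big = o;
			bigsize = size;
		} else {
			obj_rele(o);
		}
	}
	return (big);
}
```

A caller would take the winner's mutex again before acting on it and drop
the hold at the end, which mirrors the last hunk of the patch.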