From owner-freebsd-fs@freebsd.org Thu May 26 15:58:51 2016
Date: Thu, 26 May 2016 18:58:44 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: lists@yamagi.org
Cc: freebsd-fs@freebsd.org
Subject: Re: LOR between allproc <-> ufs
Message-ID: <20160526155844.GH38613@kib.kiev.ua>
References: <20160526160902.bbe4c36ad340f11f69f7ba08@yamagi.org>
In-Reply-To: <20160526160902.bbe4c36ad340f11f69f7ba08@yamagi.org>
User-Agent: Mutt/1.6.1 (2016-04-27)

On Thu, May 26, 2016 at 04:13:43PM +0200, lists@yamagi.org wrote:
> Hello,
> having experienced several hard 'freezes' (the system is still
> running, but all file system activity just hangs) on a FreeBSD
> 10.3-RELEASE machine, I've built a kernel with full debugging.
> Besides the well-known bufwait <-> dirhash LOR I've seen this one
> between allproc <-> ufs.
> Sadly the output is somewhat garbled; messages from LSI's upstream
> mps(4) driver interfered:
>
> May 26 15:24:26 grobi kernel: 1st 0xffffffff81885b28 allproc (allproc) @ /usr/src/sys/vm/vm_meter.c:122
> May 26 15:24:26 grobi kernel: 2nd 0xfffff8022c60b5f0 ufs (ufs) @ /usr/src/sys/kern/vfs_subr.c:2364
> May 26 15:24:26 grobi kernel: KDB: stack backtrace:
> May 26 15:24:26 grobi kernel: xpt_release_simq: requested 1 > present 0
> [the xpt_release_simq line repeats many times, interleaved with the
> backtrace; the trace below is de-interleaved]
> May 26 15:24:26 grobi kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe104432b2d0
> May 26 15:24:26 grobi kernel: kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe104432b380
> May 26 15:24:26 grobi kernel: witness_checkorder() at witness_checkorder+0xe24/frame 0xfffffe104432b410
> May 26 15:24:26 grobi kernel: __lockmgr_args() at __lockmgr_args+0x9ea/frame 0xfffffe104432b550
> May 26 15:24:26 grobi kernel: ffs_lock() at ffs_lock+0x84/frame 0xfffffe104432b5a0
> May 26 15:24:26 grobi kernel: VOP_LOCK1_APV() at VOP_LOCK1_APV+0xfc/frame 0xfffffe104432b5d0
> May 26 15:24:26 grobi kernel: _vn_lock() at _vn_lock+0xaa/frame 0xfffffe104432b640
> May 26 15:24:26 grobi kernel: vputx() at vputx+0x232/frame 0xfffffe104432b6a0
> May 26 15:24:26 grobi kernel: vm_object_deallocate() at vm_object_deallocate+0x5fb/frame 0xfffffe104432b700
> May 26 15:24:26 grobi kernel: vm_map_process_deferred() at vm_map_process_deferred+0x89/frame 0xfffffe104432b730
> May 26 15:24:26 grobi kernel: vm_map_remove() at vm_map_remove+0xc8/frame 0xfffffe104432b760
> May 26 15:24:26 grobi kernel: vmspace_free() at vmspace_free+0x51/frame 0xfffffe104432b780
> May 26 15:24:26 grobi kernel: vmtotal() at vmtotal+0x2a5/frame 0xfffffe104432b800
> May 26 15:24:26 grobi kernel: sysctl_root() at sysctl_root+0x1d2/frame 0xfffffe104432b850
> May 26 15:24:26 grobi kernel: userland_sysctl() at userland_sysctl+0x192/frame 0xfffffe104432b8f0
> May 26 15:24:26 grobi kernel: sys___sysctl() at sys___sysctl+0x74/frame 0xfffffe104432b9a0
> May 26 15:24:26 grobi kernel: amd64_syscall() at amd64_syscall+0x2c4/frame 0xfffffe104432bab0
> May 26 15:24:26 grobi kernel: Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe104432bab0
> May 26 15:24:26 grobi kernel: --- syscall (202, FreeBSD ELF64, sys___sysctl), rip = 0x802ef665a, rsp = 0x7fffffffe758, rbp = 0x7fffffffe790 ---
>
> Can anybody tell me if this is a known false positive? So far I
> was unable to reproduce it. If I manage to do so I'll send a
> better stack trace.

I do not think this is a false positive; the order taken there is indeed
allproc_lock -> vnode lock.  The reversed order can occur in, say, a
call to mountcheckdirs() from unmount, where we own the mp root vnode
lock and then iterate over all processes in the system.

I do not see an easy (or any) way to avoid that lock order on umount.
IMO we should instead avoid freeing a reference on a vmspace while
owning allproc_lock, and also avoid unlocking a user map under it, since
the unlock may free deferred map entries (which in turn locks vnodes).

I do not think that this LOR can be reproduced at will.  First the order
has to be recorded by witness, e.g. by unmounting some filesystem that
is mounted on a directory on a UFS mount.  Then it must happen that the
reference dropped under allproc_lock was the last reference on the
vmspace, or that there was a deferred vnode entry.
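The shape of the change is the same at every site the patch touches:
hold the process, drop allproc_lock, do the vmspace work, then reacquire
the lock and release the hold.  A condensed sketch (the wrapper function
is made up for illustration only; the primitives, flags, and ordering
are the ones actually used in the patch below):

static void
iterate_vmspaces_sketch(void)
{
	struct proc *p;
	struct vmspace *vm;

	sx_slock(&allproc_lock);
	FOREACH_PROC_IN_SYSTEM(p) {
		PROC_LOCK(p);
		if (p->p_state == PRS_NEW ||
		    (p->p_flag & (P_SYSTEM | P_WEXIT)) != 0) {
			PROC_UNLOCK(p);
			continue;
		}
		/*
		 * Hold the process: a non-zero p_lock keeps it from
		 * being reaped, so its p_list linkage stays valid
		 * while allproc_lock is dropped.
		 */
		p->p_lock++;
		PROC_UNLOCK(p);
		sx_sunlock(&allproc_lock);

		vm = vmspace_acquire_ref(p);
		if (vm != NULL) {
			/* ... inspect or modify the map here ... */
			vmspace_free(vm);	/* may lock vnodes, safe now */
		}

		sx_slock(&allproc_lock);	/* reacquire to continue */
		PRELE(p);			/* drop the hold */
	}
	sx_sunlock(&allproc_lock);
}

Because the hold pins the process, the iteration can continue from p
after allproc_lock is re-taken, and vmspace_free() never runs with the
allproc_lock -> vnode lock order recorded.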
Completely untested patch is below.

diff --git a/sys/vm/vm_glue.c b/sys/vm/vm_glue.c
index 83dc01a..c815c33 100644
--- a/sys/vm/vm_glue.c
+++ b/sys/vm/vm_glue.c
@@ -863,22 +863,32 @@ retry:
 		struct vmspace *vm;
 		int minslptime = 100000;
 		int slptime;
-
+
+		PROC_LOCK(p);
 		/*
 		 * Watch out for a process in
 		 * creation.  It may have no
 		 * address space or lock yet.
 		 */
-		if (p->p_state == PRS_NEW)
+		if (p->p_state == PRS_NEW) {
+			PROC_UNLOCK(p);
 			continue;
+		}
 		/*
 		 * An aio daemon switches its
 		 * address space while running.
 		 * Perform a quick check whether
 		 * a process has P_SYSTEM.
+		 * Filter out exiting processes.
 		 */
-		if ((p->p_flag & P_SYSTEM) != 0)
+		if ((p->p_flag & (P_SYSTEM | P_WEXIT)) != 0) {
+			PROC_UNLOCK(p);
 			continue;
+		}
+		p->p_lock++;
+		PROC_UNLOCK(p);
+		sx_sunlock(&allproc_lock);
+
 		/*
 		 * Do not swapout a process that
 		 * is waiting for VM data
@@ -893,16 +903,15 @@ retry:
 		 */
 		vm = vmspace_acquire_ref(p);
 		if (vm == NULL)
-			continue;
+			goto nextproc2;
 		if (!vm_map_trylock(&vm->vm_map))
 			goto nextproc1;
 
 		PROC_LOCK(p);
-		if (p->p_lock != 0 ||
-		    (p->p_flag & (P_STOPPED_SINGLE|P_TRACED|P_SYSTEM|P_WEXIT)
-		    ) != 0) {
+		if (p->p_lock != 1 || (p->p_flag & (P_STOPPED_SINGLE |
+		    P_TRACED | P_SYSTEM)) != 0)
 			goto nextproc;
-		}
+
 		/*
 		 * only aiod changes vmspace, however it will be
 		 * skipped because of the if statement above checking
@@ -977,12 +986,13 @@ retry:
 		if ((action & VM_SWAP_NORMAL) ||
 			((action & VM_SWAP_IDLE) &&
 			 (minslptime > swap_idle_threshold2))) {
+			_PRELE(p);
 			if (swapout(p) == 0)
 				didswap++;
 			PROC_UNLOCK(p);
+			sx_sunlock(&allproc_lock);
 			vm_map_unlock(&vm->vm_map);
 			vmspace_free(vm);
-			sx_sunlock(&allproc_lock);
 			goto retry;
 		}
 	}
@@ -991,7 +1001,9 @@ nextproc:
 		vm_map_unlock(&vm->vm_map);
 nextproc1:
 		vmspace_free(vm);
-		continue;
+nextproc2:
+		sx_slock(&allproc_lock);
+		PRELE(p);
 	}
 	sx_sunlock(&allproc_lock);
 	/*
diff --git a/sys/vm/vm_meter.c b/sys/vm/vm_meter.c
index f1573a3..f1d8c6f 100644
--- a/sys/vm/vm_meter.c
+++ b/sys/vm/vm_meter.c
@@ -153,14 +153,20 @@ vmtotal(SYSCTL_HANDLER_ARGS)
 				thread_unlock(td);
 			}
 		}
+		if ((p->p_flag & P_WEXIT) != 0)
+			continue;
+		p->p_lock++;	/* avoid _PHOLD, no need for faultin() */
 		PROC_UNLOCK(p);
 		/*
 		 * Note active objects.
 		 */
 		paging = 0;
 		vm = vmspace_acquire_ref(p);
-		if (vm == NULL)
+		if (vm == NULL) {
+			PRELE(p);
 			continue;
+		}
+		sx_sunlock(&allproc_lock);
 		map = &vm->vm_map;
 		vm_map_lock_read(map);
 		for (entry = map->header.next;
@@ -177,6 +183,8 @@ vmtotal(SYSCTL_HANDLER_ARGS)
 		vmspace_free(vm);
 		if (paging)
 			total.t_pw++;
+		sx_slock(&allproc_lock);
+		PRELE(p);
 	}
 	sx_sunlock(&allproc_lock);
 	/*
diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ffdcb61..29d4155 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1484,19 +1484,21 @@ vm_pageout_oom(int shortage)
 			PROC_UNLOCK(p);
 			continue;
 		}
-		_PHOLD(p);
+		p->p_lock++;
+		PROC_UNLOCK(p);
+		sx_sunlock(&allproc_lock);
 		if (!vm_map_trylock_read(&vm->vm_map)) {
-			_PRELE(p);
-			PROC_UNLOCK(p);
 			vmspace_free(vm);
+			sx_slock(&allproc_lock);
+			PRELE(p);
 			continue;
 		}
-		PROC_UNLOCK(p);
 		size = vmspace_swap_count(vm);
 		if (shortage == VM_OOM_MEM)
 			size += vm_pageout_oom_pagecount(vm);
 		vm_map_unlock_read(&vm->vm_map);
 		vmspace_free(vm);
+		sx_slock(&allproc_lock);
 
 		/*
 		 * If this process is bigger than the biggest one,
@@ -1794,9 +1796,13 @@ again:
 			if ((p->p_flag & P_INMEM) == 0)
 				limit = 0;	/* XXX */
 			vm = vmspace_acquire_ref(p);
+			p->p_lock++;
 			PROC_UNLOCK(p);
-			if (vm == NULL)
+			if (vm == NULL) {
+				PRELE(p);
 				continue;
+			}
+			sx_sunlock(&allproc_lock);
 
 			size = vmspace_resident_count(vm);
 			if (size >= limit) {
@@ -1841,6 +1847,8 @@ again:
 			}
 #endif
 			vmspace_free(vm);
+			sx_slock(&allproc_lock);
+			PRELE(p);
 		}
 		sx_sunlock(&allproc_lock);
 		if (tryagain != 0 && attempts <= 10)
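For completeness, the patch bumps p->p_lock directly instead of calling
_PHOLD() because the macro (approximately as defined in sys/proc.h on
stable/10, quoted from memory) also faults a swapped-out process back
in, which is pointless work for these consumers:

#define	_PHOLD(p) do {							\
	PROC_LOCK_ASSERT((p), MA_OWNED);				\
	KASSERT(!((p)->p_flag & P_WEXIT) || (p) == curproc,		\
	    ("PHOLD of exiting process"));				\
	(p)->p_lock++;							\
	if (((p)->p_flag & P_INMEM) == 0)				\
		faultin((p));						\
} while (0)

The explicit P_WEXIT checks added before each p->p_lock++ preserve the
invariant that the KASSERT() in _PHOLD() would otherwise enforce.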