From owner-freebsd-current@FreeBSD.ORG Sun May 26 19:28:19 2013
From: Roger Pau Monné <roger.pau@citrix.com>
Date: Sun, 26 May 2013 21:28:05 +0200
To: John Baldwin
Cc: Konstantin Belousov, freebsd-current@freebsd.org, current@freebsd.org
Subject: Re: FreeBSD-HEAD gets stuck on vnode operations
Message-ID: <51A26245.9060707@citrix.com>
In-Reply-To: <51A0FA43.2040503@citrix.com>

On 25/05/13 19:52, Roger Pau Monné wrote:
> On 20/05/13 20:34, John Baldwin wrote:
>> On Tuesday, May 14, 2013 1:15:47 pm Roger Pau Monné wrote:
>>> On 14/05/13 18:31, Konstantin Belousov wrote:
>>>> On Tue, May 14, 2013 at 06:08:45PM +0200, Roger Pau Monné wrote:
>>>>> On 13/05/13 17:00, Konstantin Belousov wrote:
>>>>>> On Mon, May 13, 2013 at 04:33:04PM +0200, Roger Pau Monné wrote:
>>>>>>> On 13/05/13 13:18, Roger Pau Monné wrote:
>>>>>
>>>>> Thanks for taking a look.
>>>>>
>>>>>>> I would like to explain this a little bit more: the syncer process
>>>>>>> doesn't get blocked on the _mtx_trylock_flags_ call, it just keeps
>>>>>>> looping in what seems to be an endless loop around
>>>>>>> mnt_vnode_next_active/ffs_sync. Also, while in this state there is no
>>>>>>> noticeable disk activity, so I'm unsure what is happening.
>>>>>>
>>>>>> How many CPUs does your VM have?
>>>>>
>>>>> 7 vCPUs, but I've also seen this issue with 4 and 16 vCPUs.
>>>>>
>>>>>> The loop you are describing means that another thread owns the vnode
>>>>>> interlock. Can you track what that thread does? E.g. look at
>>>>>> vp->v_interlock.mtx_lock, which is basically a pointer to the struct
>>>>>> thread owning the mutex (clear the low bits as needed). Then you can
>>>>>> inspect the thread and get a backtrace.
>>>>>
>>>>> There are no other threads running, only the syncer is running on CPU 1
>>>>> (see the ps output in the previous email).
>>>>> All other CPUs are idle, and as seen from the ps quite a lot of threads
>>>>> are blocked in vnode related operations, either "*Name Cac", "*vnode_fr"
>>>>> or "*vnode in". I've also attached the output of alllocks in the
>>>>> previous email.
>>>>
>>>> This is not useful. You need to look at the mutex which fails the
>>>> trylock operation in mnt_vnode_next_active(), see who owns it, and then
>>>> 'unwind' the locking dependencies from there.
>>>
>>> Sorry, now I get it. Let's see if I can find the locked vnodes and the
>>> thread that owns them...
>>
>> You can use 'show lock <address of v_interlock>' to find the owning
>> thread and then use 'show sleepchain <thread>'. If you are using kgdb on
>> the live system (probably easier) then you can grab my scripts at
>> www.freebsd.org/~jhb/gdb/ (do 'cd /path/to/scripts; source gdb6'). You
>> can then find the offending thread and do 'mtx_owner &vp->v_interlock'
>> and then 'sleepchain <thread>'.

Hello,

I've been looking into this issue a little bit more, and the lock
dependencies look right to me. The lockup happens when the thread owning
the v_interlock mutex tries to acquire the vnode_free_list_mtx mutex,
which is already owned by the syncer thread. At this point the thread
owning the v_interlock mutex goes to sleep, and the syncer process starts
doing the following sequence over and over:

VI_TRYLOCK -> mtx_unlock vnode_free_list_mtx -> kern_yield ->
mtx_lock vnode_free_list_mtx -> ...

It seems like kern_yield, which I assume is placed there in order to give
the thread owning v_interlock a chance to also lock vnode_free_list_mtx,
doesn't open a window big enough to wake up the waiting thread and let it
take the vnode_free_list_mtx mutex. Since the syncer is the only process
runnable on that CPU there is no context switch, and the syncer just
keeps running.

Relying on kern_yield to provide a window big enough for any other thread
waiting on vnode_free_list_mtx to run doesn't seem like a good idea on
SMP systems. I haven't tested this on bare metal, but waking up an idle
CPU in a virtualized environment might be more expensive than doing it on
bare metal.

Bear in mind that I'm not familiar with either the scheduler or the UFS
code. My proposed naive fix is to replace the kern_yield call with a
pause, which will allow any other thread waiting on vnode_free_list_mtx
to take the vnode_free_list_mtx mutex, finish whatever it is doing and
release the v_interlock mutex, so the syncer thread can also finish its
work. I've tested the patch for a couple of hours and it seems to be
fine; I haven't been able to reproduce the issue anymore.
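To make the interaction easier to follow, here is a minimal userland model
of the two threads involved. This is not the kernel code: free_list_mtx,
interlock, vnode_user, syncer and the staging sleeps are just names I made
up to stand in for vnode_free_list_mtx, vp->v_interlock, the blocked
thread and the syncer, and sched_yield stands in for kern_yield(PRI_USER).

/*
 * Userland model of the lockup described above.  The sleeps only stage
 * the initial state: vnode_user owns the interlock and blocks on the
 * free list mutex, while syncer owns the free list mutex and loops on a
 * trylock of the interlock, dropping and retaking the free list mutex
 * around a yield after every failed attempt.
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t free_list_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t interlock = PTHREAD_MUTEX_INITIALIZER;

static void *
vnode_user(void *arg)
{

	pthread_mutex_lock(&interlock);		/* owns "v_interlock" */
	usleep(1000);				/* let the syncer take free_list_mtx */
	pthread_mutex_lock(&free_list_mtx);	/* blocks behind the syncer */
	pthread_mutex_unlock(&free_list_mtx);
	pthread_mutex_unlock(&interlock);	/* only now can the trylock succeed */
	return (NULL);
}

static void *
syncer(void *arg)
{

	pthread_mutex_lock(&free_list_mtx);
	usleep(2000);		/* let the other thread block on free_list_mtx */
	for (;;) {
		if (pthread_mutex_trylock(&interlock) == 0)	/* VI_TRYLOCK */
			break;
		pthread_mutex_unlock(&free_list_mtx);
		sched_yield();				/* kern_yield(PRI_USER) */
		pthread_mutex_lock(&free_list_mtx);
	}
	pthread_mutex_unlock(&interlock);
	pthread_mutex_unlock(&free_list_mtx);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, vnode_user, NULL);
	pthread_create(&t2, NULL, syncer, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("resolved: the yield opened a big enough window\n");
	return (0);
}

Whether the yield actually gives vnode_user a chance to take free_list_mtx
before the syncer retakes it is entirely up to the scheduler; when the
syncer is the only runnable thread on its CPU that window may never show
up, which is exactly the situation described above.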
Attached: 0001-mnt_vnode_next_active-replace-kern_yield-with-pause.patch

>From fec90f7bb9cdf05b49d11dbe4930d3c595c147f5 Mon Sep 17 00:00:00 2001
From: Roger Pau Monne
Date: Sun, 26 May 2013 19:55:43 +0200
Subject: [PATCH] mnt_vnode_next_active: replace kern_yield with pause

On SMP systems there is no way to assure that a kern_yield will allow
any other thread waiting on the vnode_free_list_mtx to acquire it. The
syncer process can get stuck in a loop trying to lock the v_interlock
mutex, without allowing other threads waiting on vnode_free_list_mtx to
run. Replace the kern_yield with a pause, which should allow any thread
owning v_interlock and waiting on vnode_free_list_mtx to finish its work
and release v_interlock.
---
 sys/kern/vfs_subr.c | 10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 0da6764..597f4b7 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -4703,7 +4703,15 @@ restart:
 			if (mp_ncpus == 1 || should_yield()) {
 				TAILQ_INSERT_BEFORE(vp, *mvp, v_actfreelist);
 				mtx_unlock(&vnode_free_list_mtx);
-				kern_yield(PRI_USER);
+				/*
+				 * There is another thread owning the
+				 * v_interlock mutex and possibly waiting on
+				 * vnode_free_list_mtx, so pause in order for
+				 * it to acquire the vnode_free_list_mtx
+				 * mutex and finish the work, releasing
+				 * v_interlock when finished.
+				 */
+				pause("vi_lock", 1);
 				mtx_lock(&vnode_free_list_mtx);
 				goto restart;
 			}
--
1.7.7.5 (Apple Git-26)