Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 27 May 2013 12:35:16 +0200
From:      =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-current@freebsd.org, "current@freebsd.org" <current@freebsd.org>, Jilles Tjoelker <jilles@stack.nl>
Subject:   Re: FreeBSD-HEAD gets stuck on vnode operations
Message-ID:  <51A336E4.9050200@citrix.com>
In-Reply-To: <20130527102306.GB3047@kib.kiev.ua>
References:  <20130514163149.GS3047@kib.kiev.ua> <51927143.4080102@citrix.com> <201305201434.55406.jhb@freebsd.org> <51A0FA43.2040503@citrix.com> <51A26245.9060707@citrix.com> <20130526202058.GA40375@stack.nl> <51A275F7.9030401@citrix.com> <20130526222254.GB40375@stack.nl> <20130527060727.GW3047@kib.kiev.ua> <51A31727.9040909@citrix.com> <20130527102306.GB3047@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On 27/05/13 12:23, Konstantin Belousov wrote:
> On Mon, May 27, 2013 at 10:19:51AM +0200, Roger Pau Monn? wrote:
>> On 27/05/13 08:07, Konstantin Belousov wrote:
>>> On Mon, May 27, 2013 at 12:22:54AM +0200, Jilles Tjoelker wrote:
>>>> On Sun, May 26, 2013 at 10:52:07PM +0200, Roger Pau Monn? wrote:
>>>>> On 26/05/13 22:20, Jilles Tjoelker wrote:
>>>>>> Instead of a pause() that may be too short or too long, how about
>>>>>> waiting for the necessary lock? In other words, replace the kern_yield()
>>>>>> call with VI_LOCK(vp); VI_UNLOCK(vp);. This is also the usual approach
>>>>>> to acquire two locks without imposing an order between them.
>>>>
>>>>> Since there might be more than one locked vnode, waiting on a specific
>>>>> locked vnode seemed rather arbitrary, but I agree that the pause is also
>>>>> rather arbitrary.
>>>>
>>>>> Also, can we be sure that the v_interlock mutex will not be destroyed
>>>>> while the syncer process is waiting for it to be unlocked?
>>>>
>>>> I think this is a major problem. My idea was too easy and will not work.
>>>>
>>>> That said, the code in mnt_vnode_next_active() appears to implement some
>>>> sort of adaptive spinning for SMP. It tries VI_TRYLOCK for 200ms
>>>> (default value of hogticks) and then yields. This is far too long for a
>>>> mutex lock and if it takes that long it means that either the thread
>>>> owning the lock is blocked by us somehow or someone is abusing a mutex
>>>> to work like a sleepable lock such as by spinning or DELAY.
>>>>
>>>> Given that it has been spinning for 200ms, it is not so bad to pause for
>>>> one additional microsecond.
>>>>
>>>> The adaptive spinning was added fairly recently, so apparently it
>>>> happens fairly frequently that VI_TRYLOCK fails transiently.
>>>> Unfortunately, the real adaptive spinning code cannot be used because it
>>>> will spin forever as long as the thread owning v_interlock is running,
>>>> including when that is because it is spinning for vnode_free_list_mtx.
>>>> Perhaps we can try to VI_TRYLOCK a certain number of times. It is also
>>>> possible to check the contested bit of vnode_free_list_mtx
>>>> (sys/netgraph/netflow/netflow.c does something similar) and stop
>>>> spinning in that case.
>>>>
>>>> A cpu_spinwait() invocation should also be added to the spin loop.
>>>
>>> There are two 'proper' solutions for this issue:
>>>
>>> One is to change the handling of the vnode lifecycle to allow the
>>> safe block for the vnode interlock acquisition. In particular, the
>>> change would add some stability of the vnode memory when vnode is
>>> put on the free list. As example, the vnode zone could be marked as
>>> type-stable again, and then the vnode interlock can be obtained with
>>> dropped free list lock. Arguably, marking the zone as non-freeable would
>>> be a regression, esp. for the zone which accounts for largest allocation
>>> on the kernel memory.
>>>
>>> Another one is to somehow ensure that the priority is properly
>>> propagated from the spinning thread to the vnode interlock owner.
>>> I think that it is desirable to donate some amount of priority
>>> from the spinning thread.  Unfortunately, I was unable to come
>>> up with elegant solution for this which would be also contained
>>> and did not require rewamp of the mutex interfaces.
>>>
>>> BTW, if anybody come up with the idea of the restructuring the free list
>>> handling to avoid the free list/vnode interlock LOR altogether, it would
>>> be the best.
>>>
>>> I do not have objections against the pause() addition, but I would
>>> argue that should_yield() should be removed then, switching the code to
>>> unconditionally pause when the collision detected.
>>
>> Taking the idea from Jilles, what about replacing should_yield with a
>> check to see if the vnode_free_list_mtx mutex is contented?
>>
>> That would prevent us from doing unnecessary pauses, and only releasing
>> the vnode_free_list_mtx mutex when there's someone else that actually
>> needs it.
> This would still be racy, and possibly allows the lock convoy in the same
> manner as the current code.  Also, AFAIR, the real problem was when
> two iterators start synchronized, it usually ended in live-lock.
> 
> If you are willing, try this, of course, but I tend to agree with just
> an addition of pause() for now.

OK, I've tested replacing kern_yield with a pause the whole night and
that seems to be working as expected, I did no longer see any lockups,
usually during a whole night run I saw at least 3 or 4 lockups.

If you are happy (and others) with the replacement you can commit it and
we can replace the should_yield call later if needed.

Thanks, Roger.




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?51A336E4.9050200>