From owner-freebsd-current@FreeBSD.ORG Sun Dec 6 20:55:55 2009
Date: Sun, 06 Dec 2009 22:55:04 +0200
From: Andriy Gapon <avg@icyb.net.ua>
To: Attilio Rao
Cc: freebsd-current@freebsd.org
Subject: Re: process stuck in stat/../cache_lookup: ktorrent, zfs
Message-ID: <4B1C1A28.6030909@icyb.net.ua>
In-Reply-To: <3bbf2fe10912061104j53ef5be2yb1019699308b0473@mail.gmail.com>
References: <4B1B9600.4080709@icyb.net.ua> <4B1BBEC4.7040906@icyb.net.ua>
 <3bbf2fe10912061104j53ef5be2yb1019699308b0473@mail.gmail.com>

on 06/12/2009 21:04 Attilio Rao said the following:
> 2009/12/6 Andriy Gapon <avg@icyb.net.ua>:
>> on 06/12/2009 13:31 Andriy Gapon said the following:
>>> The system is a recent 9-CURRENT, amd64.
>>> I see that ktorrent sometimes gets stuck during heavy downloads (multiple
>>> files in parallel, at high speed).  It is completely unresponsive and cannot
>>> be killed even with SIGKILL.
>> [snip]
>>> #0  sched_switch (td=0xffffff012a6c5700, newtd=0xffffff0001533380,
>>>     flags=Variable "flags" is not available.
>>> ) at /usr/src/sys/kern/sched_ule.c:1865
>>> #1  0xffffffff80374baf in mi_switch (flags=260, newtd=0x0)
>>>     at /usr/src/sys/kern/kern_synch.c:449
>>> #2  0xffffffff803a795b in sleepq_switch (wchan=Variable "wchan" is not available.
>>> ) at /usr/src/sys/kern/subr_sleepqueue.c:509
>>> #3  0xffffffff803a8645 in sleepq_wait (wchan=0xffffff0105b457f8, pri=80)
>>>     at /usr/src/sys/kern/subr_sleepqueue.c:588
>>> #4  0xffffffff80351184 in __lockmgr_args (lk=0xffffff0105b457f8, flags=2097408,
>>>     ilk=0xffffff0105b45820, wmesg=Variable "wmesg" is not available.
>>> ) at /usr/src/sys/kern/kern_lock.c:216
>>
>> So some more data:
>>
>> (kgdb) fr 4
>> #4  0xffffffff80351184 in __lockmgr_args (lk=0xffffff0105b457f8, flags=2097408,
>>     ilk=0xffffff0105b45820, wmesg=Variable "wmesg" is not available.
>> ) at /usr/src/sys/kern/kern_lock.c:216
>> 216             sleepq_wait(&lk->lock_object, pri);
>> (kgdb) p *lk
>> $8 = {lock_object = {lo_name = 0xffffffff80ad55b6 "zfs", lo_flags = 91947008,
>>     lo_data = 0, lo_witness = 0x0}, lk_lock = 3, lk_timo = 51, lk_pri = 80}
>> (kgdb) p/x flags
>> $9 = 0x200100
>> (kgdb) p/x lk->lock_object.lo_flags
>> $12 = 0x57b0000
>>
>> Apparently sleeplk() is inlined into __lockmgr_args().
>>
>> So it looks like this is a LK_SHARED|LK_INTERLOCK lockmgr call which did not
>> take any of the easy paths and ended up in sleepq_wait, but the wakeup never
>> comes for it; perhaps it was missed?
>
> I think 'missed wakeup' is too hasty (and probably wrong) a conclusion.
> The problem here is that the lock is held in shared mode (lk->lk_lock = 3),
> so you would need to know what happened to the owners once they got the
> lock.  Since these are shared acquisitions, though, the only way to track
> them is to reproduce the problem with WITNESS enabled.
> Once you have that data we can dig further.

Attilio,

no conclusions on my part so far, just guesses.  But what I think we are
seeing is that a shared lock request made it all the way to sleeplk(), and
that must mean the lock was originally held exclusively.  It is hard to see
how lk_lock could have ended up with both LK_SHARE and LK_SHARED_WAITERS set
in this scenario (the sketches after this message illustrate the decoding and
the decision involved).

I will try to reproduce this with a WITNESS kernel, but that will have to
wait until Tuesday or later.  I do hope it is reproducible with WITNESS.

-- 
Andriy Gapon
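
For reference, a minimal userland sketch that decodes the two values printed
by kgdb above.  The flag constants here are assumptions written from memory of
sys/sys/lockmgr.h on 9-CURRENT at the time, not copied from the tree, but the
decoding agrees with the LK_SHARED|LK_INTERLOCK and LK_SHARE|LK_SHARED_WAITERS
readings discussed in the message:

/*
 * Decode the kgdb output above.  The constant values are assumed, not
 * authoritative; only the bit names matter for the discussion.
 */
#include <stdio.h>

/* lockmgr() request flags (subset, assumed values). */
#define LK_INTERLOCK            0x000100
#define LK_SHARED               0x200000

/* lk_lock state bits (subset, assumed values). */
#define LK_SHARE                0x01    /* lock word is in shared mode */
#define LK_SHARED_WAITERS       0x02    /* threads sleep waiting for a share */

int
main(void)
{
        unsigned long flags = 0x200100;         /* from "p/x flags" */
        unsigned long lk_lock = 0x3;            /* from "p *lk" (lk_lock = 3) */

        printf("flags   = 0x%lx:%s%s\n", flags,
            (flags & LK_SHARED) ? " LK_SHARED" : "",
            (flags & LK_INTERLOCK) ? " LK_INTERLOCK" : "");
        printf("lk_lock = 0x%lx:%s%s\n", lk_lock,
            (lk_lock & LK_SHARE) ? " LK_SHARE" : "",
            (lk_lock & LK_SHARED_WAITERS) ? " LK_SHARED_WAITERS" : "");
        return (0);
}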
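
And a second, heavily simplified sketch of the decision the shared-acquisition
path makes.  can_grant_shared() only approximates the LK_CAN_SHARE() test in
kern_lock.c (it ignores, for example, the deadlock-avoidance special case and
the sharer count), so treat it as an illustration of the reasoning above, not
as the actual lockmgr code:

/*
 * Sketch only: a shared request is granted immediately when the lock word
 * is in shared mode and there are no exclusive waiters; otherwise the
 * thread sets LK_SHARED_WAITERS and sleeps until the exclusive holder's
 * release wakes it up.
 */
#include <stdio.h>

#define LK_SHARE                0x01
#define LK_SHARED_WAITERS       0x02
#define LK_EXCLUSIVE_WAITERS    0x04

static int
can_grant_shared(unsigned long x)
{
        return ((x & LK_SHARE) != 0 && (x & LK_EXCLUSIVE_WAITERS) == 0);
}

int
main(void)
{
        unsigned long x = 0x3;  /* the lk_lock value from the report */

        if (can_grant_shared(x))
                printf("0x%lx: a new shared request would be granted "
                    "on the fast path\n", x);
        else
                printf("0x%lx: a new shared request would set "
                    "LK_SHARED_WAITERS and sleep\n", x);
        return (0);
}

With the reported lk_lock value of 0x3 the sketch takes the fast path, which
is what makes the observed state look odd: the stuck thread must have entered
sleepq_wait() while the lock word did not allow sharing (i.e. while the lock
was held exclusively), yet the word now reads as shareable with
LK_SHARED_WAITERS still set.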