From owner-freebsd-stable@FreeBSD.ORG Tue May 12 20:59:21 2009
From: pluknet
To: John Baldwin
Cc: freebsd-stable@freebsd.org
Date: Wed, 13 May 2009 00:59:19 +0400
Subject: Re: lock up in 6.2 (procs massively stuck in Giant)
In-Reply-To: <200905121014.55450.jhb@freebsd.org>
References: <200905110949.31142.jhb@freebsd.org> <200905121014.55450.jhb@freebsd.org>
2009/5/12 John Baldwin :
> On Tuesday 12 May 2009 2:12:27 am pluknet wrote:
>> 2009/5/11 John Baldwin :
>> > On Monday 04 May 2009 11:41:35 pm pluknet wrote:
>> >> 2009/5/1 John Baldwin :
>> >> > On Thursday 30 April 2009 2:36:34 am pluknet wrote:
>> >> >> Hi folks.
>> >> >>
>> >> >> Today I got a new locking issue.
>> >> >> This is the first time I have hit it, and it's hardly reproducible.
>> >> >>
>> >> >> The box has lost both remote connection and local access.
>> >> >> Not even any SIGINFO output on the local console.
>> >> >> Jumping into ddb> shows the following:
>> >> >>
>> >> >> 1) first, this is an 8-way web server. No processes on the runqueue except one httpd
>> >> >> (i.e. ps shows R in its state):
>> >> >
>> >> > You need to find who owns Giant and what that thread is doing. You can try
>> >> > using 'show lock Giant' as well as 'show lockchain 11568'.
>> >> >
>> >>
>> >> Hi, John!
>> >>
>> >> Just reproduced it now on another box.
>> >> Hmm.. the stack of the process owning Giant looks garbled.
>> >>
>> >> db> show lock Giant
>> >> class: sleep mutex
>> >> name: Giant
>> >> flags: {DEF, RECURSE}
>> >> state: {OWNED, CONTESTED}
>> >> owner: 0xd0d79320 (tid 102754, pid 34594, "httpd")
>> >>
>> >> db> show lockchain 34594
>> >> thread 102754 (pid 34594, httpd) running on CPU 7
>> >> db> show lockchain 102754
>> >> thread 102754 (pid 34594, httpd) running on CPU 7
>> >
>> > The thread is running, so we don't know what its top of stack is, and you
>> > can't get a good stack trace in that case.
>> >
>> > None of your CPUs are idle, so I don't think you have any sort of deadlock.
>> > You might have a livelock.
>> >
>> > --
>> > John Baldwin
>> >
>>
>> I'm curious if it could be caused by heavy load.
>> I don't know what it might be, definitely,
>> as it's non-trivial for me to determine the reason
>> for a livelock, and to debug it.
>>
>> So I think it may make sense to try 7.x, as much
>> locking work has been done there.
>
> It may be worth trying 7. Also, what is the state of the 'swi7: clock'
> process?
>
> --
> John Baldwin

Hi.

From yet another box (not one of the first two mentioned earlier) with a
similar locking issue, in case it still makes sense: the conditions here are
possibly a bit different. The clock proc here is on swi4, which I hope is an
unimportant difference.

   18     0     0     0  LL  *Giant 0xd0a6b140 [swi4: clock sio]

db> bt 18
Tracing pid 18 tid 100015 td 0xc7cfec80
sched_switch(c7cfec80,0,1) at sched_switch+0x143
mi_switch(1,0) at mi_switch+0x1ba
turnstile_wait(c0a06c60,cb77ee10) at turnstile_wait+0x2f7
_mtx_lock_sleep(c0a06c60,c7cfec80,0,0,0) at _mtx_lock_sleep+0xfc
softclock(0) at softclock+0x231
ithread_execute_handlers(c7d07218,c7d4a100) at ithread_execute_handlers+0x125
ithread_loop(c7cb69f0,e6892d38) at ithread_loop+0x55
fork_exit(c066d3e4,c7cb69f0,e6892d38) at fork_exit+0x71
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xe6892d6c, ebp = 0 ---

db> show lock Giant
class: sleep mutex
name: Giant
flags: {DEF, RECURSE}
state: {OWNED, CONTESTED}
owner: 0xcb77ee10 (tid 101174, pid 8611, "httpd")

db> show lockchain 101174
thread 101174 (pid 8611, httpd) running on CPU 4

db> bt 101174
Tracing pid 8611 tid 101174 td 0xcb77ee10
sched_switch(cb77ee10,c7f3de10,6) at sched_switch+0x143
mi_switch(ca6d82e8,6,c0a0baf0,ca6d82e8,c0a0a0b0,...) at mi_switch
kseq_move(c0a0baf0,6) at kseq_move+0xc1
sched_balance_pair(ef879bb0,ef879bb0,c08a2adf,cb77ef68,cb77b360,...) at sched_balance_pair+0x91
sched_lock(0,cbd1f658,0,cb77b36c,0,...) at sched_lock
_end(cb77b360,cb77b364,cb77ee10,cb77ee18,0,...) at 0xcb77b360
_end(d0a49a80,d0a49a84,c84cf7d0,c84cf7d8,0,...) at 0xc7f97648
_end(ca6dbcc0,ca6dbcc4,ca6d54b0,ca6d54b8,0,...) at 0xcbd1f648
_end(cbcad780,cbcad784,cc8a2190,cc8a2198,0,...) at 0xc8514430
_end(cab883c0,cab883c4,ca9417d0,ca9417d8,0,...) at 0xca6dc000
_end(cc67c4e0,cc67c4e4,cd6fd000,cd6fd008,0,...) at 0xcc8abc90
_end(cd3a9120,cd3a9124,cd3b1320,cd3b1328,0,...) at 0xcad68218
_end(cd130c60,cd130c64,d00ca320,d00ca328,0,...) at 0xca71e860
_end(cbcac240,cbcac244,cbf6e4b0,cbf6e4b8,0,...) at 0xcd472a78
_end(cb73c960,cb73c964,cb4f44b0,cb4f44b8,0,...) at 0xd00cfa78
_end(ca348b40,ca348b44,ca420af0,ca420af8,0,...) at 0xcc0e9c90
_end(d0310ea0,d0310ea4,cd3ad4b0,cd3ad4b8,0,...) at 0xcc7ec218
_end(ca5ddd20,ca5ddd24,ca6d8c80,ca6d8c88,0,...) at 0xca426c90
_end(c998aa20,c998aa24,ca2bb320,ca2bb328,0,...) at 0xd030fc90
[...]

oh, I saw that earlier somewhere.. don't remember where.

db> c

and waiting a few moments shows a slightly different picture:

db> bt 101174
Tracing pid 8611 tid 101174 td 0xcb77ee10
sched_switch(cb77ee10,c7f3de10,6) at sched_switch+0x143
mi_switch(cf177608,7,c0a0b460,cf177608,c0a0a0b0,...) at mi_switch+0x1ba
kseq_move(c0a0b460,7) at kseq_move+0xc1
sched_balance_pair(cb77ef68,ef879bb8,c0694edf,cb77ef68,cb77b360,...) at sched_balance_pair+0x91
_end(cbd1f650,cb77ee10,cb77ee20,0,cb77b374,...) at 0xcb77b360
MAXCPU(cb77b360,cb77b364,cb77ee10,cb77ee18,0,...) at 0
_end(d0a49a80,d0a49a84,c84cf7d0,c84cf7d8,0,...) at 0xc7f97648
_end(ca6dbcc0,ca6dbcc4,ca6d54b0,ca6d54b8,0,...) at 0xcbd1f648
_end(cbcad780,cbcad784,cc8a2190,cc8a2198,0,...) at 0xc8514430
_end(cab883c0,cab883c4,ca9417d0,ca9417d8,0,...) at 0xca6dc000
_end(cc67c4e0,cc67c4e4,cd6fd000,cd6fd008,0,...) at 0xcc8abc90
_end(cd3a9120,cd3a9124,cd3b1320,cd3b1328,0,...) at 0xcad68218
_end(cd130c60,cd130c64,d00ca320,d00ca328,0,...) at 0xca71e860
_end(cbcac240,cbcac244,cbf6e4b0,cbf6e4b8,0,...) at 0xcd472a78
[...]

--
wbr,
pluknet