From owner-freebsd-stable@FreeBSD.ORG  Wed Jul  8 07:18:57 2009
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3396D106566C;
	Wed,  8 Jul 2009 07:18:57 +0000 (UTC)
	(envelope-from dan.naumov@gmail.com)
Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.245])
	by mx1.freebsd.org (Postfix) with ESMTP id CF9BB8FC08;
	Wed,  8 Jul 2009 07:18:56 +0000 (UTC)
	(envelope-from dan.naumov@gmail.com)
Received: by an-out-0708.google.com with SMTP id d14so2461176and.13
	for <multiple recipients>; Wed, 08 Jul 2009 00:18:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:in-reply-to:references
	:date:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=3BrcMTc4dn8GTLTuMt2Mvu4NPEwao1ULV2v7hwKru6s=;
	b=ntCyndFgZ2z3UOk6/gqe1As0q9N+WDXgy6zXw0xkuju5oEwwgXnm91V1xM6kqgmLtJ
	MFXGsyG5jAODKb8jtpq/SeBcWmAA9VPSIqIpxlzjhuOSQAoeoWKCIdl3VZIsh7H20wyY
	cy7yid9t2jx3pxaGHjxvOSioFqX2mQJNorW5k=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	b=NXmFcIQxkYcrifylFLQD9GkVRMh4+GZMcNMKTB8Ew0oxSGAdnH70GgVfJli77Hs3Qy
	6cqsryAyOPdaMAAdNLz+/jC0nEFM3UBc/U26JNX6wYkFTVR7i7w+3kiUk8se1hdknQaR
	xBTn85xt83K4tgQqr9FZT2XPCJPeokJuqKuAE=
MIME-Version: 1.0
Received: by 10.101.66.17 with SMTP id t17mr12074997ank.41.1247037536211; Wed, 
	08 Jul 2009 00:18:56 -0700 (PDT)
In-Reply-To: <cf9b1ee00907071757i169d2a82la260798f364054f9@mail.gmail.com>
References: <cf9b1ee00907061812r3da70018i1c8d8d12bb038a80@mail.gmail.com>
	<3bbf2fe10907061818v245abd0cgc3ca5073cb93aea4@mail.gmail.com>
	<cf9b1ee00907061825r34165c48x6727c50b3219d5fb@mail.gmail.com>
	<3bbf2fe10907061827g35eaeb49g26cf6fdb64436ca7@mail.gmail.com>
	<cf9b1ee00907071757i169d2a82la260798f364054f9@mail.gmail.com>
Date: Wed, 8 Jul 2009 10:18:56 +0300
Message-ID: <cf9b1ee00907080018s3f32c8afr4f65f01ce9ff1f25@mail.gmail.com>
From: Dan Naumov <dan.naumov@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject: Re: 7.2-release/amd64: panic, spin lock held too long
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 08 Jul 2009 07:18:57 -0000

On Wed, Jul 8, 2009 at 3:57 AM, Dan Naumov<dan.naumov@gmail.com> wrote:
> On Tue, Jul 7, 2009 at 4:27 AM, Attilio Rao<attilio@freebsd.org> wrote:
>> 2009/7/7 Dan Naumov <dan.naumov@gmail.com>:
>>> On Tue, Jul 7, 2009 at 4:18 AM, Attilio Rao<attilio@freebsd.org> wrote:
>>>> 2009/7/7 Dan Naumov <dan.naumov@gmail.com>:
>>>>> I just got a panic followed by a reboot a few seconds after running
>>>>> "portsnap update". /var/log/messages shows the following:
>>>>>
>>>>> Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
>>>>> Jul  7 03:49:38 atom kernel: spin lock 0xffffffff80b3edc0 (sched lock 1) held by 0xffffff00017d8370 (tid 100054) too long
>>>>> Jul  7 03:49:38 atom kernel: panic: spin lock held too long
>>>>
>>>> That's a known bug, affecting -CURRENT as well.
>>>> The cpustop IPI is handled through an NMI, which means it can
>>>> interrupt a CPU at any moment, even while it is holding a spinlock,
>>>> violating one well-known FreeBSD rule.
>>>> That means that the CPU can stop itself while the thread is holding
>>>> the sched lock spinlock, without releasing it (there is no way, short
>>>> of something highly hackish, to fix that).
>>>> In the meantime, hardclock() wants to schedule something else to run
>>>> and gets stuck on the thread lock.
>>>>
>>>> The ideal fix would involve not using an NMI for serving the cpustop,
>>>> while having a cheap way (one that does not make the common path too
>>>> hard) to tell hardclock() to avoid scheduling while a cpustop is in flight.
>>>>
>>>> Thanks,
>>>> Attilio
>>>
>>> Any idea if a fix is being worked on, and how unlucky must one be to
>>> run into this issue? Should I expect it to happen again? Is it
>>> basically completely random?
>>
>> I'd like to work on that issue before BETA3 (and backport the fix to
>> STABLE_7); I'm just time-constrained right now.
>> It is completely random.
>>
>> Thanks,
>> Attilio
>
> OK, this is getting pretty bad: 23 hours later, I got the same kind of
> panic. The only difference is that instead of "portsnap update", this
> one was triggered by "portsnap cron", which I have running between 3 and
> 4 am every day:
>
> Jul  8 03:03:49 atom kernel: ssppiinn  lloocckk
> 00xxffffffffffffffff8800bb33eeddc400  ((sscchheedd  lloocck k1 )0 )h
> ehledl db yb y 0x0xfffffffffff0f00001081735339760e 0( t(itdi d
> 10100006070)5 )t otoo ol olnogng
> Jul  8 03:03:49 atom kernel: p
> Jul  8 03:03:49 atom kernel: anic: spin lock held too long
> Jul  8 03:03:49 atom kernel: cpuid = 0
> Jul  8 03:03:49 atom kernel: Uptime: 23h2m38s
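
The first (non-interleaved) panic message quoted above has a regular
shape, so the lock address, lock name, owner thread pointer and tid can
be extracted with a small script when comparing panics across reboots.
This is only a sketch: the regular expression is inferred from the single
message in this thread, and the name parse_spin_panic is made up for
illustration. It will not match the second, interleaved message.

```python
import re

# Pattern inferred from the one message in this thread (assumption);
# other kernel versions may word the message differently.
SPIN_RE = re.compile(
    r"spin lock (?P<lock>0x[0-9a-f]+) \((?P<name>[^)]*)\) "
    r"held by (?P<owner>0x[0-9a-f]+) \(tid (?P<tid>\d+)\) too long"
)


def parse_spin_panic(line):
    """Return the fields of a 'spin lock ... too long' line, or None."""
    match = SPIN_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["tid"] = int(fields["tid"])  # tid is decimal in the message
    return fields


if __name__ == "__main__":
    sample = (
        "Jul  7 03:49:38 atom kernel: spin lock 0xffffffff80b3edc0 "
        "(sched lock 1) held by 0xffffff00017d8370 (tid 100054) too long"
    )
    print(parse_spin_panic(sample))
```

Feeding it each matching line from /var/log/messages would show whether
the same lock and the same owner thread are involved every time.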

I have now tried reproducing the problem by running "stress --cpu 8 --io
8 --vm 4 --vm-bytes 1024M --timeout 600s --verbose", which pushed the
system load into the 15.50 ballpark, while simultaneously running
"portsnap fetch" and "portsnap update", but I couldn't trigger the panic
manually. It seems that this problem is indeed random (although it
baffles me why it is specifically portsnap that triggers it). I have now
disabled powerd to check whether that makes any difference to system
stability.
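
The failure mode described earlier in the thread (a CPU being stopped
while it still holds the sched lock, and another CPU then spinning on
that lock until it gives up) can be modelled with a toy, single-process
sketch. Nothing here is FreeBSD code: SPIN_LIMIT, spin_acquire and
simulate are hypothetical names, and a Python lock only stands in for the
kernel spinlock.

```python
import threading

SPIN_LIMIT = 10_000  # made-up bound standing in for the kernel's timeout


class SpinLockHeldTooLong(Exception):
    """Toy equivalent of the 'spin lock held too long' panic."""


def spin_acquire(lock, limit=SPIN_LIMIT):
    """Busy-wait on the lock; give up after `limit` failed attempts."""
    for _ in range(limit):
        if lock.acquire(blocking=False):
            return
    raise SpinLockHeldTooLong("spin lock held too long")


def simulate(owner_stopped):
    """Model two CPUs contending for one sched lock."""
    sched_lock = threading.Lock()
    sched_lock.acquire()      # the first CPU takes the sched lock
    if not owner_stopped:
        sched_lock.release()  # normal case: the owner finishes and releases
    # owner_stopped=True models the stop request arriving while the lock
    # is held: the owner never runs again, so the lock is never released.
    try:
        spin_acquire(sched_lock)  # the second CPU wants the same lock
        return "ok"
    except SpinLockHeldTooLong as exc:
        return "panic: %s" % exc


if __name__ == "__main__":
    print(simulate(owner_stopped=False))
    print(simulate(owner_stopped=True))
```

In this model the outcome depends only on whether the stop lands while
the lock is held, which would be consistent with the panic not being
tied to load.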

The only other things running on the system are sshd, ntpd, smartd,
smbd/nmbd and a few instances of irssi in screen sessions.

- Sincerely,
Dan Naumov