From owner-freebsd-hackers@FreeBSD.ORG  Wed Oct 17 18:46:11 2012
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 8ED34BD;
 Wed, 17 Oct 2012 18:46:11 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net
 [IPv6:2001:470:1f10:75::2])
 by mx1.freebsd.org (Postfix) with ESMTP id 60E9F8FC12;
 Wed, 17 Oct 2012 18:46:11 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id BB801B94F;
 Wed, 17 Oct 2012 14:46:10 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: Andriy Gapon <avg@freebsd.org>
Subject: Re: _mtx_lock_spin: obsolete historic handling of kdb_active and
 panicstr?
Date: Wed, 17 Oct 2012 10:12:05 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p20; KDE/4.5.5; amd64; ; )
References: <507E9498.10905@FreeBSD.org>
In-Reply-To: <507E9498.10905@FreeBSD.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201210171012.05392.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Wed, 17 Oct 2012 14:46:10 -0400 (EDT)
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>,
 Bruce Evans <brde@optusnet.com.au>
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 17 Oct 2012 18:46:11 -0000

On Wednesday, October 17, 2012 7:20:56 am Andriy Gapon wrote:
> 
> _mtx_lock_spin has the following check in its retry loop:
> if (i < 60000000 || kdb_active || panicstr != NULL)
>         DELAY(1);
> else
>         _mtx_lock_spin_failed(m);
> 
> Which means that in the (!kdb_active && panicstr == NULL) case we will make at
> most 60000000 iterations and then call _mtx_lock_spin_failed (which proceeds to
> panic).  When either kdb_active or panicstr is set, then we are going to loop
> forever.
> 
> I've done some digging through the lengthy history and many evolutions of the
> code (unfortunately I haven't kept records during the research), and my
> conclusion is that the kdb_active and panicstr checks were added at the quite
> early era of FreeBSD SMP, where we didn't have a mechanism to stop/block other
> CPUs when kdb or panic were entered.  We didn't even prevent parallel execution
> of panic.
> So the above code was a sort of defense where we hoped that "other" CPUs would
> eventually stumble upon some held spinlock and would be captured there.  Maybe
> there was a specific set of spinlocks, which were supposed to help.

It wasn't so much as a way of hoping CPUs would stop so much as a way to prevent
other CPUs from panic'ing while another CPU had already panic'd or was already
in DDB making debugging harder.

> Nowadays, we do try to stop other CPUs during panic and kdb activation and there
> are good chances that they are indeed stopped.  In this circumstances, should
> the main CPU be so unlucky as to run into the held spinlock, the above check
> would do more harm than good - the main CPU would just spin there forever,
> because a lock owner is also spinning in the stop loop and so can't release the
> lock.
> Actually, this is only true for the kdb case.  In the panic case we make a check
> earlier and simply ignore/skip/bust all the locks.  That makes the panicstr
> check in the code in question both harmless and useless.
> 
> So I'd like to propose to remove those checks altogether.  Or perhaps to
> "reverse" them and immediately induce a (possibly secondary) panic if we ever
> get to that wait loop and kdb_active || panicstr != NULL.
> 
> What do you think?

I think this sounds fine.  I do think though that there are two behaviors.  If
for some reason you are not able to stop the other CPUs, you would rather them
spin than trigger another panic while you are in DDB or writing out a crashdump.
However, the CPU that is currently in the debugger or writing out a crashdump
should probably bust all locks (code executed in debugger backends should
generally avoid all locking at all, and depend on things like try locks where it
gracefully fails if it must use locking.  That would make the kdb_active case
here irrelevant, and the panic case is already handled as you noted.)

-- 
John Baldwin