From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 17 18:46:11 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8ED34BD; Wed, 17 Oct 2012 18:46:11 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 60E9F8FC12; Wed, 17 Oct 2012 18:46:11 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id BB801B94F; Wed, 17 Oct 2012 14:46:10 -0400 (EDT) From: John Baldwin To: Andriy Gapon Subject: Re: _mtx_lock_spin: obsolete historic handling of kdb_active and panicstr? Date: Wed, 17 Oct 2012 10:12:05 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p20; KDE/4.5.5; amd64; ; ) References: <507E9498.10905@FreeBSD.org> In-Reply-To: <507E9498.10905@FreeBSD.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201210171012.05392.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 17 Oct 2012 14:46:10 -0400 (EDT) Cc: freebsd-hackers , Bruce Evans X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Oct 2012 18:46:11 -0000 On Wednesday, October 17, 2012 7:20:56 am Andriy Gapon wrote: > > _mtx_lock_spin has the following check in its retry loop: > if (i < 60000000 || kdb_active || panicstr != NULL) > DELAY(1); > else > _mtx_lock_spin_failed(m); > > Which means that in the (!kdb_active && panicstr == NULL) case we will make at > most 60000000 iterations and then call _mtx_lock_spin_failed (which proceeds to > panic). When either kdb_active or panicstr is set, then we are going to loop > forever. > > I've done some digging through the lengthy history and many evolutions of the > code (unfortunately I haven't kept records during the research), and my > conclusion is that the kdb_active and panicstr checks were added at the quite > early era of FreeBSD SMP, where we didn't have a mechanism to stop/block other > CPUs when kdb or panic were entered. We didn't even prevent parallel execution > of panic. > So the above code was a sort of defense where we hoped that "other" CPUs would > eventually stumble upon some held spinlock and would be captured there. Maybe > there was a specific set of spinlocks, which were supposed to help. It wasn't so much as a way of hoping CPUs would stop so much as a way to prevent other CPUs from panic'ing while another CPU had already panic'd or was already in DDB making debugging harder. > Nowadays, we do try to stop other CPUs during panic and kdb activation and there > are good chances that they are indeed stopped. In this circumstances, should > the main CPU be so unlucky as to run into the held spinlock, the above check > would do more harm than good - the main CPU would just spin there forever, > because a lock owner is also spinning in the stop loop and so can't release the > lock. > Actually, this is only true for the kdb case. In the panic case we make a check > earlier and simply ignore/skip/bust all the locks. That makes the panicstr > check in the code in question both harmless and useless. > > So I'd like to propose to remove those checks altogether. Or perhaps to > "reverse" them and immediately induce a (possibly secondary) panic if we ever > get to that wait loop and kdb_active || panicstr != NULL. > > What do you think? I think this sounds fine. I do think though that there are two behaviors. If for some reason you are not able to stop the other CPUs, you would rather them spin than trigger another panic while you are in DDB or writing out a crashdump. However, the CPU that is currently in the debugger or writing out a crashdump should probably bust all locks (code executed in debugger backends should generally avoid all locking at all, and depend on things like try locks where it gracefully fails if it must use locking. That would make the kdb_active case here irrelevant, and the panic case is already handled as you noted.) -- John Baldwin