From owner-freebsd-stable@FreeBSD.ORG  Thu Jan  3 17:33:34 2013
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: stable@FreeBSD.org
Date: Fri, 04 Jan 2013 02:32:44 +0900 (JST)
Message-Id: <20130104.023244.472910818423317661.hrs@allbsd.org>
To: kostikbel@gmail.com
Subject: Re: NFS-exported ZFS instability
From: Hiroki Sato <hrs@FreeBSD.org>
In-Reply-To: <20130102174044.GB82219@kib.kiev.ua>
References: <20130102.105304.1817355190360003433.hrs@allbsd.org>
 <1914428061.1617223.1357133079421.JavaMail.root@erie.cs.uoguelph.ca>
 <20130102174044.GB82219@kib.kiev.ua>
X-PGPkey-fingerprint: BDB3 443F A5DD B3D0 A530  FFD7 4F2C D3D8 2793 CF2D
X-Mailer: Mew version 6.5 on Emacs 23.4 / Mule 6.0 (HANACHIRUSATO)
Mime-Version: 1.0
Content-Type: Multipart/Signed; protocol="application/pgp-signature";
 micalg=pgp-sha1;
 boundary="--Security_Multipart(Fri_Jan__4_02_32_44_2013_592)--"
Content-Transfer-Encoding: 7bit
Cc: alc@FreeBSD.org, stable@FreeBSD.org, rmacklem@uoguelph.ca
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>

----Security_Multipart(Fri_Jan__4_02_32_44_2013_592)--
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Konstantin Belousov <kostikbel@gmail.com> wrote
  in <20130102174044.GB82219@kib.kiev.ua>:

ko> > I might take a closer look this evening and see if I can spot anything
ko> > in the log, rick
ko> > ps: I hope Alan and Kostik don't mind being added to the cc list.
ko>
ko> What I see in the log is a lock cascade rooted in thread 100838,
ko> which owns the system map mutex. I believe this prevents malloc(9)
ko> from making progress in other threads, which e.g. own the ZFS vnode
ko> locks. As a result, the whole system wedged.
ko>
ko> Looking back at thread 100838, we can see that it is executing
ko> smp_tlb_shootdown(). It is impossible to tell from the static dump
ko> whether the appearance of smp_tlb_shootdown() in the backtrace is
ko> transient, or whether the thread is spinning there, waiting for other
ko> CPUs to acknowledge the request. But since the system wedged, most
ko> likely smp_tlb_shootdown() is spinning.
ko>
ko> Under this hypothesis, the situation most likely arises because
ko> some other core is running with interrupts disabled. Inspection of the
ko> backtraces of the processes running on all cores does not show any that
ko> could legitimately own a spinlock or otherwise run with interrupts
ko> disabled.
ko>
ko> One thing you could try is to enable WITNESS for the spinlocks,
ko> to try to catch a leaked spinlock. I very much doubt that this is
ko> the case.
ko>
ko> Another thing to try is to switch the CPU idle method to something
ko> else. Look at the machdep.idle* sysctls. It could be some CPU erratum
ko> that blocks wakeup by an interrupt under some conditions in C1?

 Thank you.  It can take 1-2 weeks to reproduce this, so I have set
 debug.witness.skipspin=0, am keeping machdep.idle set to acpi, and
 will see how it goes for a while.  I will report again if I get
 another freeze.
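
 For anyone who wants to try the other suggestion (switching the idle
 method), here is a minimal Python sketch of listing the candidates.
 It is illustrative only and has not been run on the affected machine.
 It assumes that machdep.idle_available reports a comma- or
 space-separated list such as "spin, mwait, hlt, acpi"; the helper
 names are hypothetical, and only sysctl(8) with -n is used.

```python
import subprocess


def parse_idle_methods(available):
    """Split a machdep.idle_available value into a list of method names.

    Assumption: the names are separated by commas and/or whitespace.
    """
    return available.replace(",", " ").split()


def alternative_idle_methods(current, available):
    """Return the available idle methods other than the current one."""
    current = current.strip()
    return [m for m in parse_idle_methods(available) if m != current]


def read_sysctl(name):
    """Read one sysctl value via sysctl(8); only meaningful on FreeBSD."""
    result = subprocess.run(
        ["sysctl", "-n", name],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def report_alternatives():
    """Print the idle methods that could be tried instead of the current one."""
    current = read_sysctl("machdep.idle")
    available = read_sysctl("machdep.idle_available")
    for method in alternative_idle_methods(current, available):
        # Only a suggestion is printed; nothing is changed on the system.
        print("candidate: sysctl machdep.idle=" + method)
```

 Calling report_alternatives() on the server would only print the
 candidates; the actual switch is left to be done by hand, so that
 each method can be tested for the 1-2 weeks a freeze takes to show up.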

-- Hiroki

----Security_Multipart(Fri_Jan__4_02_32_44_2013_592)--
Content-Type: application/pgp-signature
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (FreeBSD)

iEYEABECAAYFAlDlwLwACgkQTyzT2CeTzy0vgQCg0rRPQAS6lnmZjyeN66WO6+Uf
vTIAn34gFwijO6lsvAwKDxRkpI+zNZSZ
=bGWz
-----END PGP SIGNATURE-----

----Security_Multipart(Fri_Jan__4_02_32_44_2013_592)----