From: Nikolay Denev
Date: Wed, 17 Oct 2012 10:04:38 +0300
Subject: Re: I have a DDB session open to a crashed ZFS server
To: Dennis Glatting
Cc: freebsd-fs@freebsd.org, dg17@penx.com, Andriy Gapon
Message-Id: <0B0CA833-79FA-4C8E-86AC-828E7947FF67@gmail.com>
References: <1350317019.71982.50.camel@btw.pki2.com>
 <201210160844.41042.jhb@freebsd.org>
 <1350400597.72003.32.camel@btw.pki2.com>
 <201210161215.33369.jhb@freebsd.org>
 <507D8B69.3090903@FreeBSD.org>

On Oct 16, 2012, at 9:48 PM, Dennis Glatting wrote:

>
> On Tue, 16 Oct 2012, Andriy Gapon wrote:
>
>> on 16/10/2012 19:15 John Baldwin said the following:
>>> On Tuesday, October 16, 2012 11:16:37 am Dennis Glatting wrote:
>>>> On Tue, 2012-10-16 at 08:44 -0400, John Baldwin wrote:
>>>>> On Monday, October 15, 2012 12:03:39 pm Dennis Glatting wrote:
>>>>>> FreeBSD/amd64 (mc) (ttyu0)
>>>>>>
>>>>>> login: NMI ... going to debugger
>>>>>> [ thread pid 11 tid 100003 ]
>>>>>
>>>>> You got an NMI, not a crash.  What happens if you just continue ('c' command)
>>>>> from DDB?
>>>>>
>>>>
>>>> I hit the NMI button because of the "crash," which is a misword, to get
>>>> into DDB.
>>>
>>> Ah, I would suggest "hung" or "deadlocked" next time.  It certainly seems like
>>> a deadlock since all CPUs are idle.  Some helpful commands here might be
>>> 'show sleepchain' and 'show lockchain'.
>>>
>>> Pick a "stuck" process (like find) and run:
>>>
>>> 'show sleepchain '
>>>
>>> In your case though it seems both 'find' and the various 'pbzip2' threads
>>> are stuck on a condition variable, so there isn't an easy way to identify
>>> an "owner" that is supposed to awaken these threads.  It could be a case
>>> of a missed wakeup perhaps, but you'll need to get someone more familiar
>>> with ZFS to identify where these codes should be awakened normally.
>>>
>>
>> I would also re-iterate a suggestion that I made to Nikolay earlier:
>> http://article.gmane.org/gmane.os.freebsd.devel.file-systems/15981
>>
>> BTW, in that case it turned out to be a genuine deadlock in ZFS ARC handling of
>> lowmem.
>> procstat -kk -a is a great help for analyzing such situations.
>>
>
> Without restarting the server and from memory, I believe the ARC on this server is 32GB. The L2ARC is a 50-60GB SSD. The ZIL is a 16GB partitioned SSD but my non-ZIL systems have the same problem. Main memory is 128GB.
>
> I can run procstat to a serial console and scarf the output. What interval would be helpful? Five seconds? Remember when the system hangs, no commands will run so the data will be pre-hang.
>
> BTW, it takes 4-24 hours to hang under load.
>
> Also, are you suggesting I apply the patch in the URL and run again? I have been following your other posts but the patches you posted did not cleanly apply, so I removed them from my rev.

Hi,

I'm running with the patch from here:
http://thread.gmane.org/gmane.os.freebsd.devel.file-systems/16000/focus=16017

There have been no deadlocks since I applied it.
If you're hitting the same issue as I was, this should help.

Regards,
Nikolay
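
For the periodic capture discussed above (procstat output sent somewhere that survives the hang, so the last snapshot is the pre-hang state), a minimal sketch could look like the following. Only the `procstat -kk -a` command and the five-second interval come from the thread; the function name `snapshot_loop`, its parameters, and the timestamp format are assumptions made for illustration.

```python
import subprocess
import sys
import time


def snapshot_loop(cmd, interval, count=None, out=sys.stdout):
    """Run `cmd` repeatedly and write each output, timestamped, to `out`.

    `count=None` means run until interrupted. Everything here is a
    hypothetical helper, not something taken from the thread.
    """
    done = 0
    while count is None or done < count:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        result = subprocess.run(cmd, capture_output=True, text=True)
        out.write(f"=== {stamp} rc={result.returncode} ===\n")
        out.write(result.stdout)
        # Flush after every snapshot so the latest one is not lost in a
        # buffer if the machine hangs before the next iteration.
        out.flush()
        done += 1
        if count is None or done < count:
            time.sleep(interval)


# Assumed usage on the affected host, with stdout redirected to the
# serial console or a log on another machine:
#   snapshot_loop(["procstat", "-kk", "-a"], interval=5)
```

Flushing after each snapshot is the main design point: once the system hangs nothing further will run, so whatever was written last is all there is to analyze.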