From: Andriy Gapon
Date: Tue, 16 Oct 2012 23:10:11 +0300
To: Dennis Glatting
Cc: freebsd-fs@FreeBSD.org, dg17@penx.com
Subject: Re: I have a DDB session open to a crashed ZFS server
Message-ID: <507DBF23.4050303@FreeBSD.org>

on 16/10/2012 21:48 Dennis Glatting said the following:
>
>
> On Tue, 16 Oct 2012, Andriy Gapon wrote:
>
>> on 16/10/2012 19:15 John Baldwin said the following:
>>> On Tuesday, October 16, 2012 11:16:37 am Dennis Glatting wrote:
>>>> On Tue, 2012-10-16 at 08:44 -0400, John
>>>> Baldwin wrote:
>>>>> On Monday, October 15, 2012 12:03:39 pm Dennis Glatting wrote:
>>>>>> FreeBSD/amd64 (mc) (ttyu0)
>>>>>>
>>>>>> login: NMI ... going to debugger
>>>>>> [ thread pid 11 tid 100003 ]
>>>>>
>>>>> You got an NMI, not a crash. What happens if you just continue ('c' command)
>>>>> from DDB?
>>>>>
>>>>
>>>> I hit the NMI button because of the "crash," which is a misword, to get
>>>> into DDB.
>>>
>>> Ah, I would suggest "hung" or "deadlocked" next time. It certainly seems like
>>> a deadlock since all CPUs are idle. Some helpful commands here might be
>>> 'show sleepchain' and 'show lockchain'.
>>>
>>> Pick a "stuck" process (like find) and run:
>>>
>>> 'show sleepchain '
>>>
>>> In your case though it seems both 'find' and the various 'pbzip2' threads
>>> are stuck on a condition variable, so there isn't an easy way to identify
>>> an "owner" that is supposed to awaken these threads. It could be a case
>>> of a missed wakeup perhaps, but you'll need to get someone more familiar
>>> with ZFS to identify where these codes should be awakened normally.
>>>
>>
>> I would also re-iterate a suggestion that I made to Nikolay earlier:
>> http://article.gmane.org/gmane.os.freebsd.devel.file-systems/15981
>>
>> BTW, in that case it turned out to be a genuine deadlock in ZFS ARC handling of
>> lowmem.
>> procstat -kk -a is a great help for analyzing such situations.
>>
>
> Without restarting the server and from memory, I believe the ARC on this server
> is 32GB. The L2ARC is a 50-60GB SSD. The ZIL is a 16GB partitioned SSD but my
> non-ZIL systems have the same problem. Main memory is 128GB.

Unfortunately, this information doesn't help with the debugging...

> I can run procstat to a serial console and scarf the output. What interval would
> be helpful? Five seconds? Remember when the system hangs, no commands will run
> so the data will be pre-hang.

Hmm... No, I need its output when the system hangs.
If you can arrange to have a memory disk (mdmfs, mdconfig) with UFS on it, then
I believe that you should be able to keep procstat (and the shared libraries
that it uses) on it and run procstat from there.

> BTW, it takes 4-24 hours to hang under load.
>
> Also, are you suggesting I apply the patch in the URL and run again? I have been
> following your other posts but the patches you posted did not cleanly apply, so
> I removed them from my rev.
>

The patches are intended for head and recent stable.
The patch that you are referring to may help with debugging.

-- 
Andriy Gapon
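
P.S. The memory-disk arrangement mentioned above could be sketched roughly as
follows. This is only a sketch under assumptions, not a tested recipe: the
mount point /mnt/rescue, the 64m size, and the helper names run and
stage_procstat are all hypothetical choices made for illustration. By default
the script only prints the commands it would run; set DRY_RUN=0 and run it as
root, before the hang, to actually execute them.

```shell
#!/bin/sh
# Sketch: stage procstat and its shared libraries on a UFS memory disk.
# RESCUE and the 64m size are arbitrary, hypothetical choices.
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
RESCUE=${RESCUE:-/mnt/rescue}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

stage_procstat() {
    run mkdir -p "$RESCUE"
    # mdmfs creates a memory disk, puts a UFS file system on it, and mounts it.
    run mdmfs -s 64m md "$RESCUE"
    run mkdir -p "$RESCUE/bin" "$RESCUE/lib"
    run cp /usr/bin/procstat "$RESCUE/bin/"
    # ldd prints lines like "libfoo.so.N => /lib/libfoo.so.N (0x...)";
    # the third field is the path of each library to copy.
    for lib in $(ldd /usr/bin/procstat 2>/dev/null | awk '$2 == "=>" { print $3 }'); do
        run cp "$lib" "$RESCUE/lib/"
    done
}

stage_procstat
```

Once the system hangs, the staged copy could then be tried from the console
with something like:

    LD_LIBRARY_PATH=/mnt/rescue/lib /mnt/rescue/bin/procstat -kk -a

Note that the sketch does not deal with the runtime linker itself, which the
binary still loads from its usual location, so this may or may not be enough
on a system whose root file system is on ZFS.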