Date:      Tue, 16 Oct 2012 23:10:11 +0300
From:      Andriy Gapon <avg@FreeBSD.org>
To:        Dennis Glatting <dg@pki2.com>
Cc:        freebsd-fs@FreeBSD.org, dg17@penx.com
Subject:   Re: I have a DDB session open to a crashed ZFS server
Message-ID:  <507DBF23.4050303@FreeBSD.org>
In-Reply-To: <alpine.BSF.2.00.1210161139060.22959@btw.pki2.com>
References:  <1350317019.71982.50.camel@btw.pki2.com> <201210160844.41042.jhb@freebsd.org> <1350400597.72003.32.camel@btw.pki2.com> <201210161215.33369.jhb@freebsd.org> <507D8B69.3090903@FreeBSD.org> <alpine.BSF.2.00.1210161139060.22959@btw.pki2.com>

on 16/10/2012 21:48 Dennis Glatting said the following:
> 
> 
> On Tue, 16 Oct 2012, Andriy Gapon wrote:
> 
>> on 16/10/2012 19:15 John Baldwin said the following:
>>> On Tuesday, October 16, 2012 11:16:37 am Dennis Glatting wrote:
>>>> On Tue, 2012-10-16 at 08:44 -0400, John Baldwin wrote:
>>>>> On Monday, October 15, 2012 12:03:39 pm Dennis Glatting wrote:
>>>>>> FreeBSD/amd64 (mc) (ttyu0)
>>>>>>
>>>>>> login: NMI ... going to debugger
>>>>>> [ thread pid 11 tid 100003 ]
>>>>>
>>>>> You got an NMI, not a crash.  What happens if you just continue ('c' command)
>>>>> from DDB?
>>>>>
>>>>
>>>> I hit the NMI button because of the "crash," which is a misword, to get
>>>> into DDB.
>>>
>>> Ah, I would suggest "hung" or "deadlocked" next time.  It certainly seems like
>>> a deadlock since all CPUs are idle.  Some helpful commands here might be
>>> 'show sleepchain' and 'show lockchain'.
>>>
>>> Pick a "stuck" process (like find) and run:
>>>
>>> 'show sleepchain <pid>'
>>>
>>> In your case though it seems both 'find' and the various 'pbzip2' threads
>>> are stuck on a condition variable, so there isn't an easy way to identify
>>> an "owner" that is supposed to awaken these threads.  It could be a case
>>> of a missed wakeup perhaps, but you'll need to get someone more familiar
>>> with ZFS to identify where these codes should be awakened normally.
>>>
>>
>> I would also reiterate a suggestion that I made to Nikolay earlier:
>> http://article.gmane.org/gmane.os.freebsd.devel.file-systems/15981
>>
>> BTW, in that case it turned out to be a genuine deadlock in ZFS ARC handling of
>> lowmem.
>> procstat -kk -a is a great help for analyzing such situations.
>>
> 
> Without restarting the server and from memory, I believe the ARC on this server
> is 32GB. The L2ARC is a 50-60GB SSD. The ZIL is a 16GB partitioned SSD but my
> non-ZIL systems have the same problem. Main memory is 128GB.

Unfortunately, this information doesn't help with the debugging.

> I can run procstat to a serial console and scarf the output. What interval would
> be helpful? Five seconds? Remember when the system hangs, no commands will run
> so the data will be pre-hang.

Hmm... No, I need its output from the time when the system is hung.
If you can arrange to have a memory disk (mdmfs, mdconfig) with UFS on it, then
I believe that you should be able to run procstat from it, provided that the
shared libraries it uses are placed there as well.
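
A rough sketch of how such a memory disk could be staged before the hang is
below. The mount point (/mnt/rescue), the 64m size and the helper names are
only assumptions for illustration, and it has not been tested on the affected
machine. It also assumes that a shell is still responsive at hang time and that
the runtime linker itself can still be loaded, which may not hold if the root
file system is on ZFS.

```shell
#!/bin/sh
# Sketch only: stage procstat on a UFS memory disk so that it can still be
# started after ZFS stops responding. Mount point and size are assumptions.

RESCUE=${RESCUE:-/mnt/rescue}   # hypothetical mount point

# Print the absolute library paths found in `ldd` output read from stdin.
# ldd lines look like: "libc.so.7 => /lib/libc.so.7 (0x800a4c000)"
lib_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Create the memory disk and copy procstat plus its shared libraries.
# Must be run as root on the FreeBSD machine, before the hang occurs.
stage_procstat() {
    mkdir -p "$RESCUE" &&
    mdmfs -s 64m md "$RESCUE" &&
    cp /usr/bin/procstat "$RESCUE/" &&
    ldd /usr/bin/procstat | lib_paths | while read -r lib; do
        cp "$lib" "$RESCUE/" || exit 1
    done
}

# Nothing privileged runs until stage_procstat is called explicitly.
# After the hang, from a shell that is still responsive:
#   LD_LIBRARY_PATH=/mnt/rescue /mnt/rescue/procstat -kk -a
```

Setting LD_LIBRARY_PATH makes the runtime linker look for the copied libraries
on the memory disk first, so that they do not have to be read from ZFS.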

> BTW, it takes 4-24 hours to hang under load.
> 
> Also, are you suggesting I apply the patch in the URL and run again? I have been
> following your other posts but the patches you posted did not cleanly apply, so
> I removed them from my rev.
> 

The patches are intended for head and recent stable.
The patch that you are referring to may help with debugging.

-- 
Andriy Gapon
