Date: Wed, 30 Jan 2013 06:44:59 +0900 (JST)
From: Hiroki Sato <hrs@FreeBSD.org>
To: kostikbel@gmail.com
Cc: alc@FreeBSD.org, stable@FreeBSD.org, rmacklem@uoguelph.ca
Subject: Re: NFS-exported ZFS instability
Message-ID: <20130130.064459.2572086065267072.hrs@allbsd.org>
In-Reply-To: <20130104.023244.472910818423317661.hrs@allbsd.org>
References: <1914428061.1617223.1357133079421.JavaMail.root@erie.cs.uoguelph.ca> <20130102174044.GB82219@kib.kiev.ua> <20130104.023244.472910818423317661.hrs@allbsd.org>
Hiroki Sato <hrs@freebsd.org> wrote
  in <20130104.023244.472910818423317661.hrs@allbsd.org>:

hr> Konstantin Belousov <kostikbel@gmail.com> wrote
hr>   in <20130102174044.GB82219@kib.kiev.ua>:
hr>
hr> ko> > I might take a closer look this evening and see if I can spot anything
hr> ko> > in the log, rick
hr> ko> > ps: I hope Alan and Kostik don't mind being added to the cc list.
hr> ko>
hr> ko> What I see in the log is that the lock cascade is rooted in thread
hr> ko> 100838, which owns the system map mutex. I believe this prevents
hr> ko> malloc(9) from making progress in other threads, which e.g. own the
hr> ko> ZFS vnode locks. As a result, the whole system wedged.
hr> ko>
hr> ko> Looking back at thread 100838, we can see that it executes
hr> ko> smp_tlb_shootdown(). It is impossible to tell from the static dump
hr> ko> whether the appearance of smp_tlb_shootdown() in the backtrace is
hr> ko> transient, or the thread is spinning there, waiting for other CPUs to
hr> ko> acknowledge the request. But, since the system wedged, most likely
hr> ko> smp_tlb_shootdown() spins.
hr> ko>
hr> ko> Taking this hypothesis, the situation can occur, most likely, due to
hr> ko> some other core running with interrupts disabled. Inspection of the
hr> ko> backtraces of the processes running on all cores does not show any which
hr> ko> could legitimately own a spinlock or otherwise run with interrupts
hr> ko> disabled.
hr> ko>
hr> ko> One thing you could try is to enable WITNESS for the spinlocks,
hr> ko> to try to catch the leaked spinlock. I very much doubt that this is
hr> ko> the case.
hr> ko>
hr> ko> Another thing to try is to switch the CPU idle method to something
hr> ko> else. Look at the machdep.idle* sysctls. It could be some CPU erratum
hr> ko> which blocks wakeup due to the interrupt in some conditions in C1?
hr>
hr> Thank you. It can take 1-2 weeks to reproduce this, so I set
hr> debug.witness.skipspin=0, am keeping machdep.idle at acpi, and will see
hr> how it goes for a while. I will report again if I can get another
hr> freeze.

 Hmm, I could reproduce the same freeze with debug.witness.skipspin=0,
 too. The DDB and crash dump outputs are here:

  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt

 The value of machdep.idle was acpi. I have seen this symptom on two
 boxes with the following CPUs, so I am guessing it is not specific to
 a CPU model:

  CPU: Intel(R) Pentium(R) D CPU 3.40GHz (3391.52-MHz K8-class CPU)
  CPU: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz (2666.82-MHz K8-class CPU)

-- Hiroki
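
For readers following the thread, the two settings discussed above could be applied roughly as in the fragment below. This is only a sketch under stated assumptions: it assumes a kernel built with "options WITNESS", it assumes debug.witness.skipspin is set as a boot-time tunable from /boot/loader.conf, and the idle method "hlt" is an illustrative choice that does not come from the thread (the message above kept machdep.idle at acpi).

```
# /boot/loader.conf
# Assumption: the kernel is built with "options WITNESS".
# 0 = do not skip spin mutexes, so WITNESS also checks spinlocks.
debug.witness.skipspin=0

# /etc/sysctl.conf
# Assumption: "hlt" is only an example of an alternative idle method;
# it is not a value suggested in the thread.
machdep.idle=hlt
```

The idle methods a given machine actually supports can be listed with "sysctl machdep.idle_available" before changing machdep.idle, since the set differs between CPUs.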