Date: Wed, 29 Jul 2009 04:43:36 +0200
From: Attilio Rao <attilio@freebsd.org>
To: Stefan Bethke <stb@lassitu.de>
Cc: FreeBSD Current <freebsd-current@freebsd.org>, Giovanni Trematerra <giovanni.trematerra@gmail.com>, Dan Naumov <dan.naumov@gmail.com>, barbara <barbara.xxx1975@libero.it>, "Bjoern A. Zeeb" <bz@freebsd.org>, Robert Watson <rwatson@freebsd.org>, "C. C. Tang" <hiyorin@gmail.com>
Subject: Re: spinlock held too long on reboot
Message-ID: <3bbf2fe10907281943m2392a9f9w7c69303e6c3b91d0@mail.gmail.com>
In-Reply-To: <226F1AFF-45D8-4E4C-BE7F-D2EDC35EC8F6@lassitu.de>
References: <746CE32B-BCF8-460A-982D-25341554E8FD@lassitu.de> <3bbf2fe10905221234k12c45932gb1e197143cd74b5d@mail.gmail.com> <FCCBA84A-1B9E-4C63-BD42-452565BE9292@lassitu.de> <20090522230333.X72053@maildrop.int.zabbadoz.net> <3bbf2fe10905221846q7fd1fe9cue744de61f9e12612@mail.gmail.com> <226F1AFF-45D8-4E4C-BE7F-D2EDC35EC8F6@lassitu.de>

2009/5/23 Stefan Bethke <stb@lassitu.de>:
> I wrote:
>
>> Syncing disks, vnodes remaining...0 done
>> All buffers synced.
>> GEOM_MIRROR: Device diesel_root: provider mirror/diesel_root destroyed.
>> Uptime: 6m32s
>> GEOM_MIRROR: Device diesel_root destroyed.
>> Rebooting...
>> cpu_reset: Stopping other CPUs
>> spin lock 0xffffffff8078c900 (sched lock 1) held by 0xffffff00014d4ab0
>> (tid 100002) too long
>> panic: spin lock held too long
>> cpuid = 0
>> KDB: enter: panic
>> [thread pid 77 tid 100090 ]
>> Stopped at kdb_enter+0x3d: movq $0,0x48bbd0(%rip)
>> db> bt
>> Tracing pid 77 tid 100090 td 0xffffff000457bab0
>> kdb_enter() at kdb_enter+0x3d
>> panic() at panic+0x17b
>> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
>> _mtx_lock_spin() at _mtx_lock_spin+0x9e
>> _mtx_lock_spin_flags() at _mtx_lock_spin_flags+0x72
>> sched_balance_group() at sched_balance_group+0xc5
>> sched_balance_group() at sched_balance_group+0x1f8
>> sched_balance() at sched_balance+0xa2
>> sched_clock() at sched_clock+0xf6
>> statclock() at statclock+0xbd
>> lapic_handle_timer() at lapic_handle_timer+0x197
>> Xtimerint() at Xtimerint+0x8c
>> --- interrupt, rip = 0xffffffff80541cc4, rsp = 0xffffff80771dba90, rbp =
>> 0xffffff80771dbab0 ---
>> DELAY() at DELAY+0x64
>> cpu_reset() at cpu_reset+0xdd
>> boot() at boot+0x2e6
>> reboot() at reboot+0x42
>> syscall() at syscall+0x1a5
>> Xfast_syscall() at Xfast_syscall+0xd0
>> --- syscall (55, FreeBSD ELF64, reboot), rip = 0x800788eec, rsp =
>> 0x7fffffffeca8, rbp = 0 ---
>
>
> I've only seen this once. If I should encounter it again, is there
> something you'd like me to look at?

[ Sorry, I am trying to add everyone who has already reported this
problem, even though I know many of you experienced it on -STABLE ]

Could you try this patch against -CURRENT:
http://www.freebsd.org/~attilio/stop_nmi.diff

This patch basically does 2 things:

1) Removes the STOP_NMI option and adds the infrastructure for using
NMI on KDB invocation and normal stop IPIs on standard CPU shutdown.
To accomplish that, and to allow for a better design than what
STOP_NMI does now, 2 new functions are introduced:

* ipi_hstop_selected(), which, if the architecture offers such an
option, sends a "forced" IPI through a privileged channel (NMI on
amd64 and ia32) in order to stop the CPUs passed in the mask. Note
that on architectures other than amd64 and ia32,
ipi_hstop_selected() defaults to ipi_selected(..., STOP_IPI), but
maintainers who want to override that can simply implement something
harder.

* stop_cpus_hard(), which is a 'more powerful' version of stop_cpus()
that uses ipi_hstop_selected() instead of
ipi_selected(..., STOP_IPI) to stop CPUs.

In the end, while the shutdown subsystem keeps using stop_cpus(), kdb
now uses stop_cpus_hard().

2) Disables interrupts on CPU0 while doing the stop_cpus() for the
others. That prevents spurious fast handlers from preempting CPU0
while it is doing the stopping (aka: timerint running hardclock()).

If you can report whether the patch fixes the problem for you, that
would be great. I'm already well aware that this patch needs an entry
in UPDATING too if we verify it does solve the problem.

If someone wants to port this to STABLE_7 and is faster than me, they
are welcome. Due to the invasiveness of the patch, it will need to be
modified if it is eventually ported to STABLE_7.

I tested it on i386, but I still need to run a make universe. I will
do that ASAP.

* Please don't forget to drop STOP_NMI from your own custom config
files *

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein
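
The split between stop_cpus() and stop_cpus_hard() described in point 1
can be sketched as a small user-space mock. This is only an illustration
of the dispatch, not code from stop_nmi.diff: the cpumask_t typedef, the
ipi_kind enum, the last_ipi[] array and the have_nmi parameter are all
hypothetical stand-ins, and the real kernel functions have different
signatures and actually deliver interrupts.

```c
/* Hypothetical user-space mock; nothing here is taken from the patch. */
typedef unsigned int cpumask_t;          /* assumed: one bit per CPU */

enum ipi_kind { IPI_NONE = 0, IPI_STOP, IPI_NMI };

#define NCPU 4
static enum ipi_kind last_ipi[NCPU];     /* last IPI each mock CPU received */

/* Mock of ordinary IPI delivery: record the kind for every CPU in mask. */
static void ipi_selected(cpumask_t mask, enum ipi_kind kind)
{
    for (int i = 0; i < NCPU; i++)
        if (mask & (1u << i))
            last_ipi[i] = kind;
}

/*
 * "Hard" stop: use the privileged channel (NMI) where the architecture
 * has one, otherwise fall back to the ordinary stop IPI.
 * have_nmi stands in for the per-architecture choice.
 */
static void ipi_hstop_selected(cpumask_t mask, int have_nmi)
{
    ipi_selected(mask, have_nmi ? IPI_NMI : IPI_STOP);
}

/* Used by the shutdown path in the description above. */
static void stop_cpus(cpumask_t mask)
{
    ipi_selected(mask, IPI_STOP);
}

/* Used by kdb in the description above. */
static void stop_cpus_hard(cpumask_t mask, int have_nmi)
{
    ipi_hstop_selected(mask, have_nmi);
}
```

In this mock, stop_cpus() always records a plain stop IPI, while
stop_cpus_hard() records an NMI only when have_nmi is non-zero, which
mirrors the fallback described for architectures other than amd64 and
ia32.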