Date: Wed, 17 Aug 2011 14:04:47 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: freebsd-stable@FreeBSD.org
Subject: Re: panic: spin lock held too long (RELENG_8 from today)
Message-ID: <20110817210446.GA49737@icarus.home.lan>
In-Reply-To: <20110817175201.GB1973@libertas.local.camdensoftware.com>
References: <20110707082027.GX48734@deviant.kiev.zoral.com.ua>
 <4E159959.2070401@sentex.net>
 <4E15A08C.6090407@sentex.net>
 <20110818.023832.373949045518579359.hrs@allbsd.org>
 <20110817175201.GB1973@libertas.local.camdensoftware.com>
On Wed, Aug 17, 2011 at 10:52:01AM -0700, Chip Camden wrote:
> Quoth Hiroki Sato on Thursday, 18 August 2011:
> > Hi,
> >
> > Mike Tancsa <mike@sentex.net> wrote
> > in <4E15A08C.6090407@sentex.net>:
> >
> > mi> On 7/7/2011 7:32 AM, Mike Tancsa wrote:
> > mi> > On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> > mi> >>
> > mi> >> BTW, we had a similar panic, "spinlock held too long", the
> > mi> >> spinlock is the sched lock N, on a busy 8-core box recently
> > mi> >> upgraded to stable/8. Unfortunately, the machine hung dumping
> > mi> >> core, so the stack trace for the owner thread was not available.
> > mi> >>
> > mi> >> I was unable to make any conclusion from the data that was
> > mi> >> present. If the situation is reproducible, you could try to
> > mi> >> revert r221937. This is pure speculation, though.
> > mi> >
> > mi> > Another crash just now after 5hrs uptime. I will try and revert
> > mi> > r221937 unless there is any extra debugging you want me to add
> > mi> > to the kernel instead?
> >
> > I am also suffering from a reproducible panic on an 8-STABLE box, an
> > NFS server with heavy I/O load. I could not get a kernel dump
> > because this panic locked up the machine just after it occurred, but
> > according to the stack trace it was the same as the one posted.
> > Switching to an 8.2R kernel can prevent this panic.
> >
> > Any progress on the investigation?
> >
> > --
> > spin lock 0xffffffff80cb46c0 (sched lock 0) held by 0xffffff01900458c0 (tid 100489) too long
> > panic: spin lock held too long
> > cpuid = 1
> > KDB: stack backtrace:
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> > kdb_backtrace() at kdb_backtrace+0x37
> > panic() at panic+0x187
> > _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
> > _mtx_lock_spin() at _mtx_lock_spin+0x9e
> > sched_add() at sched_add+0x117
> > setrunnable() at setrunnable+0x78
> > sleepq_signal() at sleepq_signal+0x7a
> > cv_signal() at cv_signal+0x3b
> > xprt_active() at xprt_active+0xe3
> > svc_vc_soupcall() at svc_vc_soupcall+0xc
> > sowakeup() at sowakeup+0x69
> > tcp_do_segment() at tcp_do_segment+0x25e7
> > tcp_input() at tcp_input+0xcdd
> > ip_input() at ip_input+0xac
> > netisr_dispatch_src() at netisr_dispatch_src+0x7e
> > ether_demux() at ether_demux+0x14d
> > ether_input() at ether_input+0x17d
> > em_rxeof() at em_rxeof+0x1ca
> > em_handle_que() at em_handle_que+0x5b
> > taskqueue_run_locked() at taskqueue_run_locked+0x85
> > taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
> > fork_exit() at fork_exit+0x11f
> > fork_trampoline() at fork_trampoline+0xe
> > --
> >
> > -- Hiroki
>
> I'm also getting similar panics on 8.2-STABLE. Locks up everything and
> I have to power off. Once, I happened to be looking at the console
> when it happened and copied down the following:
>
> Sleeping thread (tif 100037, pid 0) owns a non-sleepable lock
> panic: sleeping thread
> cpuid=1

No idea, might be relevant to the thread.

> Another time I got:
>
> lock order reversal:
> 1st 0xffffff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
> 2nd 0xffffff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587
>
> I didn't copy down the traceback.

"snaplk" refers to UFS snapshots. The above must have been typed in
manually as well, given the typos in the filenames.
Either this is a different problem, or, if everyone in this thread is
doing UFS snapshots (dump -L, mksnap_ffs, etc.) and hitting this
problem, then I recommend people stop using UFS snapshots. I've ranted
about their unreliability in the past (years upon years ago -- still
seems valid) and just how badly they can "wedge" a system. This is one
of the many (MANY!) reasons why we use rsnapshot/rsync instead. The
atime clobbering issue is the only downside.

I don't see what this has to do with "heavy WAN I/O" unless you're
doing something like dump-over-ssh, in which case see the above
paragraph.

> These panics seem to hit when I'm doing heavy WAN I/O. I can go for
> about a day without one as long as I stay away from the web or even
> chat. Last night this system copied a backup of 35GB over the local
> network without failing, but as soon as I hopped onto Firefox this
> morning, down she went. I don't know if that's coincidence or useful
> data.
>
> I didn't get to say "Thanks" to Eitan Adler for attempting to help me
> with this on Monday night. Thanks, Eitan!

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, US  |
| Making life hard for others since 1977.              PGP 4BD6C0CB  |
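P.P.S. For anyone who wants to try Kostik's suggestion of backing out
r221937: assuming you track src with Subversion (csup users would need
the patch itself), the procedure looks roughly like the below. The
commands are printed rather than executed here, and KERNCONF=GENERIC is
a stand-in for whatever kernel config you actually build:

```shell
# Sketch only: reverse-merge r221937 out of a stable/8 tree.
# /usr/src and KERNCONF=GENERIC are assumptions, not prescriptions.
REV=221937
cmds="cd /usr/src
svn merge -c -$REV .                      # apply the reverse of r$REV
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now"
printf '%s\n' "$cmds"
```

`svn merge -c -N` applies the inverse of change N to the working copy;
rebuild, reboot, and see whether the panic still reproduces.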