Date: Sun, 31 Jul 2011 15:03:21 -0400 (EDT)
From: Rick Macklem
To: Andriy Gapon
Cc: FreeBSD-Current
Subject: Re: Heavy I/O blocks FreeBSD box for several seconds

Andriy Gapon wrote:
> on 31/07/2011 01:55 Rick Macklem said the following:
> > Andriy Gapon wrote:
> >> on 26/07/2011 00:44 Rick Macklem said the following:
> >>> hrs sent me this panic. I'm wondering if it might be relevant to
> >>> this?
> >>
> >> To 'this' what?
> >
> > Well, my thinking (and it is quite likely completely incorrect) was
> > that, since this panic is a result of holding the spin lock too long,
> > it might be another symptom of the same problem as when the scheduler
> > seems to "lock up" for several seconds.
>
> I don't think that there is any issue with the scheduler "locking up"
> or taking too long to make scheduling decisions, at least when we talk
> about performance problems. There can be issues with the scheduler
> making sub-optimal scheduling decisions.
>
> > --> The bug might be some case where the spin lock isn't being
> > released when it should be, and that "could" result in "Heavy I/O
> > blocks FreeBSD box for several seconds". Note that the server was
> > under heavy I/O load when this panic occurred.
> >
> >
> >> A panic is a panic.
> >>
> >>> spin lock 0xffffffff80cb52c0 (sched lock 1) held by
> >>> 0xffffff0012c7f8c0 (tid 100317) too long
> >>> panic: spin lock held too long
> >>
> >> I guess it's more related to the thread on stable@ which has the
> >> panic message as its subject.
> >>
> >>> cpuid = 0
> >>> KDB: stack backtrace:
> >>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> >>> kdb_backtrace() at kdb_backtrace+0x37
> >>> panic() at panic+0x187
> >>> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
> >>> _mtx_lock_spin() at _mtx_lock_spin+0x9e
> >>> sched_add() at sched_add+0x117
> >>> setrunnable() at setrunnable+0x78
> >>> sleepq_signal() at sleepq_signal+0x7a
> >>> cv_signal() at cv_signal+0x3b
> >>> xprt_active() at xprt_active+0xe3
> >>> svc_vc_soupcall() at svc_vc_soupcall+0xc
> >>> sowakeup() at sowakeup+0x69
> >>> tcp_do_segment() at tcp_do_segment+0x2cbd
> >>> tcp_input() at tcp_input+0xcdd
> >>> ip_input() at ip_input+0xac
> >>> netisr_dispatch_src() at netisr_dispatch_src+0x7e
> >>> ether_demux() at ether_demux+0x14d
> >>> ether_input() at ether_input+0x17d
> >>> em_rxeof() at em_rxeof+0x1ca
> >>> em_handle_que() at em_handle_que+0x5b
> >>> taskqueue_run_locked() at taskqueue_run_locked+0x85
> >>> taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
> >>> fork_exit() at fork_exit+0x11f
> >>> fork_trampoline() at fork_trampoline+0xe
> >>> --- trap 0, rip = 0, rsp = 0xffffff8000160d00, rbp = 0 ---
> >>> KDB: enter: panic
> >>> [thread pid 0 tid 100033 ]
> >>> Stopped at kdb_enter+0x3b: movq $0,0x6b4e62(%rip)
> >>> db> ps
> >>>
> >>
> >> --
> >> Andriy Gapon
> >>

Ok, so if the scheduler spin lock is being held too long, what could
cause this, if it isn't a scheduler bug? I can't think of how NFS would
affect this beyond causing a heavy I/O load, but if there is some way
it does, I suppose I need to know what that is?

rick
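
As background for the question above: this class of panic fires when a
thread spins on a contested spin mutex past a hard limit and gives up,
so the backtrace shows the blocked waiter (here the em taskqueue thread
doing the NFS socket upcall), not the thread that actually held the
sched lock (tid 100317). The sketch below is a minimal illustration of
that detection idiom only; it is not the FreeBSD kern_mutex.c code, just
a self-contained userland C11 example, and the names spin_lock_or_panic,
spin_unlock, SPIN_LIMIT, and sched_lock are made up for the example.

/*
 * Simplified illustration of a "spin lock held too long" check.
 * NOT the FreeBSD implementation; a userland sketch using C11 atomics.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define SPIN_LIMIT 10000000UL   /* give up after this many spins */

static atomic_flag sched_lock = ATOMIC_FLAG_INIT;

static void
spin_lock_or_panic(atomic_flag *lock, const char *name)
{
        unsigned long spins = 0;

        /* Spin until the current owner releases the lock. */
        while (atomic_flag_test_and_set_explicit(lock,
            memory_order_acquire)) {
                if (++spins > SPIN_LIMIT) {
                        /*
                         * The owner has held the lock far longer than
                         * any legitimate critical section should take,
                         * so give up rather than hang silently.
                         */
                        fprintf(stderr, "spin lock %s held too long\n",
                            name);
                        abort();        /* stands in for panic(9) */
                }
        }
}

static void
spin_unlock(atomic_flag *lock)
{
        atomic_flag_clear_explicit(lock, memory_order_release);
}

int
main(void)
{
        spin_lock_or_panic(&sched_lock, "sched lock 1");
        /* ... short critical section, e.g. queueing a thread ... */
        spin_unlock(&sched_lock);
        printf("lock acquired and released without tripping the limit\n");
        return (0);
}

The point of the sketch is that the panic is a diagnostic of whoever
was holding the lock being stuck, or of very heavy contention, which is
why the discussion above focuses on what the lock holder was doing
rather than on the thread that panicked.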