From owner-freebsd-hackers@freebsd.org  Wed Aug 10 15:20:16 2016
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4ABD3BB50C9
 for <freebsd-hackers@mailman.ysv.freebsd.org>;
 Wed, 10 Aug 2016 15:20:16 +0000 (UTC)
 (envelope-from hoomanfazaeli@gmail.com)
Received: from mail-wm0-x242.google.com (mail-wm0-x242.google.com
 [IPv6:2a00:1450:400c:c09::242])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id C87D41696
 for <freebsd-hackers@freebsd.org>; Wed, 10 Aug 2016 15:20:15 +0000 (UTC)
 (envelope-from hoomanfazaeli@gmail.com)
Received: by mail-wm0-x242.google.com with SMTP id q128so10202810wma.1
 for <freebsd-hackers@freebsd.org>; Wed, 10 Aug 2016 08:20:15 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=message-id:date:from:user-agent:mime-version:to:cc:subject
 :references:in-reply-to:content-transfer-encoding;
 bh=5vG+6qZnLABAeaijyaFtdq6tTvF9nzfgNfI1nlxYEyU=;
 b=pf4Fe/hBFMTWRTBNIl69/2D2yQm3sfdAjnjZcjwfhFbjLOT/prpnaZBFh6IxyweUoT
 en6ymxSO/W9pvNI4iXW8b4J/GnDArRIS/B9TB3z791ugj15bcTO8GYIGbTXbUfQ3C5zO
 hNTFDDv67nppkZO1okipwpi4xOnFi8wwDUTA0FkDOee90BDgTbzKMmQpsRN9x9xIycuv
 nXyGDmjFGrE8wGROuiejoYagQLj9Enh/27a4rI2N+2mNHqiE713aQejJE1lw07Jf/dYs
 uZfDjP64iILaIW9/JhW54iQKCAg86GxFXywbRjx/oPk+sCP98rO7lz0H1Ttf0btZKH6m
 FG/A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:message-id:date:from:user-agent:mime-version:to
 :cc:subject:references:in-reply-to:content-transfer-encoding;
 bh=5vG+6qZnLABAeaijyaFtdq6tTvF9nzfgNfI1nlxYEyU=;
 b=N/7Lyxp10h1qxLE+/R4bkQc8652qx3JPn6g0l5rpyEUjDaBr3Bbwn5rh4ihQHs6wsd
 Nhu+okK1sPAGWLAXWBzwGU8sWeGVK0yEaLbAtPdNIdBpg/zO6OtsPsBlu52nuyk1ozty
 YSWPRdx2fLVKylrtNiEVoph1Yrvrr8NoN1vKX7VTSU3Wzzi/dIL3Oy9kX8PT7JJZ2n+s
 YRGrEKfSkjJ3l4C8fgj8DTEqr47eM0gVf18YagLpbEXVPKLJXaNcm+u+y5zIxY1f6AwM
 LHiTL5htlVUZna33G0Tw65/Yk26246206yQDd2a2kkygUMEnYhpZuNW9tMQ25HNVnKjO
 +1Eg==
X-Gm-Message-State: AEkoout1IigN0Y9xVGp9iDg0Nc351s/CSmClSP7gsEwqKIeMK2f79HoS14cFxrUxxWsotg==
X-Received: by 10.194.148.19 with SMTP id to19mr5375601wjb.81.1470842414089;
 Wed, 10 Aug 2016 08:20:14 -0700 (PDT)
Received: from [192.168.2.30] ([2.190.216.101])
 by smtp.googlemail.com with ESMTPSA id d8sm8895054wmi.0.2016.08.10.08.20.12
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Wed, 10 Aug 2016 08:20:13 -0700 (PDT)
Message-ID: <57AB462A.2080608@gmail.com>
Date: Wed, 10 Aug 2016 19:50:10 +0430
From: Hooman Fazaeli <hoomanfazaeli@gmail.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: Konstantin Belousov <kostikbel@gmail.com>
CC: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: 9.3-RELEASE panic: spin lock held too long
References: <57AB349B.2010805@gmail.com> <20160810141948.GP83214@kib.kiev.ua>
In-Reply-To: <20160810141948.GP83214@kib.kiev.ua>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 10 Aug 2016 15:20:16 -0000

On 2016-08-10 18:49, Konstantin Belousov wrote:
> On Wed, Aug 10, 2016 at 06:35:15PM +0430, Hooman Fazaeli wrote:
>> Hi
>>
>> on a 9.3-REL i386 box we have occasional "spin lock held too long" panics.
>>
>> System info:
>> -------------
>> - Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz CPU (4 cores, no hyper theading)
>> - 4G non-ECC RAM
>> - asterisk-1.8.30.0 from ports
>> - dahdi-kmod26-2.6.1.r10738 from ports
>> - powerd disabled.
>> - Workload: ISDN & SIP call processing.
>> ------------
>>
>> The panics are either on 'sched lock' or 'turnstile lock' spin locks.
>>
>> PANIC 1
>> =======
>> As below trace shows:
>>
>> 1- input arrives on a UDP socket
>> 2- doselwakeup is called.
>> 3- That wakeup call ends up in sched_add.
>> 4- sched_add grabs 'sched lock 0' spin lock, and aparenlty, holds it for a too long time.
>> 5- The pancing thread does the same calls as owner thread but panics because
>>      it can't grab the the same spin lock.
>>
>>   > kgdb /boot/kernel/kernel /var/crash/vmcore.14
>> ...
>> kernel trap 12 with interrupts disabled
>> spin lock 0xc140a4c0 (sched lock 0) held by 0xc807a2f0 (tid 100045) too long
>> panic: spin lock held too long
>> cpuid = 3
>> KDB: stack backtrace:
>> #0 0xc0b17eaf at kdb_backtrace+0x4f
>> #1 0xc0adeaef at panic+0x16f
>> #2 0xc0ac9cff at _mtx_lock_spin_failed+0x3f
>> #3 0xc0ac9e75 at _mtx_lock_spin+0x165
>> #4 0xc0b096c5 at sched_add+0xf5
>> #5 0xc0b09890 at sched_wakeup+0x70
>> #6 0xc0ae8968 at setrunnable+0x88
>> #7 0xc0b2227e at sleepq_resume_thread+0x12e
>> #8 0xc0b22fd3 at sleepq_broadcast+0xd3
>> #9 0xc0a8c4cd at cv_broadcastpri+0x4d
>> #10 0xc0b2a406 at doselwakeup+0xe6
>> #11 0xc0b2a4be at selwakeuppri+0xe
>> #12 0xc0a9fa59 at knote_enqueue+0x59
>> #13 0xc0aa073f at kqueue_register+0x84f
>> #14 0xc0aa09f3 at kern_kevent+0xf3
>> #15 0xc0aa16ce at sys_kevent+0x19e
>> #16 0xc0fcc8c3 at syscall+0x443
>> #17 0xc0fb60f1 at Xint0x80_syscall+0x21
>> Uptime: 7m44s
>>
>>   > bt
>> #0  doadump (textdump=1) at pcpu.h:250
>> #1  0xc0ade835 in kern_reboot (howto=260) at ../../../kern/kern_shutdown.c:454
>> #2  0xc0adeb32 in panic (fmt=<value optimized out>) at ../../../kern/kern_shutdown.c:642
>> #3  0xc0ac9cff in _mtx_lock_spin_failed (m=0x0) at ../../../kern/kern_mutex.c:515
>> #4  0xc0ac9e75 in _mtx_lock_spin (m=0xc140a4c0, tid=3384060112, opts=0, file=0x0, line=0) at ../../../kern/kern_mutex.c:557
>> #5  0xc0b096c5 in sched_add (td=0xc9b00bc0, flags=0) at ../../../kern/sched_ule.c:1153
>> #6  0xc0b09890 in sched_wakeup (td=0xc9b00bc0) at ../../../kern/sched_ule.c:1991
>> #7  0xc0ae8968 in setrunnable (td=0xc9b00bc0) at ../../../kern/kern_synch.c:537
>> #8  0xc0b2227e in sleepq_resume_thread (sq=0xc869fd40, td=0xc9b00bc0, pri=104) at ../../../kern/subr_sleepqueue.c:763
>> #9  0xc0b22fd3 in sleepq_broadcast (wchan=0xc95741e4, flags=1, pri=104, queue=0) at ../../../kern/subr_sleepqueue.c:865
>> #10 0xc0a8c4cd in cv_broadcastpri (cvp=0xc95741e4, pri=104) at ../../../kern/kern_condvar.c:448
>> #11 0xc0b2a406 in doselwakeup (sip=0xc963faac, pri=104) at ../../../kern/sys_generic.c:1683
>> #12 0xc0b2a4be in selwakeuppri (sip=0xc963faac, pri=104) at ../../../kern/sys_generic.c:1651
>> #13 0xc0a9fa59 in knote_enqueue (kn=<value optimized out>) at ../../../kern/kern_event.c:1786
>> #14 0xc0aa073f in kqueue_register (kq=0xc963fa80, kev=0xf0e07b20, td=0xc9b4a8d0, waitok=1) at ../../../kern/kern_event.c:1154
>> #15 0xc0aa09f3 in kern_kevent (td=0xc9b4a8d0, fd=152, nchanges=2, nevents=0, k_ops=0xf0e07c20, timeout=0x0) at ../../../kern/kern_event.c:850
>> #16 0xc0aa16ce in sys_kevent (td=0xc9b4a8d0, uap=0xf0e07ccc) at ../../../kern/kern_event.c:771
>> #17 0xc0fcc8c3 in syscall (frame=0xf0e07d08) at subr_syscall.c:135
>> #18 0xc0fb60f1 in Xint0x80_syscall () at ../../../i386/i386/exception.s:270
>> #19 0x00000033 in ?? ()
>> Previous frame inner to this frame (corrupt stack?)
>>
>> (kgdb) tid 100045
>> [Switching to thread 34 (Thread 100045)]#0  cpustop_handler () at ../../../i386/i386/mp_machdep.c:1496
>> 1496    ../../../i386/i386/mp_machdep.c: No such file or directory.
>>           in ../../../i386/i386/mp_machdep.c
>>
>> (kgdb) bt
>> #0  cpustop_handler () at ../../../i386/i386/mp_machdep.c:1496
>> #1  0xc0fc1805 in ipi_nmi_handler () at ../../../i386/i386/mp_machdep.c:1478
>> #2  0xc0fccf38 in trap (frame=0xe1be9620) at ../../../i386/i386/trap.c:227
>> #3  0xc0fb605c in calltrap () at ../../../i386/i386/exception.s:170
>> #4  0xc0fddf45 in DELAY (n=1) at ../../../x86/isa/clock.c:283
>> #5  0xc0ac9e6c in _mtx_lock_spin (m=0xc140b1c0, tid=3355943664, opts=0, file=0x0, line=0) at ../../../kern/kern_mutex.c:555
>> #6  0xc0b096c5 in sched_add (td=0xc7de0000, flags=4) at ../../../kern/sched_ule.c:1153
>> #7  0xc0aae008 in intr_event_schedule_thread (ie=<value optimized out>) at ../../../kern/kern_intr.c:921
>> #8  0xc0aaf8fd in swi_sched (cookie=0xc7d95300, flags=<value optimized out>) at ../../../kern/kern_intr.c:1174
>> #9  0xc0af5b53 in callout_tick () at ../../../kern/kern_timeout.c:361
>> #10 0xc0a8c3e1 in hardclock_cnt (cnt=1, usermode=0) at ../../../kern/kern_clock.c:554
>> #11 0xc0fd403f in handleevents (now=0xe1be9870, fake=0) at ../../../kern/kern_clocksource.c:215
>> #12 0xc0fd5d3f in timercb (et=0xc148c860, arg=0x0) at ../../../kern/kern_clocksource.c:390
>> #13 0xc0fe731f in lapic_handle_timer (frame=0xe1be98b4) at ../../../x86/x86/local_apic.c:818
>> #14 0xc0fb6600 in Xtimerint () at apic_vector.s:108
>> #15 0xc0aca0df in _mtx_lock_sleep (m=0xc95741d0, tid=3355943664, opts=<value optimized out>, file=0x0, line=0) at ../../../kern/kern_mutex.c:400
>> #16 0xc0b2a3e9 in doselwakeup (sip=0xc9a491f4, pri=104) at ../../../kern/sys_generic.c:1681
>> #17 0xc0b2a4be in selwakeuppri (sip=0xc9a491f4, pri=104) at ../../../kern/sys_generic.c:1651
>> #18 0xc0b4886f in sowakeup (so=0xc9a491a0, sb=0xc9a491f4) at ../../../kern/uipc_sockbuf.c:182
>> #19 0xc0c9680d in udp_append (inp=0xc87bb0fc, ip=0xc8ead80e, n=0xc8ebf500, off=28, udp_in=0xe1be9a7c) at ../../../netinet/udp_usrreq.c:330
>> #20 0xc0c98ad6 in udp_input (m=0xc8ebf500, off=20) at ../../../netinet/udp_usrreq.c:616
>> #21 0xc0c0d5e7 in ip_input (m=0xc8ebf500) at ../../../netinet/ip_input.c:760
>> #22 0xc0bae68f in netisr_dispatch_src (proto=1, source=0, m=0xc8ebf500) at ../../../net/netisr.c:1013
>> #23 0xc0bae930 in netisr_dispatch (proto=1, m=0xc8ebf500) at ../../../net/netisr.c:1104
>> #24 0xc0ba50c1 in ether_demux (ifp=0xc807f000, m=0xc8ebf500) at ../../../net/if_ethersubr.c:943
>> #25 0xc0ba551f in ether_nh_input (m=0xc8ebf500) at ../../../net/if_ethersubr.c:762
>> #26 0xc0bae68f in netisr_dispatch_src (proto=9, source=0, m=0xc8ebf500) at ../../../net/netisr.c:1013
>> #27 0xc0bae930 in netisr_dispatch (proto=9, m=0xc8ebf500) at ../../../net/netisr.c:1104
>> #28 0xc0ba4c09 in ether_input (ifp=0xc807f000, m=0xc8ebf500) at ../../../net/if_ethersubr.c:803
>> #29 0xc06930c2 in igb_rxeof (que=0xc8078180, count=99, done=0x0) at ../../../dev/e1000/if_igb.c:4735
>> #30 0xc0693328 in igb_msix_que (arg=0xc8078180) at ../../../dev/e1000/if_igb.c:1601
>> #31 0xc0aae18b in intr_event_execute_handlers (p=0xc7d9e5b0, ie=0xc8077a00) at ../../../kern/kern_intr.c:1272
>> #32 0xc0aaf990 in ithread_loop (arg=0xc80a43e0) at ../../../kern/kern_intr.c:1285
>> #33 0xc0aaa96f in fork_exit (callout=0xc0aaf910 <ithread_loop>, arg=0xc80a43e0, frame=0xe1be9d08) at ../../../kern/kern_fork.c:996
>> #34 0xc0fb6104 in fork_trampoline () at ../../../i386/i386/exception.s:279
> I am not convinced that the thread 100045 is the owner of the spin
> lock.  mtx_lock_spin()->DELAY() frames on the stack mean that the
> thread is trying to obtain the spin lock but did not succeeded yet.
> Or did you not mean that ?
>
> You may obtain the current owner by printing the mutex content from kgdb.
> Look at the mtx_lock field, it is a pointer to the struct thread owning
> the spin lock, modulo some flags in lower bits.
 > kgdb /boot/kernel/kernel /var/crash/vmcore.14
...
...
(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:250
#1  0xc0ade835 in kern_reboot (howto=260) at ../../../kern/kern_shutdown.c:454
#2  0xc0adeb32 in panic (fmt=<value optimized out>) at ../../../kern/kern_shutdown.c:642
#3  0xc0ac9cff in _mtx_lock_spin_failed (m=0x0) at ../../../kern/kern_mutex.c:515
#4  0xc0ac9e75 in _mtx_lock_spin (m=0xc140a4c0, tid=3384060112, opts=0, file=0x0, line=0) at ../../../kern/kern_mutex.c:557
#5  0xc0b096c5 in sched_add (td=0xc9b00bc0, flags=0) at ../../../kern/sched_ule.c:1153
#6  0xc0b09890 in sched_wakeup (td=0xc9b00bc0) at ../../../kern/sched_ule.c:1991
#7  0xc0ae8968 in setrunnable (td=0xc9b00bc0) at ../../../kern/kern_synch.c:537
#8  0xc0b2227e in sleepq_resume_thread (sq=0xc869fd40, td=0xc9b00bc0, pri=104) at ../../../kern/subr_sleepqueue.c:763
#9  0xc0b22fd3 in sleepq_broadcast (wchan=0xc95741e4, flags=1, pri=104, queue=0) at ../../../kern/subr_sleepqueue.c:865
#10 0xc0a8c4cd in cv_broadcastpri (cvp=0xc95741e4, pri=104) at ../../../kern/kern_condvar.c:448
#11 0xc0b2a406 in doselwakeup (sip=0xc963faac, pri=104) at ../../../kern/sys_generic.c:1683
#12 0xc0b2a4be in selwakeuppri (sip=0xc963faac, pri=104) at ../../../kern/sys_generic.c:1651
#13 0xc0a9fa59 in knote_enqueue (kn=<value optimized out>) at ../../../kern/kern_event.c:1786
#14 0xc0aa073f in kqueue_register (kq=0xc963fa80, kev=0xf0e07b20, td=0xc9b4a8d0, waitok=1) at ../../../kern/kern_event.c:1154
#15 0xc0aa09f3 in kern_kevent (td=0xc9b4a8d0, fd=152, nchanges=2, nevents=0, k_ops=0xf0e07c20, timeout=0x0) at ../../../kern/kern_event.c:850
#16 0xc0aa16ce in sys_kevent (td=0xc9b4a8d0, uap=0xf0e07ccc) at ../../../kern/kern_event.c:771
#17 0xc0fcc8c3 in syscall (frame=0xf0e07d08) at subr_syscall.c:135
#18 0xc0fb60f1 in Xint0x80_syscall () at ../../../i386/i386/exception.s:270
#19 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)

(kgdb) up 4
#4  0xc0ac9e75 in _mtx_lock_spin (m=0xc140a4c0, tid=3384060112, opts=0, file=0x0, line=0) at ../../../kern/kern_mutex.c:557
557     ../../../kern/kern_mutex.c: No such file or directory.
         in ../../../kern/kern_mutex.c

(kgdb) p *m
$1 = {lock_object = {lo_name = 0xc140ab08 "sched lock 0", lo_flags = 720896, lo_data = 0, lo_witness = 0x0}, mtx_lock = 3355943664}

------------

As you see, the mtx_lock is 3355943664 (0xc807a2f0), the same TID reported in panic string.

(kgdb) info threads
...
34 Thread 100045 (PID=12: intr/irq267: igb0:que 0) sched_switch (td=0xc807a2f0, newtd=0xc7da18d0, flags=265) at ../../../kern/sched_ule.c:1904
...


-- 
Best regards
Hooman Fazaeli