From owner-freebsd-hackers@freebsd.org Mon Feb 8 09:09:20 2016 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 14AE5A9E300 for ; Mon, 8 Feb 2016 09:09:20 +0000 (UTC) (envelope-from prt@prt.org) Received: from smtp3.mail.clearhost.co.uk (smtp3.mail.clearhost.co.uk [IPv6:2001:1420::25:103]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "*.mail.clearhost.co.uk", Issuer "RapidSSL CA" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id B68F01A06 for ; Mon, 8 Feb 2016 09:09:19 +0000 (UTC) (envelope-from prt@prt.org) Received: from [2001:1420:a:105:c62c:3ff:fe2f:bf] (port=49395 helo=parsnip.heronsbrook.org.uk) by smtp3.mail.clearhost.co.uk with esmtpa (Exim 4.76 (FreeBSD)) (envelope-from ) id 1aShoq-0007ik-Uk for freebsd-hackers@freebsd.org; Mon, 08 Feb 2016 09:09:16 +0000 To: "freebsd-hackers@freebsd.org" From: Paul Thornton Subject: Troubleshooting panic on 10.2-RELEASE Message-ID: <56B85B7E.9080705@prt.org> Date: Mon, 8 Feb 2016 09:10:22 +0000 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Thunderbird/38.2.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 08 Feb 2016 09:09:20 -0000 Hi My Kernel debugging isn't up to troubleshooting what's going on here. We have two out of 10 identical VMs that regularly (between 5 and 20 hours) panic. On one, I have recompiled GENERIC to get a kernel.debug and have crash dumps being recorded. They are running GENERIC kernels with ZFS-on-root (dmesg at the bottom of this mail). All of the other machines work fine, although the ones that are crashing have the highest use at the moment. The panic is lock-related, and going by pid 12 ([intr]) is interrupt related to network tx/rx. Unread portion of the kernel message buffer: Sleeping thread (tid 100024, pid 12) owns a non-sleepable lock KDB: stack backtrace of thread 100024: #0 0xffffffff80952061 at mi_switch+0xe1 #1 0xffffffff8099043a at sleepq_wait+0x3a #2 0xffffffff80951a87 at _sleep+0x287 #3 0xffffffff80a040c4 at filt_bpfread+0x94 #4 0xffffffff809079eb at knote+0xdb #5 0xffffffff80a00acb at catchpacket+0x67b #6 0xffffffff80a00d20 at bpf_mtap+0x1e0 #7 0xffffffff80a0f634 at ether_nh_input+0x174 #8 0xffffffff80a177d2 at netisr_dispatch_src+0x62 #9 0xffffffff80df8fcf at vmxnet3_rxq_eof+0x4ff #10 0xffffffff80df86f1 at vmxnet3_legacy_intr+0xe1 #11 0xffffffff8091482b at intr_event_execute_handlers+0xab #12 0xffffffff80914c76 at ithread_loop+0x96 #13 0xffffffff8091244a at fork_exit+0x9a #14 0xffffffff80d30cde at fork_trampoline+0xe panic: sleeping thread cpuid = 0 So far, out of 5 dumps, I see two patterns emerging. One, like the above, involves vmxnet3, the other (to my untrained eye) looks like a write on a Unix socket. They both share knote. (kgdb) backtrace #0 doadump (textdump=) at pcpu.h:219 #1 0xffffffff80948642 in kern_reboot (howto=260) at ../../../kern/kern_shutdown.c:451 #2 0xffffffff80948a25 in vpanic (fmt=, ap=) at ../../../kern/kern_shutdown.c:758 #3 0xffffffff809488b3 in panic (fmt=0x0) at ../../../kern/kern_shutdown.c:687 #4 0xffffffff80995f89 in propagate_priority (td=) at ../../../kern/subr_turnstile.c:227 #5 0xffffffff80996a0e in turnstile_wait (ts=, owner=, queue=) at ../../../kern/subr_turnstile.c:743 #6 0xffffffff8092ea3b in __mtx_lock_sleep (c=0xfffff80027b63718, tid=18446735277657056416, opts=, file=, line=0) at ../../../kern/kern_mutex.c:522 #7 0xffffffff809079a5 in knote (list=0xfffff80027ca50a8, hint=0, lockflags=) at ../../../kern/kern_event.c:1873 #8 0xffffffff80a00acb in catchpacket (d=, pkt=, pktlen=, snaplen=, cpfn=, bt=) at ../../../net/bpf.c:980 #9 0xffffffff80a00d20 in bpf_mtap (bp=0xfffff800027e3680, m=0xfffff800274d2b00) at ../../../net/bpf.c:2142 #10 0xffffffff80a0f634 in ether_nh_input (m=) at ../../../net/if_ethersubr.c:520 #11 0xffffffff80a177d2 in netisr_dispatch_src (proto=, source=, m=0x0) at ../../../net/netisr.c:976 #12 0xffffffff80df8fcf in vmxnet3_rxq_eof (rxq=0xfffff80002797400) at ../../../dev/vmware/vmxnet3/if_vmx.c:2076 #13 0xffffffff80df86f1 in vmxnet3_legacy_intr (xsc=0xfffff800027c3800) at ../../../dev/vmware/vmxnet3/if_vmx.c:2245 #14 0xffffffff8091482b in intr_event_execute_handlers (p=, ie=0xfffff800027ea100) at ../../../kern/kern_intr.c:1264 #15 0xffffffff80914c76 in ithread_loop (arg=0xfffff800027ca7a0) at ../../../kern/kern_intr.c:1277 #16 0xffffffff8091244a in fork_exit (callout=0xffffffff80914be0 , arg=0xfffff800027ca7a0, frame=0xfffffe00934ccac0) at ../../../kern/kern_fork.c:1018 #17 0xffffffff80d30cde in fork_trampoline () at ../../../amd64/amd64/exception.S:611 #18 0x0000000000000000 in ?? () [From a different panic and reboot] (kgdb) backtrace #0 doadump (textdump=) at pcpu.h:219 #1 0xffffffff80948642 in kern_reboot (howto=260) at ../../../kern/kern_shutdown.c:451 #2 0xffffffff80948a25 in vpanic (fmt=, ap=) at ../../../kern/kern_shutdown.c:758 #3 0xffffffff809488b3 in panic (fmt=0x0) at ../../../kern/kern_shutdown.c:687 #4 0xffffffff80995f89 in propagate_priority (td=) at ../../../kern/subr_turnstile.c:227 #5 0xffffffff80996a0e in turnstile_wait (ts=, owner=, queue=) at ../../../kern/subr_turnstile.c:743 #6 0xffffffff8092ea3b in __mtx_lock_sleep (c=0xfffff8002770de18, tid=18446735278278059168, opts=, file=, line=0) at ../../../kern/kern_mutex.c:522 #7 0xffffffff809079a5 in knote (list=0xfffff80027866b80, hint=0, lockflags=) at ../../../kern/kern_event.c:1873 #8 0xffffffff809bcc35 in sowakeup (so=0xfffff80027866ae0, sb=0xfffff80027866b70) at ../../../kern/uipc_sockbuf.c:189 #9 0xffffffff809cce46 in uipc_send (so=0xfffff8002786fae0, flags=, m=, nam=0x0, control=0x0, td=) at ../../../kern/uipc_usrreq.c:902 #10 0xffffffff809c0526 in sosend_generic (so=0xfffff8002786fae0, addr=0x0, uio=0xfffffe00935fa960, top=, control=, flags=, td=0x0) at ../../../kern/uipc_socket.c:1283 #11 0xffffffff809a4869 in soo_write (fp=, uio=0xfffffe00935fa960, active_cred=, flags=, td=) at ../../../kern/sys_socket.c:103 #12 0xffffffff8099c647 in dofilewrite (td=0xfffff800276e24a0, fd=7, fp=0xfffff80027871d20, auio=0xfffffe00935fa960, offset=, flags=0) at file.h:304 #13 0xffffffff8099c378 in kern_writev (td=0xfffff800276e24a0, fd=7, auio=0xfffffe00935fa960) at ../../../kern/sys_generic.c:481 #14 0xffffffff8099c303 in sys_write (td=, uap=) at ../../../kern/sys_generic.c:396 #15 0xffffffff80d4b3a7 in amd64_syscall (td=0xfffff800276e24a0, traced=0) at subr_syscall.c:134 #16 0xffffffff80d30a8b in Xfast_syscall () at ../../../amd64/amd64/exception.S:396 #17 0x0000000801a1cc0a in ?? () We haven't seen this on any other 10.2-RELEASE install - either on a VM or physical machine. What's the best way forward to troubleshoot this? Thanks Paul. boot dmesg follows: Copyright (c) 1992-2015 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 10.2-RELEASE-p8 #1: Fri Feb 5 16:22:56 UTC 2016 root@portal.dev:/usr/src/sys/amd64/compile/GENERIC amd64 FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512 CPU: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz (3500.00-MHz K8-class CPU) Origin="GenuineIntel" Id=0x306f2 Family=0x6 Model=0x3f Stepping=2 Features=0xfa3fbff Features2=0xfdfa3203 AMD Features=0x2c100800 AMD Features2=0x21 Structured Extended Features=0x272a XSAVE Features=0x1 TSC: P-state invariant Hypervisor: Origin = "VMwareVMware" real memory = 2147483648 (2048 MB) avail memory = 2050248704 (1955 MB) Event timer "LAPIC" quality 600 ACPI APIC Table: MADT: Forcing active-low polarity and level trigger for SCI ioapic0 irqs 0-23 on motherboard random: initialized kbd1 at kbdmux0 acpi0: on motherboard acpi0: Power Button (fixed) Timecounter "HPET" frequency 14318180 Hz quality 950 cpu0: on acpi0 attimer0: port 0x40-0x43 irq 0 on acpi0 Timecounter "i8254" frequency 1193182 Hz quality 0 Event timer "i8254" frequency 1193182 Hz quality 100 atrtc0: port 0x70-0x71 irq 8 on acpi0 Event timer "RTC" frequency 32768 Hz quality 0 Timecounter "ACPI-fast" frequency 3579545 Hz quality 900 acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0 pcib0: port 0xcf8-0xcff on acpi0 pci0: on pcib0 pcib1: at device 1.0 on pci0 pci1: on pcib1 isab0: at device 7.0 on pci0 isa0: on isab0 atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x1060-0x106f at device 7.1 on pci0 ata0: at channel 0 on atapci0 ata1: at channel 1 on atapci0 pci0: at device 7.3 (no driver attached) vgapci0: port 0x1070-0x107f mem 0xe8000000-0xefffffff,0xfe000000-0xfe7fffff irq 16 at device 15.0 on pci0 vgapci0: Boot video device pcib2: at device 17.0 on pci0 pci2: on pcib2 pcib3: at device 21.0 on pci0 pci3: on pcib3 vmx0: port 0x4000-0x400f mem 0xfd5fc000-0xfd5fcfff,0xfd5fd000-0xfd5fdfff,0xfd5fe000-0xfd5fffff irq 18 at device 0.0 on pci3 vmx0: Ethernet address: 00:0c:29:e5:ef:c4 pcib4: at device 21.1 on pci0 pci4: on pcib4 pcib5: at device 21.2 on pci0 pci5: on pcib5 pcib6: at device 21.3 on pci0 pci6: on pcib6 pcib7: at device 21.4 on pci0 pci7: on pcib7 pcib8: at device 21.5 on pci0 pci8: on pcib8 pcib9: at device 21.6 on pci0 pci9: on pcib9 pcib10: at device 21.7 on pci0 pci10: on pcib10 pcib11: at device 22.0 on pci0 pci11: on pcib11 vmx1: port 0x5000-0x500f mem 0xfd4fc000-0xfd4fcfff,0xfd4fd000-0xfd4fdfff,0xfd4fe000-0xfd4fffff irq 19 at device 0.0 on pci11 vmx1: Ethernet address: 0a:d1:c0:01:03:c0 pcib12: at device 22.1 on pci0 pci12: on pcib12 pcib13: at device 22.2 on pci0 pci13: on pcib13 pcib14: at device 22.3 on pci0 pci14: on pcib14 pcib15: at device 22.4 on pci0 pci15: on pcib15 pcib16: at device 22.5 on pci0 pci16: on pcib16 pcib17: at device 22.6 on pci0 pci17: on pcib17 pcib18: at device 22.7 on pci0 pci18: on pcib18 pcib19: at device 23.0 on pci0 pci19: on pcib19 mpt0: port 0x6000-0x60ff mem 0xfd3ec000-0xfd3effff,0xfd3f0000-0xfd3fffff irq 16 at device 0.0 on pci19 mpt0: MPI Version=1.5.0.0 pcib20: at device 23.1 on pci0 pci20: on pcib20 pcib21: at device 23.2 on pci0 pci21: on pcib21 pcib22: at device 23.3 on pci0 pci22: on pcib22 pcib23: at device 23.4 on pci0 pci23: on pcib23 pcib24: at device 23.5 on pci0 pci24: on pcib24 pcib25: at device 23.6 on pci0 pci25: on pcib25 pcib26: at device 23.7 on pci0 pci26: on pcib26 pcib27: at device 24.0 on pci0 pci27: on pcib27 vmx2: port 0x7000-0x700f mem 0xfd2fc000-0xfd2fcfff,0xfd2fd000-0xfd2fdfff,0xfd2fe000-0xfd2fffff irq 17 at device 0.0 on pci27 vmx2: Ethernet address: 00:0c:29:e5:ef:d8 pcib28: at device 24.1 on pci0 pci28: on pcib28 pcib29: at device 24.2 on pci0 pci29: on pcib29 pcib30: at device 24.3 on pci0 pci30: on pcib30 pcib31: at device 24.4 on pci0 pci31: on pcib31 pcib32: at device 24.5 on pci0 pci32: on pcib32 pcib33: at device 24.6 on pci0 pci33: on pcib33 pcib34: at device 24.7 on pci0 pci34: on pcib34 acpi_acad0: on acpi0 atkbdc0: port 0x60,0x64 irq 1 on acpi0 atkbd0: irq 1 on atkbdc0 kbd0 at atkbd0 atkbd0: [GIANT-LOCKED] psm0: irq 12 on atkbdc0 psm0: [GIANT-LOCKED] psm0: model IntelliMouse, device ID 3 orm0: at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff,0xc9000-0xc9fff,0xca000-0xcbfff,0xcc000-0xccfff,0xdc000-0xdffff,0xe0000-0xe7fff on isa0 sc0: at flags 0x100 on isa0 sc0: VGA <16 virtual consoles, flags=0x300> vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0 ppc0: cannot reserve I/O port range acpi_throttle0: on cpu0 Timecounters tick every 1.000 msec ipfw2 (+ipv6) initialized, divert loadable, nat loadable, default to deny, logging disabled random: unblocking device. da0 at mpt0 bus 0 scbus2 target 0 lun 0 da0: Fixed Direct Access SCSI-2 device da0: 300.000MB/s transfers da0: Command Queueing enabled da0: 40960MB (83886080 512 byte sectors: 255H 63S/T 5221C) da0: quirks=0x40 Timecounter "TSC-low" frequency 1749998500 Hz quality 1000 Trying to mount root from ufs:/dev/da0p3 [rw]... WARNING: / was not properly dismounted ZFS NOTICE: Prefetch is disabled by default if less than 4GB of RAM is present; to enable, add "vfs.zfs.prefetch_disable=0" to /boot/loader.conf. ZFS filesystem version: 5 ZFS storage pool version: features support (5000)