Date: Thu, 21 Apr 2016 15:16:09 +0100
From: Justin Clift <justin@postgresql.org>
To: freebsd-infiniband@freebsd.org
Subject: Kernel panic (page fault) on 10.3-STABLE with IB & VIMAGE modules
Message-ID: <210EB5F8-DEC1-4F5E-9CC7-003AF3784B50@postgresql.org>

Hi all,

Have been hitting a kernel panic (page fault) with the IB modules
loaded on 10.3-STABLE (compiled multiple times over the last few
days, all panicking).

Spent several hours narrowing down the cause, and it's definitely a
bad interaction between the IB modules (unsure which) + the "VIMAGE"
module.

I'll fill out a bug report in a bit.  In the meantime, does the
below have any useful info in it that I can use for further
investigation?

  (commands taken from https://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-gdb.html)

***********************************************************************************
root@cluster1:/usr/obj/usr/src/sys/CONNECTX # kgdb kernel.debug /var/crash/vmcore.0
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
code segment      = base 0x0, limit 0xfffff, type 0x1b
                  = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags  = interrupt enabled, resume, IOPL = 0
current process   = 12 (irq271: mlx4_core0)
trap number       = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff807263d0 at kdb_backtrace+0x60
#1 0xffffffff806e8c76 at vpanic+0x126
#2 0xffffffff806e8b43 at panic+0x43
#3 0xffffffff80b8bf3b at trap_fatal+0x36b
#4 0xffffffff80b8c23d at trap_pfault+0x2ed
#5 0xffffffff80b8b8ba at trap+0x47a
#6 0xffffffff80b71892 at calltrap+0x8
#7 0xffffffff807be1a2 at netisr_dispatch_src+0x62
#8 0xffffffff808f89fa at ipoib_cm_handle_rx_wc+0x22a
#9 0xffffffff808fcc98 at ipoib_ib_completion+0x78
#10 0xffffffff80930c43 at mlx4_cq_completion+0x63
#11 0xffffffff80933d43 at mlx4_eq_int+0x2c3
#12 0xffffffff80932fac at mlx4_msi_x_interrupt+0xc
#13 0xffffffff806b35cb at intr_event_execute_handlers+0xab
#14 0xffffffff806b3a16 at ithread_loop+0x96
#15 0xffffffff806b104a at fork_exit+0x9a
#16 0xffffffff80b71dce at fork_trampoline+0xe
Uptime: 3m47s
Dumping 485 out of 7857 MB:..4%..14%..24%..33%..43%..53%..63%..73%..83%..93%

Reading symbols from /boot/kernel/ums.ko.symbols...done.
Loaded symbols for /boot/kernel/ums.ko.symbols
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219         __asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) list *0xffffffff808f89fa
0xffffffff808f89fa is in ipoib_cm_handle_rx_wc (/usr/src/sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c:565).
560         mb->m_pkthdr.rcvif = dev;
561         proto = *mtod(mb, uint16_t *);
562         m_adj(mb, IPOIB_ENCAP_LEN);
563
564         IPOIB_MTAP_PROTO(dev, mb, proto);
565         ipoib_demux(dev, mb, ntohs(proto));
566
567     repost:
568         if (has_srq) {
569             if (unlikely(ipoib_cm_post_receive_srq(priv, wr_id)))
Current language:  auto; currently minimal
(kgdb) list *0xffffffff807be1a2
0xffffffff807be1a2 is in netisr_dispatch_src (/usr/src/sys/net/netisr.c:976).
971         if (dispatch_policy == NETISR_DISPATCH_DIRECT) {
972             nwsp = DPCPU_PTR(nws);
973             npwp = &nwsp->nws_work[proto];
974             npwp->nw_dispatched++;
975             npwp->nw_handled++;
976             netisr_proto[proto].np_handler(m);
977             error = 0;
978             goto out_unlock;
979         }
980
(kgdb) list *0xffffffff80b71892
0xffffffff80b71892 is at /usr/src/sys/amd64/amd64/exception.S:238.
233         .type   calltrap,@function
234     calltrap:
235         movq    %rsp,%rdi
236         call    trap
237         MEXITCOUNT
238         jmp     doreti          /* Handle any pending ASTs */
239
240         /*
241          * alltraps_noen entry point.  Unlike alltraps above, we want to
242          * leave the interrupts disabled.  This corresponds to
(kgdb) list *0xffffffff80b8b8ba
0xffffffff80b8b8ba is in trap (/usr/src/sys/amd64/amd64/trap.c:447).
442
443         KASSERT(cold || td->td_ucred != NULL,
444             ("kernel trap doesn't have ucred"));
445         switch (type) {
446         case T_PAGEFLT:             /* page fault */
447             (void) trap_pfault(frame, FALSE);
448             goto out;
449
450         case T_DNA:
451             KASSERT(!PCB_USER_FPU(td->td_pcb),
(kgdb)
***********************************************************************************

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
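
The kgdb session in the message looks up a few frames by hand with `list *<address>`. As a rough sketch only (not part of the original message), the same lookups could be generated for every frame of a KDB stack backtrace. The function names `parse_frames` and `kgdb_list_commands` are hypothetical helpers introduced here for illustration, and the regular expression assumes frames of the form `#N 0xADDR at symbol+0xOFFSET` as printed above.

```python
import re

# Matches KDB backtrace frames such as
# "#7 0xffffffff807be1a2 at netisr_dispatch_src+0x62".
# The frame layout is assumed from the panic output quoted in the message.
FRAME_RE = re.compile(
    r"#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+([A-Za-z_][A-Za-z0-9_.]*)\+(0x[0-9a-fA-F]+)"
)


def parse_frames(text):
    """Return (index, address, symbol, offset) tuples for each frame found."""
    return [
        (int(n), addr, sym, int(off, 16))
        for n, addr, sym, off in FRAME_RE.findall(text)
    ]


def kgdb_list_commands(text, skip=()):
    """Build one 'list *ADDR' command per frame, skipping the named symbols."""
    return [
        f"list *{addr}"
        for _, addr, sym, _ in parse_frames(text)
        if sym not in skip
    ]


if __name__ == "__main__":
    # Three frames copied from the backtrace in the message.
    sample = (
        "#6 0xffffffff80b71892 at calltrap+0x8 "
        "#7 0xffffffff807be1a2 at netisr_dispatch_src+0x62 "
        "#8 0xffffffff808f89fa at ipoib_cm_handle_rx_wc+0x22a"
    )
    for cmd in kgdb_list_commands(sample, skip={"calltrap"}):
        print(cmd)
```

The printed commands can be pasted at the `(kgdb)` prompt. Skipping the panic and trap-handling frames (for example `calltrap`) keeps the output focused on the frames leading up to the fault.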
