Date: Tue, 28 Feb 2012 23:15:41 +0000 From: John <jwd@freebsd.org> To: freebsd-scsi@freebsd.org Subject: System Hang / kernel panic with new mps driver Message-ID: <20120228231541.GA14736@FreeBSD.org>
next in thread | raw e-mail | index | archive | help
Hi Folks, I realize there has been some discussion going on related to the new mps driver recently. Today I caught a kernel panic when we lost a multipath channel and then fixed it (replaced the cable). The complete dmesg from a system boot with hw.mps.debug_level=0xd (it's large): http://people.freebsd.org/~jwd/mpsdmesg.txt Kernel is 9-stable: r231864M /usr/obj/usr/src.2012-02-17_01.26.11/sys/ZFS When the cable was replaced, the system started throwing messages to the console quite fast: Feb 28 18:33:24 testsys kernel: mps1: mpssas_alloc_tm freezing simq Feb 28 18:33:24 testsys kernel: mps1: timedout cm 0xffffff800136b240 allocated tm 0xffffff800132f528 Feb 28 18:33:24 testsys kernel: mps1: mpssas_scsiio_timeout checking sc 0xffffff800131a000 cm 0xffffff800135ee38 Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 776 completed timedout cm 0xffffff800136b240 ccb 0xfffffe00612f 4800 during recovery ioc 8048 s(noperiph:mps1:0:67:0): SMID 29 abort TaskMID 776 status 0x0 code 0x0 count 1 Feb 28 18:33:24 testsys kernel: (noperiph:mps1:0:67:0): SMID 29 finished recovery after aborting TaskMID 776 Feb 28 18:33:24 testsys kernel: mps1: mpssas_free_tm releasing simq Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 623 command timeout cm 0xffffff800135ee38 ccb 0xfffffe12d189680 0 Feb 28 18:33:24 testsys kernel: mps1: mpssas_alloc_tm freezing simq Feb 28 18:33:24 testsys kernel: mps1: timedout cm 0xffffff800135ee38 allocated tm 0xffffff800132f670 Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 623 completed timedout cm 0xf Feb 28 18:33:24 testsys kernel: fffff800135ee38 ccb 0xfffffe12d1896800 during recovery ioc 8048 s(noperiph:mps1:0:67:0): SMID 30 abort TaskMID 623 status 0x0 code 0x0 count 1 Feb 28 18:33:24 testsys kernel: (noperiph:mps1:0:67:0): SMID 30 finished recovery after aborting TaskMID 623 Feb 28 18:33:24 testsys kernel: mps1: mpssas_free_tm releasing simq Many, many of the above.. eventually followed by: Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0102 to persistent table because there is no free space available Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0104 to persistent table because there is no free space available Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0105 to persistent table because there is no free space available ... Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0132 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0133 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0134 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0135 to persistent table because there is no free space available more mps CDB messages, and then Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0136 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0137 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0138 to persistent table because there is no free space available ... Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0150 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0151 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0152 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0153 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0155 to persistent table because there is no free space available At this point the kernel panic'd. (typed from a phone pic - sorry) Fatal trap 12: page fault while in kernel mode ... Stopped at xpt_compile_path+0x87: cmpl 0x50(%rbx),%r11d bt xpt_compile_path() at xpt_compile_path+0x87 xpt_create_path() at xpt_create_path+0x66 mps_rescan_done() at mpssas_rescan_done+0x16a camisr_runqueue() at camisr_runqueue+0x63 intr_event_execute_handlers() at intr_event_execute_handlers+0x104 ithread_loop() at ithread_loop+0xa4 fork_exit() at fork_exit+0x11f fork_trampoline() at fork_trampoline+0xe Not sure if it makes a difference, but during a normal (clean boot), the above 'failed to add the device' messages show up 160 times. This doesn't appear to be a #define value, but a 'fact' from the adapter. Since all the shelves are dual attached and every disk shows up twice, could this make a difference to the initialization logic? (ses devices show up twice also). Any thoughts are welcome. The system is available for testing. Thanks, John
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120228231541.GA14736>