From owner-freebsd-scsi@FreeBSD.ORG Tue Feb 28 23:15:42 2012 Return-Path: Delivered-To: freebsd-scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3A974106564A for ; Tue, 28 Feb 2012 23:15:42 +0000 (UTC) (envelope-from jwd@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 1FD7B8FC0C for ; Tue, 28 Feb 2012 23:15:42 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q1SNFgKS050090 for ; Tue, 28 Feb 2012 23:15:42 GMT (envelope-from jwd@freefall.freebsd.org) Received: (from jwd@localhost) by freefall.freebsd.org (8.14.5/8.14.5/Submit) id q1SNFfkw050089 for freebsd-scsi@freebsd.org; Tue, 28 Feb 2012 23:15:41 GMT (envelope-from jwd) Date: Tue, 28 Feb 2012 23:15:41 +0000 From: John To: freebsd-scsi@freebsd.org Message-ID: <20120228231541.GA14736@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.3i Subject: System Hang / kernel panic with new mps driver X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Feb 2012 23:15:42 -0000 Hi Folks, I realize there has been some discussion going on related to the new mps driver recently. Today I caught a kernel panic when we lost a multipath channel and then fixed it (replaced the cable). The complete dmesg from a system boot with hw.mps.debug_level=0xd (it's large): http://people.freebsd.org/~jwd/mpsdmesg.txt Kernel is 9-stable: r231864M /usr/obj/usr/src.2012-02-17_01.26.11/sys/ZFS When the cable was replaced, the system started throwing messages to the console quite fast: Feb 28 18:33:24 testsys kernel: mps1: mpssas_alloc_tm freezing simq Feb 28 18:33:24 testsys kernel: mps1: timedout cm 0xffffff800136b240 allocated tm 0xffffff800132f528 Feb 28 18:33:24 testsys kernel: mps1: mpssas_scsiio_timeout checking sc 0xffffff800131a000 cm 0xffffff800135ee38 Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 776 completed timedout cm 0xffffff800136b240 ccb 0xfffffe00612f 4800 during recovery ioc 8048 s(noperiph:mps1:0:67:0): SMID 29 abort TaskMID 776 status 0x0 code 0x0 count 1 Feb 28 18:33:24 testsys kernel: (noperiph:mps1:0:67:0): SMID 29 finished recovery after aborting TaskMID 776 Feb 28 18:33:24 testsys kernel: mps1: mpssas_free_tm releasing simq Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 623 command timeout cm 0xffffff800135ee38 ccb 0xfffffe12d189680 0 Feb 28 18:33:24 testsys kernel: mps1: mpssas_alloc_tm freezing simq Feb 28 18:33:24 testsys kernel: mps1: timedout cm 0xffffff800135ee38 allocated tm 0xffffff800132f670 Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 623 completed timedout cm 0xf Feb 28 18:33:24 testsys kernel: fffff800135ee38 ccb 0xfffffe12d1896800 during recovery ioc 8048 s(noperiph:mps1:0:67:0): SMID 30 abort TaskMID 623 status 0x0 code 0x0 count 1 Feb 28 18:33:24 testsys kernel: (noperiph:mps1:0:67:0): SMID 30 finished recovery after aborting TaskMID 623 Feb 28 18:33:24 testsys kernel: mps1: mpssas_free_tm releasing simq Many, many of the above.. eventually followed by: Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0102 to persistent table because there is no free space available Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0104 to persistent table because there is no free space available Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0105 to persistent table because there is no free space available ... Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0132 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0133 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0134 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0135 to persistent table because there is no free space available more mps CDB messages, and then Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0136 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0137 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0138 to persistent table because there is no free space available ... Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0150 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0151 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0152 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0153 to persistent table because there is no free space available Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0155 to persistent table because there is no free space available At this point the kernel panic'd. (typed from a phone pic - sorry) Fatal trap 12: page fault while in kernel mode ... Stopped at xpt_compile_path+0x87: cmpl 0x50(%rbx),%r11d bt xpt_compile_path() at xpt_compile_path+0x87 xpt_create_path() at xpt_create_path+0x66 mps_rescan_done() at mpssas_rescan_done+0x16a camisr_runqueue() at camisr_runqueue+0x63 intr_event_execute_handlers() at intr_event_execute_handlers+0x104 ithread_loop() at ithread_loop+0xa4 fork_exit() at fork_exit+0x11f fork_trampoline() at fork_trampoline+0xe Not sure if it makes a difference, but during a normal (clean boot), the above 'failed to add the device' messages show up 160 times. This doesn't appear to be a #define value, but a 'fact' from the adapter. Since all the shelves are dual attached and every disk shows up twice, could this make a difference to the initialization logic? (ses devices show up twice also). Any thoughts are welcome. The system is available for testing. Thanks, John