From owner-freebsd-scsi@FreeBSD.ORG  Tue Feb 28 23:15:42 2012
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3A974106564A
	for <freebsd-scsi@freebsd.org>; Tue, 28 Feb 2012 23:15:42 +0000 (UTC)
	(envelope-from jwd@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 1FD7B8FC0C
	for <freebsd-scsi@freebsd.org>; Tue, 28 Feb 2012 23:15:42 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q1SNFgKS050090
	for <freebsd-scsi@freebsd.org>; Tue, 28 Feb 2012 23:15:42 GMT
	(envelope-from jwd@freefall.freebsd.org)
Received: (from jwd@localhost)
	by freefall.freebsd.org (8.14.5/8.14.5/Submit) id q1SNFfkw050089
	for freebsd-scsi@freebsd.org; Tue, 28 Feb 2012 23:15:41 GMT
	(envelope-from jwd)
Date: Tue, 28 Feb 2012 23:15:41 +0000
From: John <jwd@freebsd.org>
To: freebsd-scsi@freebsd.org
Message-ID: <20120228231541.GA14736@FreeBSD.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.2.3i
Subject: System Hang / kernel panic with new mps driver
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 Feb 2012 23:15:42 -0000

Hi Folks,

   I realize there has been some discussion going on related to the new
mps driver recently. Today I caught a kernel panic when we lost a multipath
channel and then fixed it (replaced the cable).

   The complete dmesg from a system boot with hw.mps.debug_level=0xd (it's large):

http://people.freebsd.org/~jwd/mpsdmesg.txt

   Kernel is 9-stable: r231864M /usr/obj/usr/src.2012-02-17_01.26.11/sys/ZFS 

   When the cable was replaced, the system started throwing messages
to the console quite fast:

Feb 28 18:33:24 testsys kernel: mps1: mpssas_alloc_tm freezing simq
Feb 28 18:33:24 testsys kernel: mps1: timedout cm 0xffffff800136b240 allocated tm 0xffffff800132f528
Feb 28 18:33:24 testsys kernel: mps1: mpssas_scsiio_timeout checking sc 0xffffff800131a000 cm 0xffffff800135ee38
Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 776 completed timedout cm 0xffffff800136b240 ccb 0xfffffe00612f
4800 during recovery ioc 8048 s(noperiph:mps1:0:67:0): SMID 29 abort TaskMID 776 status 0x0 code 0x0 count 1
Feb 28 18:33:24 testsys kernel: (noperiph:mps1:0:67:0): SMID 29 finished recovery after aborting TaskMID 776
Feb 28 18:33:24 testsys kernel: mps1: mpssas_free_tm releasing simq
Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 623 command timeout cm 0xffffff800135ee38 ccb 0xfffffe12d189680
0
Feb 28 18:33:24 testsys kernel: mps1: mpssas_alloc_tm freezing simq
Feb 28 18:33:24 testsys kernel: mps1: timedout cm 0xffffff800135ee38 allocated tm 0xffffff800132f670
Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 623 completed timedout cm 0xf
Feb 28 18:33:24 testsys kernel: fffff800135ee38 ccb 0xfffffe12d1896800 during recovery ioc 8048 s(noperiph:mps1:0:67:0): SMID 30 abort TaskMID 623 status 0x0 code 0x0 count 1
Feb 28 18:33:24 testsys kernel: (noperiph:mps1:0:67:0): SMID 30 finished recovery after aborting TaskMID 623
Feb 28 18:33:24 testsys kernel: mps1: mpssas_free_tm releasing simq

   Many, many of the above.. eventually followed by:

Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0102 to persistent table because there is no free space available
Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0104 to persistent table because there is no free space available
Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0105 to persistent table because there is no free space available
...
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0132 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0133 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0134 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0135 to persistent table because there is no free space available

   more mps CDB messages, and then

Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0136 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0137 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0138 to persistent table because there is no free space available
...
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0150 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0151 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0152 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0153 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0155 to persistent table because there is no free space available

   At this point the kernel panic'd. (typed from a phone pic - sorry)

Fatal trap 12: page fault while in kernel mode
...
Stopped at xpt_compile_path+0x87: cmpl 0x50(%rbx),%r11d

bt
xpt_compile_path() at xpt_compile_path+0x87
xpt_create_path() at xpt_create_path+0x66
mps_rescan_done() at mpssas_rescan_done+0x16a
camisr_runqueue() at camisr_runqueue+0x63
intr_event_execute_handlers() at intr_event_execute_handlers+0x104
ithread_loop() at ithread_loop+0xa4
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe

Not sure if it makes a difference, but during a normal (clean boot), the above
'failed to add the device' messages show up 160 times. This doesn't appear
to be a #define value, but a 'fact' from the adapter. Since all the shelves
are dual attached and every disk shows up twice, could this make a difference
to the initialization logic?  (ses devices show up twice also).

Any thoughts are welcome. The system is available for testing.

Thanks,
John