Date: Thu, 6 Oct 2016 10:20:48 -0700 (PDT)
From: Robroy Gregg
To: freebsd-questions@freebsd.org
Subject: isp(4) QLE2462 initiator failure with 10.3-RELEASE

FreeBSD Friends,

I opened a FreeBSD Forums thread with this question on Monday. I'm sorry to duplicate the question in two places, yet I figured I might have better luck with being noticed by developers here on the mailing lists. Here's the thread:

https://forums.freebsd.org/threads/57923/

A chum and I have been setting up some FreeBSD 10.3-RELEASE servers at work, which access ZFS pools on Hitachi Modular and Enterprise family arrays. FreeBSD attaches to the Brocade fabric with a QLE2462 FC HBA, and sees four paths to each LU.
Here's a drawing of the basic idea:

http://www.robroygregg.com/misc/2016Oct03.PNG

The drawing leaves out a few more arrays (of the same types), and various switches in the fabric between the arrays and the two 6510s (in the drawing).

===== The Problem =====

The first FC HBA port, isp0, stopped working spontaneously after several weeks of uptime with light I/O. All LU paths automatically failed over to isp1, yet paths through isp0 remain non-functional even now.

The first sign of trouble appeared in /var/log/messages, followed by many more similar errors for other LU paths:

  isp0: Chan 0 Abort Cmd for N-Port 0x0005 @ Port 0x111300
  isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
  isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
  isp0: isp_watchdog: timeout for handle 0x6570200d
  (da5:isp0:0:4:1): FIN dl16384 resid 0 CDB=0x2a 0x00 0x03 0x51 0x1b 0xe5 0x00 0x00 0x20 0x00 STS 0x0 XS_ERR=0xb
  (da5:isp0:0:4:1): WRITE(10). CDB: 2a 00 03 51 1b e5 00 00 20 00
  (da5:isp0:0:4:1): CAM status: Command timeout
  (da5:isp0:0:4:1): Retrying command

These caused successful fail-overs to paths through isp1, which looked like this in /var/log/messages:

  (da5:isp0:0:4:1): Error 5, Retries exhausted
  GEOM_MULTIPATH: Error 5, da5 in 85040360_0999 marked FAIL
  GEOM_MULTIPATH: da17 is now active path in 85040360_0999

===== What I've already tried =====

* I tried manually failing back to paths through isp0 with commands like "gmultipath restore 66209_002E da2" followed by "gmultipath rotate 66209_002E". When I/Os are tried over isp0, it shows the same, original symptom (shown below in context), until it fails back to a path through isp1.

  GEOM_MULTIPATH: da3 in 66209_002E is marked OK.
  GEOM_MULTIPATH: da3 is now active path in 66209_002E
  isp0: Chan 0 Abort Cmd for N-Port 0x0004 @ Port 0x0e2000
  isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
  isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
  isp0: isp_watchdog: timeout for handle 0x65a7200d
  (da3:isp0:0:3:0): FIN dl2560 resid 0 CDB=0x2a 0x00 0x04 0x2a 0xa7 0x89 0x00 0x00 0x05 0x00 STS 0x0 XS_ERR=0xb
  (da3:isp0:0:3:0): WRITE(10). CDB: 2a 00 04 2a a7 89 00 00 05 00
  (da3:isp0:0:3:0): CAM status: Command timeout
  (da3:isp0:0:3:0): Retrying command

* I've tried failing over to every possible array target for an LU, over isp0; it was the same for each target.

* I've tried replacing every fiber optic cabling segment between the isp0 HBA port and the switch; the behavior was unchanged.

* I've tried physically swapping the isp0 and isp1 HBA port connections--the symptom stuck to isp0, even when its I/Os were being attempted through the physical connection formerly used (successfully) by isp1.

* I've tried disabling and re-enabling the Brocade switch port. When the port was enabled, it assumed the "In_Sync" state (instead of the "Online" state it shows when it's working):

  2 2 150200 id N4 In_Sync FC

===== Computer information =====

This is a Hitachi CR220H, which is based on an MSI S0051a motherboard.

===== FC HBA information =====

This is a QLE2462 at firmware level 8.01.02 and BIOS level 3.29. ispfw(4)'s being used, and claims to have successfully placed its own firmware on the card during boot, presumably over-riding the levels I flashed (mentioned here).
Related sysctls:

  # sysctl -a | grep dev.isp
  dev.isp.1.topo: 3
  dev.isp.1.loopstate: 9
  dev.isp.1.fwstate: 3
  dev.isp.1.linkstate: 1
  dev.isp.1.speed: 4
  dev.isp.1.role: 2
  dev.isp.1.gone_device_time: 30
  dev.isp.1.loop_down_limit: 60
  dev.isp.1.wwpn: 2378182195041974935
  dev.isp.1.wwnn: 2305843126027336343
  dev.isp.1.%parent: pci3
  dev.isp.1.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
  dev.isp.1.%location: pci0:3:0:1
  dev.isp.1.%driver: isp
  dev.isp.1.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
  dev.isp.0.topo: 3
  dev.isp.0.loopstate: 9
  dev.isp.0.fwstate: 3
  dev.isp.0.linkstate: 1
  dev.isp.0.speed: 4
  dev.isp.0.role: 2
  dev.isp.0.gone_device_time: 30
  dev.isp.0.loop_down_limit: 60
  dev.isp.0.wwpn: 2377900720063167127
  dev.isp.0.wwnn: 2305843126025239191
  dev.isp.0.%parent: pci3
  dev.isp.0.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
  dev.isp.0.%location: pci0:3:0:0
  dev.isp.0.%driver: isp
  dev.isp.0.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
  dev.isp.%parent:

===== FC switch information =====

Each FC HBA port's attached to a (separate) Brocade 6510 running FOS v7.4.1. The symptom's not specific to either of these switches (I tried swapping the connections around, and the symptom stuck to isp0).

===== Array information =====

LUs from both Hitachi Modular (AMS) and Enterprise (VSP) arrays are visible over the QLE2462. When this problem happens, the behavior's uniform for all array paths; the symptom's not specific to any one array, or array family.

===== What's happening now =====

I'm guessing that this problem would temporarily go away if I rebooted the computer, yet we won't be able to continue on with the project until we figure out what happened to isp0--we're afraid that it'll happen again, naturally at the most inopportune time possible. So the computer's still in its problem state now.

Thanks so very much!

Robroy

Robroy Gregg
Salinas, California
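
P.S. In case it helps when matching these ports against the switch's name server: the dev.isp.N.wwpn and dev.isp.N.wwnn sysctls above print the 64-bit names as decimal integers, while fabric tools usually show them as colon-separated hex. Here's a minimal Python sketch of the conversion. The helper name format_wwn is made up for illustration and isn't part of any FreeBSD tool; the numbers are copied from the sysctl listing.

```python
def format_wwn(value: int) -> str:
    """Render a 64-bit WWN, given as an integer, as colon-separated hex.

    format_wwn is a hypothetical helper name used only for this sketch.
    """
    if not 0 <= value < 2**64:
        raise ValueError("a WWN must fit in 64 bits")
    # Big-endian byte order puts the most significant byte first,
    # which is the order WWNs are conventionally written in.
    raw = value.to_bytes(8, "big")
    return ":".join(f"{byte:02x}" for byte in raw)


# Decimal values copied from the sysctl output in this message.
sysctl_values = {
    "dev.isp.0.wwpn": 2377900720063167127,
    "dev.isp.0.wwnn": 2305843126025239191,
    "dev.isp.1.wwpn": 2378182195041974935,
    "dev.isp.1.wwnn": 2305843126027336343,
}

for name, value in sysctl_values.items():
    print(f"{name}: {format_wwn(value)}")
```

The resulting strings can be compared by eye with whatever the switch reports for the two HBA ports.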