From: Dustin Wenz
Date: Mon, 18 Jun 2012 16:35:58 -0500
To: freebsd-scsi@freebsd.org
Subject: Re: Marginal disks prevent boot with mps(4)
Message-Id: <1165F6D3-3207-4CEC-9D6C-4615FBEBE13A@ebureau.com>

What part of CAM would be responsible for managing disk
conditions such as this? I've looked through the cam(4) docs and some of
the configurable options, but none of them seems likely to help. It's
possible that I've overlooked something, but I'm not sure what.

It would be very helpful if there were a way to remove a device entry
using camcontrol without it hanging. That would at least allow me to
deal with these failures until a fix is found or created.

- .Dustin

On Jun 15, 2012, at 6:45 PM, Kyle Creyts wrote:

> Iirc, this is a camctl problem.
>
> Dustin Wenz wrote:
>
> I just received an SFF-8088->8087 cable via FedEx this morning, which
> allowed me to continue isolating this problem.
>
> What I discovered is that it makes no difference whether a bad disk is
> connected to an expander or directly to the HBA. So, if this is a
> hardware bug, it must be present in the LSI SAS2008-based HBA that I'm
> using. I also upgraded the firmware on the card from v11.00.00.00 to
> v13.00.57.00, which is the latest as far as I am aware. That did not
> seem to change the behavior.
>
> I did notice that earlier during startup, a page or so before the
> endless ioc messages start, I see this message:
>
> mps0: polling failed
> mpssas_get_sata_identify: poll for page completed with error 60
> _mapping_get_dev_info: failed to compute the hashed SAS address for SATA device with handle 0x0009
>
> It seems that the driver knows something is up, even before it gets
> stuck later on...
>
> So far, the only way I can get this configuration to boot is to change
> the status for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED to CAM_REQ_CMP_ERR,
> as Ken mentioned. With that change the machine still reports some
> "ioc terminated" messages, but the startup process no longer hangs
> indefinitely. However, I'm not sure what the implications of making
> that change on a production machine would be.
>
> If this is LSI's problem, I don't see why they would bother to fix it.
> As far as I know, they are the only 6Gb SAS/SATA HBA vendor that works
> on FreeBSD. We have no choice but to buy their stuff, even if it's not
> robust.
>
> - .Dustin
>
> On Jun 8, 2012, at 4:53 PM, Kenneth D. Merry wrote:
>
>> On Fri, Jun 08, 2012 at 16:25:31 -0500, Dustin Wenz wrote:
>>> I just installed a build of 9.0-STABLE in order to test the changes
>>> since release. I was hoping that some of the error-handling changes
>>> in mps would alter the behavior I've seen with some SATA disks
>>> (particularly, Seagate ST3000DM001 disks) connected through an LSI
>>> SAS 9201-16e HBA.
>>>
>>
>> Are you using an expander, or are the disks connected directly to the
>> HBA?
>>
>> What firmware version are you using on the HBA? Make sure you have the
>> latest firmware version on the card.
>>
>>> It is apparently possible for these disks to get into a state where
>>> their presence prevents the machine from booting. This problem has
>>> existed for some time, according to some archive-searching I've
>>> done, but there isn't much consensus on how to fix it.
>>>
>>> The disks are good enough that they can be probed at startup, but
>>> some part of initialization cannot complete. This is the message I
>>> see repeated forever upon boot (the probe number does change
>>> slightly):
>>>
>>> (probe14:mps0:0:14:0): INQUIRY. CDB: 12 0 0 0 24 0 length 36 SMID 215 terminated ioc 804b scsi 0 state c xfer 0
>>>
>>> There is a comment in mps_sas.c which suggests that this error is
>>> usually transient, but that seems not to be the case here. Can
>>> anyone suggest a modification that might permit booting in this
>>> state?
>>>
>>
>> There is not a lot that the driver can do in this case. The command is
>> getting terminated by the firmware in the HBA, and we really don't have a
>> lot of information to indicate why.
>>
>> You could change the status returned for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED
>> to CAM_REQ_CMP_ERR, and that would just mean that the probe for that disk
>> would eventually fail and the kernel would boot. CAM_REQUEUE_REQ tells
>> CAM to retry the command without decrementing the retry count. That is
>> why you aren't able to boot.
>>
>> If upgrading the HBA firmware doesn't fix the problem, I would suggest
>> contacting LSI support, and see if they can get additional diagnostics off
>> the board to figure out what the problem is.
>>
>> Ken
>> --
>> Kenneth Merry
>> ken@FreeBSD.ORG
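
For anyone reading this thread later, here is a rough, user-space sketch of the retry behavior Ken describes above. It is not the mps(4) code. The three numeric constants are, as far as I know, the values of MPI2_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE, MPI2_IOCSTATUS_MASK and MPI2_IOCSTATUS_SCSI_IOC_TERMINATED in the MPI 2.0 headers, which would explain the "ioc 804b" in the log line. Everything else (the sketch_status enum, ioc_status_code, probe_attempts, the retry budget) is hypothetical and only models "retry without decrementing" versus "retry and decrement" as stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Believed to match the MPI 2.0 header values; treat as assumptions. */
#define IOCSTATUS_FLAG_LOG_INFO_AVAILABLE 0x8000u
#define IOCSTATUS_MASK                    0x7FFFu
#define IOCSTATUS_SCSI_IOC_TERMINATED     0x004Bu

/* Hypothetical stand-ins for the two CAM statuses discussed above. */
enum sketch_status {
    SKETCH_REQUEUE_REQ, /* retry; retry budget is left untouched */
    SKETCH_REQ_CMP_ERR  /* retry; retry budget is decremented */
};

/* Strip the log-info flag from a raw IOC status word. */
static uint16_t
ioc_status_code(uint16_t raw)
{
    return (uint16_t)(raw & IOCSTATUS_MASK);
}

/*
 * Simulate a probe in which every attempt is terminated by the IOC.
 * Returns the attempt number on which the probe gives up, or -1 if it
 * is still retrying after max_attempts (our stand-in for "forever").
 */
static int
probe_attempts(enum sketch_status mapped, int retry_budget, int max_attempts)
{
    int attempts = 0;

    assert(retry_budget >= 0 && max_attempts > 0);
    while (attempts < max_attempts) {
        attempts++;
        if (mapped == SKETCH_REQ_CMP_ERR) {
            if (retry_budget == 0)
                return attempts; /* probe fails; boot can continue */
            retry_budget--;
        }
        /* SKETCH_REQUEUE_REQ: the budget never shrinks, so try again. */
    }
    return -1;
}
```

In this model, a retry budget of 4 with the CMP_ERR-style mapping gives up on the fifth attempt, while the REQUEUE-style mapping never gives up and hits the simulation cap. That matches the symptom: the INQUIRY line repeats indefinitely until the status mapping is changed.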