From owner-freebsd-scsi@FreeBSD.ORG  Fri Jun 15 23:06:54 2012
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A7A7B106566B
	for <freebsd-scsi@freebsd.org>; Fri, 15 Jun 2012 23:06:54 +0000 (UTC)
	(envelope-from dustinwenz@ebureau.com)
Received: from internet02.ebureau.com (internet02.ebureau.com [65.127.24.21])
	by mx1.freebsd.org (Postfix) with ESMTP id 68FEC8FC12
	for <freebsd-scsi@freebsd.org>; Fri, 15 Jun 2012 23:06:54 +0000 (UTC)
Received: from service02.office.ebureau.com (service02.office.ebureau.com
	[192.168.20.15])
	by internet02.ebureau.com (Postfix) with ESMTP id EF449CB4B61
	for <freebsd-scsi@freebsd.org>; Fri, 15 Jun 2012 18:06:47 -0500 (CDT)
Received: from localhost (localhost [127.0.0.1])
	by service02.office.ebureau.com (Postfix) with ESMTP id D53109F0C1D3
	for <freebsd-scsi@freebsd.org>; Fri, 15 Jun 2012 18:06:47 -0500 (CDT)
X-Virus-Scanned: amavisd-new at ebureau.com
Received: from service02.office.ebureau.com ([127.0.0.1])
	by localhost (service02.office.iscompanies.com [127.0.0.1])
	(amavisd-new, port 10024)
	with ESMTP id ZntZD03PMn4a for <freebsd-scsi@freebsd.org>;
	Fri, 15 Jun 2012 18:06:47 -0500 (CDT)
Received: from square.office.iscompanies.com (square.office.iscompanies.com
	[10.10.20.22])
	by service02.office.ebureau.com (Postfix) with ESMTPSA id 744B19F0C1C6
	for <freebsd-scsi@freebsd.org>; Fri, 15 Jun 2012 18:06:47 -0500 (CDT)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Apple Message framework v1257)
From: Dustin Wenz <dustinwenz@ebureau.com>
In-Reply-To: <20120608215326.GA83721@nargothrond.kdm.org>
Date: Fri, 15 Jun 2012 18:06:47 -0500
Content-Transfer-Encoding: quoted-printable
Message-Id: <551EFA9B-74F7-4CFC-954C-C9E0440E2BDC@ebureau.com>
References: <60F17E0E-EE4A-4F37-9925-055315B987B1@ebureau.com>
	<20120608215326.GA83721@nargothrond.kdm.org>
To: freebsd-scsi@freebsd.org
X-Mailer: Apple Mail (2.1257)
Subject: Re: Marginal disks prevent boot with mps(4)
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Jun 2012 23:06:54 -0000

I just received an SFF-8088->8087 cable via FedEx this morning, which
allowed me to continue isolating this problem.

What I discovered is that it makes no difference whether a bad disk is
connected to an expander or directly to the HBA. So, if this is a
hardware bug, it must be present in the LSI SAS2008-based HBA that I'm
using. I also upgraded the firmware on the card from v11.00.00.00 to
v13.00.57.00, which is the latest as far as I am aware. That did not
seem to change the behavior.

I did notice that earlier during startup, about a page before the
endless ioc messages start, I see these messages:
	mps0: polling failed
	mpssas_get_sata_identify: poll for page completed with error 60_mapping_get_dev
	info: failed to compute the hashed SAS address for SATA device with handle 0x0009

It seems that the driver knows something is wrong even before it gets
stuck later on.

So far, the only way I can get this configuration to boot is to change
the status returned for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED to
CAM_REQ_CMP_ERR, as Ken mentioned. With that change the machine still
reports some "ioc terminated" messages, but the startup process no
longer hangs indefinitely. However, I'm not sure what the implications
of making that change on a production machine would be.
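
For illustration only, the difference between the two status values can
be modeled in a small userland sketch. This is not the mps(4) or CAM
source: the enum, the function name, and the attempt cap are all
hypothetical, and the sketch assumes only what Ken described, namely
that CAM_REQUEUE_REQ retries without decrementing the retry count while
CAM_REQ_CMP_ERR lets the retries run out.

```c
/* Hypothetical stand-ins for the two CAM status values discussed above. */
enum sketch_status {
    SKETCH_REQUEUE_REQ,  /* models CAM_REQUEUE_REQ: retry, count untouched */
    SKETCH_REQ_CMP_ERR   /* models CAM_REQ_CMP_ERR: retry, count decremented */
};

/*
 * Model a probe whose command is terminated by the HBA firmware on every
 * attempt. Returns the number of attempts made before the probe gives up,
 * or -1 if attempt_cap was reached, meaning the loop would never have
 * ended on its own.
 */
int
sketch_probe_attempts(enum sketch_status on_terminated, int retry_count,
    int attempt_cap)
{
    int attempts = 0;

    while (attempts < attempt_cap) {
        attempts++;  /* issue the INQUIRY; the firmware terminates it */
        if (on_terminated == SKETCH_REQ_CMP_ERR) {
            if (retry_count == 0)
                return attempts;  /* retries exhausted: probe fails */
            retry_count--;
        }
        /* SKETCH_REQUEUE_REQ: retry again with the count unchanged */
    }
    return -1;
}
```

With a retry count of 4, the SKETCH_REQ_CMP_ERR case gives up after 5
attempts, while the SKETCH_REQUEUE_REQ case only stops because of the
artificial cap, which matches the endless probe messages at boot.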

If this is LSI's problem, I don't see why they would bother to fix it.
As far as I know, they are the only 6Gb/s SAS/SATA HBA vendor whose
cards work on FreeBSD. We have no choice but to buy their stuff, even if
it's not robust.

	- .Dustin

On Jun 8, 2012, at 4:53 PM, Kenneth D. Merry wrote:

> On Fri, Jun 08, 2012 at 16:25:31 -0500, Dustin Wenz wrote:
>> I just installed a build of 9.0-STABLE in order to test the changes
>> since release. I was hoping that some of the error-handling in mps
>> would alter the behavior I've seen with some SATA disks (particularly,
>> Seagate ST3000DM001 disks) connected through an LSI SAS 9201-16e HBA.
>>
>
> Are you using an expander, or are the disks connected directly to the
> HBA?
>
> What firmware version are you using on the HBA?  Make sure you have the
> latest firmware version on the card.
>
>> It is apparently possible for these disks to get in a state where
>> their presence prevents the machine from booting. This problem has
>> existed for some time, according to some archive-searching I've done,
>> but there isn't much consensus on how to fix it.
>>
>> The disks are good enough that they can be probed at startup, but
>> some part of initialization cannot complete. This is the message I see
>> repeated forever upon boot (the probe number does change slightly):
>>
>> 	(probe14:mps0:0:14:0): INQUIRY. CDB: 12 0 0 0 24 0 length 36 SMID 215 terminated ioc 804b scsi 0 state c xfer 0
>>
>> There is a comment in mps_sas.c which suggests that this error is
>> usually transient, but that seems not to be the case here. Can anyone
>> suggest a modification that might permit booting in this state?
>>
>
> There is not a lot that the driver can do in this case.  The command is
> getting terminated by the firmware in the HBA, and we really don't have a
> lot of information to indicate why.
>
> You could change the status returned for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED
> to CAM_REQ_CMP_ERR, and that would just mean that the probe for that disk
> would eventually fail and the kernel would boot.  CAM_REQUEUE_REQ tells
> CAM to retry the command without decrementing the retry count.  That is
> why you aren't able to boot.
>
> If upgrading the HBA firmware doesn't fix the problem, I would suggest
> contacting LSI support, and see if they can get additional diagnostics off
> the board to figure out what the problem is.
>
> Ken
> -- 
> Kenneth Merry
> ken@FreeBSD.ORG
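
As a footnote to the "terminated ioc 804b" probe message quoted above,
the following is a minimal sketch of how that value could be split into
a flag bit and a status code. It assumes the "ioc" field is the raw
16-bit IOCStatus from the reply frame. The SKETCH_ constants are local
stand-ins that mirror the values I believe the MPI 2.0 header (mpi2.h)
uses for MPI2_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE, MPI2_IOCSTATUS_MASK,
and MPI2_IOCSTATUS_SCSI_IOC_TERMINATED; check them against the header
before relying on them. The helper names are hypothetical.

```c
#include <stdint.h>

/* Local stand-ins; assumed to match the corresponding mpi2.h values. */
#define SKETCH_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE 0x8000u
#define SKETCH_IOCSTATUS_MASK                    0x7FFFu
#define SKETCH_IOCSTATUS_SCSI_IOC_TERMINATED     0x004Bu

/* Strip the flag bit and return only the status code. */
static inline uint16_t
sketch_iocstatus_code(uint16_t raw)
{
    return (uint16_t)(raw & SKETCH_IOCSTATUS_MASK);
}

/* Return 1 if the firmware flagged that log info accompanies the reply. */
static inline int
sketch_iocstatus_has_loginfo(uint16_t raw)
{
    return (raw & SKETCH_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE) != 0;
}
```

Under those assumptions, 0x804b decodes to status code 0x004B (the
"IOC terminated" case) with the log-info flag set.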