Date: Wed, 16 Jun 2010 17:32:18 -0600
From: Scott Long <scottl@samsco.org>
To: Andrew Boyer <aboyer@averesystems.com>
Cc: freebsd-scsi@freebsd.org
Subject: Re: Overlapped Commands error
Message-ID: <C46A13B3-BFA7-4FD7-AD52-F0A60D6CF424@samsco.org>
In-Reply-To: <51DD9715-89B2-4058-A4FE-7097603013CC@averesystems.com>
References: <51DD9715-89B2-4058-A4FE-7097603013CC@averesystems.com>

On Jun 16, 2010, at 10:17 AM, Andrew Boyer wrote:

> Hello SCSI experts,
> We recently saw this SCSI command error:
>
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): READ(10). CDB: 28 0 2 c8 7f a0 0 0 20 0
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): CAM Status: SCSI Status Error
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): SCSI Status: Check Condition
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): ABORTED COMMAND asc:4e,0
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): Overlapped commands attempted field replaceable unit: 1
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): Retrying Command (per Sense Data)
>> Jun 15 15:08:37 eval12 kernel: mpt0: request 0xffffffff815d5c20:40101 timed out for ccb 0xffffff000d54d800 (req->ccb 0xffffff000d54d800)
>> Jun 15 15:08:37 eval12 kernel: mpt0: attempting to abort req 0xffffffff815d5c20:40101 function 0
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_wait_req(1) timed out
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_cam_event: 0x0
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_cam_event: 0x0
>> Jun 15 15:08:38 eval12 kernel: mpt0: completing timedout/aborted req 0xffffffff815d5c20:40101
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x16
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x12
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x16
>
> No one here has ever seen this before. We're using a CAM and MPT stack from August 2009 with an LSI1068e HBA connected to Seagate SAS HDDs.
>
> This is what the SCSI Architecture Manual (SAM-5 draft) has to say about overlapped commands:
>> [...]
>
> Can anyone point me to where in the stack the command identifier is assigned? I see where MPT assigns tags in target mode, but it's the initiator in this case. Any advice?

Don't want to step on Matt, but wanted to expand on what he's said so far.

CAM doesn't assign tag identifiers for initiator I/O; it leaves that up to the driver and hardware. The tag_id field that you see in CCBs is for target I/O only. In the case of MPT, the firmware assigns tags, while on simpler controllers like ESP the driver does it. CAM does provide the tag action message, i.e. SIMPLE, ORDERED, HEAD_OF_Q, and it's up to the driver to relay that to hardware, which MPT does in mpt_start().

The MPT architecture abstracts a lot of the transport protocol away, so it's generally assumed that it's going to do the right thing in a case like this. I don't know if the firmware is wrong, or if FreeBSD is wrong. CAM almost always attaches a SIMPLE action flag to I/O commands, and the MPT driver looks like it will faithfully translate that into the corresponding MPT flag. By looking at the inquiry data, it's roughly possible to determine if the device supports tagged queuing, so maybe CAM needs to be smarter about this. Instead of the TQ flag just affecting command scheduling, maybe it also needs to suppress attaching the SIMPLE action flag, and likewise the MPT driver should set an UNTAGGED flag to match.

I would expect the MPT firmware to look at the inquiry data and behave appropriately despite what might be sent in the MPT I/O request, but again, maybe that's asking too much. If you're adventurous, try modifying the MPT driver to always set the MPI_SCSIIO_CONTROL_UNTAGGED flag in mpt_start(), and see if that makes your problem go away.

>
> Also, is CAM doing the right thing by retrying? scsi_error_action() in cam/scsi/scsi_all.c always sets the retry bit on aborted commands, even though the spec quoted above makes it sound like this should be a fatal error ("This is considered a catastrophic failure on the part of the SCSI initiator device"). Should scsi_error_action() be looking at the Additional Sense Code?
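
For the tag-action relay described in the reply above, here is a minimal standalone sketch of the idea, not the actual mpt_start() code. The enum names (TAG_*, CTL_*) and both helper functions are hypothetical stand-ins invented for illustration; the real tag action codes and MPI_SCSIIO_CONTROL_* values live in the CAM and MPT headers and are not reproduced here. The only external fact assumed is that the CmdQue bit sits in bit 1 of byte 7 of the standard INQUIRY data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for CAM's tag actions (not the real constants). */
enum tag_action { TAG_NONE, TAG_SIMPLE, TAG_HEAD_OF_Q, TAG_ORDERED };

/* Hypothetical stand-ins for the MPT control flags (not the real values). */
enum ctl_flag { CTL_SIMPLEQ, CTL_HEADOFQ, CTL_ORDEREDQ, CTL_UNTAGGED };

/* CmdQue bit: bit 1 of byte 7 in standard INQUIRY data. */
#define INQ_CMDQUE 0x02

/* Returns nonzero if the INQUIRY data advertises tagged queuing. */
static int
supports_tagged_queuing(const uint8_t *inq, size_t len)
{
	assert(inq != NULL);
	return (len > 7 && (inq[7] & INQ_CMDQUE) != 0);
}

/*
 * Pick the control flag for a request. If the device does not support
 * tagged queuing, force UNTAGGED regardless of the requested tag action.
 */
static enum ctl_flag
pick_ctl_flag(enum tag_action action, int tq_ok)
{
	if (!tq_ok)
		return (CTL_UNTAGGED);
	switch (action) {
	case TAG_SIMPLE:
		return (CTL_SIMPLEQ);
	case TAG_HEAD_OF_Q:
		return (CTL_HEADOFQ);
	case TAG_ORDERED:
		return (CTL_ORDEREDQ);
	default:
		return (CTL_UNTAGGED);
	}
}
```

The experiment suggested above (always setting the untagged flag) corresponds to calling pick_ctl_flag() with tq_ok forced to 0 for every request.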
>

The error recovery code in CAM already cross references the ASC/ASCQ to an action table, but that table is often incomplete for uncommon edge cases. Try the following:

RCS file: /usr1/ncvs/src/sys/cam/scsi/scsi_all.c,v
retrieving revision 1.55.2.3
diff -u -r1.55.2.3 scsi_all.c
--- scsi_all.c	14 Feb 2010 19:38:27 -0000	1.55.2.3
+++ scsi_all.c	16 Jun 2010 23:31:47 -0000
@@ -1962,7 +1962,7 @@
 	{ SST(0x4D, 0xFF, SS_RDEF | SSQ_RANGE,
 	    NULL) },	/* Range 0x00->0xFF */
 	/* DTLPWROMAEBKVF */
-	{ SST(0x4E, 0x00, SS_RDEF,
+	{ SST(0x4E, 0x00, SS_FATAL | ENXIO,
 	    "Overlapped commands attempted") },
 	/* T */
 	{ SST(0x50, 0x00, SS_RDEF,

Scott
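
To illustrate what the patch above changes in spirit, here is a small standalone sketch of an ASC/ASCQ action-table lookup. It is not the scsi_all.c implementation: struct asc_entry, the ACT_* actions, asc_lookup() and sense_to_action() are hypothetical names made up for this example, and the fall-back-to-retry behaviour for missing entries is an assumption modelled on the retry-on-aborted-command behaviour discussed in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recovery actions (stand-ins for SS_RDEF / SS_FATAL). */
enum sense_action { ACT_RETRY_DEFAULT, ACT_FATAL };

/* One row of a hypothetical ASC/ASCQ action table. */
struct asc_entry {
	uint8_t asc;
	uint8_t ascq;
	enum sense_action action;
	int error;		/* errno reported when the action is fatal */
	const char *desc;
};

static const struct asc_entry asc_table[] = {
	{ 0x4E, 0x00, ACT_FATAL, ENXIO, "Overlapped commands attempted" },
};

/* Linear search for an exact ASC/ASCQ match; NULL if the table has none. */
static const struct asc_entry *
asc_lookup(uint8_t asc, uint8_t ascq)
{
	size_t i;

	for (i = 0; i < sizeof(asc_table) / sizeof(asc_table[0]); i++) {
		if (asc_table[i].asc == asc && asc_table[i].ascq == ascq)
			return (&asc_table[i]);
	}
	return (NULL);
}

/*
 * Map sense data to an action. Unknown codes fall back to the retry
 * default (an assumption of this sketch), which is why an incomplete
 * table leads to retries of errors that should be fatal.
 */
static enum sense_action
sense_to_action(uint8_t asc, uint8_t ascq, int *error)
{
	const struct asc_entry *e = asc_lookup(asc, ascq);

	if (e == NULL) {
		*error = 0;
		return (ACT_RETRY_DEFAULT);
	}
	*error = e->error;
	return (e->action);
}
```

With the single entry shown, sense_to_action(0x4E, 0x00, &err) yields ACT_FATAL with err set to ENXIO, while any code missing from the table is retried.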