Date:      Thu, 21 Jan 2010 07:49:40 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        Stephane LAPIE <stephane.lapie@darkbsd.org>
Cc:        freebsd-hardware@freebsd.org
Subject:   Re: DELL SAS5/E Controller bug
Message-ID:  <201001210749.40575.jhb@freebsd.org>
In-Reply-To: <4B58008C.4050207@darkbsd.org>
References:  <4B56CD4C.80503@darkbsd.org> <201001201105.26367.jhb@freebsd.org> <4B58008C.4050207@darkbsd.org>

On Thursday 21 January 2010 2:21:48 am Stephane LAPIE wrote:
> John Baldwin wrote:
> > On Wednesday 20 January 2010 10:09:43 am Stephane LAPIE wrote:
> >> John Baldwin wrote:
> >>> On Wednesday 20 January 2010 4:30:52 am Stephane LAPIE wrote:
> >>>> Hello list,
> >>>>
> >>>> Basically I'm experiencing the same problem as described here :
> >>>> https://forums.freebsd.org/showthread.php?t=9407 (linking for reference)
> >>>>
> >>>> Drive disconnections are not recognized instantly, and instead I get
> >>>> the following dmesg entries :
> >>>> mpt0: mpt_cam_event: 0x16
> >>>> mpt0: mpt_cam_event: 0x16
> >>>>
> >>>> (Sometimes I also get "mpt0: mpt_cam_event: 0x12" events)
> >>>>
> >>>> This is really crippling, as it literally paralyzes the ZFS pool until
> >>>> the controller finally comes to its senses (...or until a disk gets
> >>>> replugged in, which provokes a flush of all the buffered failed SCSI
> >>>> requests).
> >>>>
> >>>> Hardware is recognized as :
> >>>> mpt0@pci0:6:8:0:	class=0x010000 card=0x1f041028 chip=0x00541000 rev=0x01 hdr=0x00
> >>>>     vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
> >>>>     device = 'SAS 3000 series, 8-port with 1068 -StorPort'
> >>>>     class = mass storage
> >>>>     subclass = SCSI
> >>>>
> >>>> Did anyone else experience this, or find a proper work-around ?
> >>> Invoke 'camcontrol rescan' after removing a drive.  mptutil(8) does the
> >>> equivalent when adding and removing volumes to make up for the driver
> >>> not automatically rescanning.
> >> I already tried reset/rescan via camcontrol, but after removing a drive, 
> >> the process freezes (process status "D", Ctrl+T in terminal shows it's 
> >> in a "cbwait" state, it can't be bg'ed). I did not wait for a hardware 
> >> timeout, I tried replugging the drive, which released the ZFS and 
> >> camcontrol locks.
> >>
> >>
> >> Also, I tried poking around with mptutil and could obtain the following 
> >> information, if it can be of any help :
> >>
> >> freebsd-r610# mptutil -u 0 show adapter
> >> mpt0 Adapter:
> >>         Board Name: SAS5e
> >>     Board Assembly:
> >>          Chip Name: C1068
> >>      Chip Revision: UNUSED
> >>        RAID Levels: none
> >> mptutil: Reading config page header failed: Invalid configuration page
> >>
> >> (The above error message should be normal since this is not a RAID 
> >> controller, though a bit jarring)
> > 
> > This patch should fix that:
> > 
> > Index: mpt_show.c
> > ===================================================================
> > --- mpt_show.c	(revision 202640)
> > +++ mpt_show.c	(working copy)
> > @@ -78,6 +78,7 @@
> >  	CONFIG_PAGE_MANUFACTURING_0 *man0;
> >  	CONFIG_PAGE_IOC_2 *ioc2;
> >  	CONFIG_PAGE_IOC_6 *ioc6;
> > +	U16 IOCStatus;
> >  	int fd, comma;
> >  
> >  	if (ac != 1) {
> > @@ -108,7 +109,7 @@
> >  
> >  	free(man0);
> >  
> > -	ioc2 = mpt_read_ioc_page(fd, 2, NULL);
> > +	ioc2 = mpt_read_ioc_page(fd, 2, &IOCStatus);
> >  	if (ioc2 != NULL) {
> >  		printf("      RAID Levels:");
> >  		comma = 0;
> > @@ -151,9 +152,10 @@
> >  			printf(" none");
> >  		printf("\n");
> >  		free(ioc2);
> > -	}
> > +	} else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
> > +		warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
> >  
> > -	ioc6 = mpt_read_ioc_page(fd, 6, NULL);
> > +	ioc6 = mpt_read_ioc_page(fd, 6, &IOCStatus);
> >  	if (ioc6 != NULL) {
> >  		display_stripe_map("    RAID0 Stripes",
> >  		    ioc6->SupportedStripeSizeMapIS);
> > @@ -172,7 +174,8 @@
> >  			printf("-%u", ioc6->MaxDrivesIME);
> >  		printf("\n");
> >  		free(ioc6);
> > -	}
> > +	} else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
> > +		warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
> >  
> >  	/* TODO: Add an ioctl to fetch IOC_FACTS and print firmware version. */
> >  
> > 
> >> However, the following is a bit disturbing :
> >>
> >> freebsd-r610# mptutil -u 0 show drives
> >> mpt0 Physical Drives:
> >>   da0 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 0
> >>   da1 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 1
> >>   da2 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 2
> >>   da3 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 3
> >>   da4 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 4
> >>   da5 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 5
> >>   da6 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 6
> >>   da7 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 7
> >>   da8 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 8
> >>   da9 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 9
> >> da10 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 10
> >> da11 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 11
> >> da12 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 12
> >> da13 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 13
> >> da14 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 14
> >> da15 (  136G) ONLINE <Dell VIRTUAL DISK 1028> SAS bus 0 id 0
> >>
> >> The above listing seems weird, as da15 should belong to mpt1.
> > 
> > Agreed.  I specifically ask that CAM only return results for devices on
> > bus 0 of mptX.  When I debugged this before, I used gdb and set a
> > breakpoint in mpt_fetch_disks() so I could examine the structures that
> > CAM returned.  This is the code that identifies mptX vs mpt<any>:
> > 
> > 		/* Match mptX bus 0. */
> > 		ccb.cdm.patterns[0].type = DEV_MATCH_BUS;
> > 		b = &ccb.cdm.patterns[0].pattern.bus_pattern;
> > 		snprintf(b->dev_name, sizeof(b->dev_name), "mpt");
> > 		b->unit_number = mpt_unit;
> > 		b->bus_id = 0;
> > 		b->flags = BUS_MATCH_NAME | BUS_MATCH_UNIT | BUS_MATCH_BUS_ID;
> > 
> > 'mpt_unit' is a global variable that is set to the value of the 'u'
> > parameter.
> > 
> >> freebsd-r610# mptutil -u 1 show drives
> >> mptutil: mpt_fetch_disks got wrong CAM matches
> >> mpt1 Physical Drives:
> >>     0 (  137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 1
> >>     1 (  137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 9
> > 
> > Similarly I would use gdb to examine the reply from CAM here to see why
> > it got 'wrong CAM matches'.  The code expects the first match to match
> > the bus and the next N matches should be 'daX' devices.
> > 
> 
> I just applied your patch to mptutil source, which now returns :
> 
> freebsd-r610# mptutil show adapter
> mpt0 Adapter:
>        Board Name: SAS5e
>    Board Assembly:
> 	Chip Name: C1068
>     Chip Revision: UNUSED
>       RAID Levels: none
> mptutil: mpt_read_ioc_page(2): Invalid configuration page

Gah, that should be the case that I ignore.  Can you replace the second 
warnx() call I added with this:

		warnx("mpt_read_ioc_page(6): %s (%x)", mpt_ioc_status(IOCStatus),
		    IOCStatus);
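
Printing the raw value next to the decoded string helps when a status that
decodes as "Invalid configuration page" still fails the equality test.  One
possible cause, offered here only as an assumption, is that flag bits are set
in the upper part of the 16-bit status word.  A minimal sketch of a masked
comparison follows; the constant values and helper names are illustrative
stand-ins, not copied from the MPI headers:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative values only; check the real MPI headers before relying on them. */
#define	EX_STATUS_MASK		0x7FFFu	/* assumed: low 15 bits hold the status code */
#define	EX_STATUS_INVALID_PAGE	0x0022u	/* assumed code for "invalid configuration page" */

/* Hypothetical helper: compare only the status code, ignoring any flag bits. */
int
ex_is_invalid_page(uint16_t status)
{

	return ((status & EX_STATUS_MASK) == EX_STATUS_INVALID_PAGE);
}

/* Warn about anything else and show the raw value so stray bits are visible. */
void
ex_report(const char *what, uint16_t status)
{

	if (!ex_is_invalid_page(status))
		fprintf(stderr, "%s: status 0x%x\n", what, (unsigned int)status);
}
```

If the hex value printed by the warnx() above has bits set outside the status
code, a masked comparison along these lines would be one way to ignore them.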

> I will give a try on the gdb thing once I get a chance of installing the
> source tree on this test machine.
> 
> 
> Also, I pasted the dmesg trace of trying to remove da0 and da6 and
> trying to have the system register the removal via a "camcontrol rescan 0" :
> 
> -> Unplugging "da0" and "da6" :
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
> 
> -> Then running "camcontrol rescan 0" (which leaves "cbwait" state and
> finishes at 187s real time)
> mpt0: request 0xffffff80005bcea0:5936 timed out for ccb
> 0xffffff00032d4000 (req->ccb 0xffffff00032d4000)
> mpt0: attempting to abort req 0xffffff80005bcea0:5936 function 0
> mpt0: mpt_wait_req(1) timed out
> mpt0: mpt_recover_commands: abort timed-out. Resetting controller
> mpt0: mpt_cam_event: 0x0
> mpt0: completing timedout/aborted req 0xffffff80005bcea0:5936
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
> (da0:mpt0:0:0:0): lost device
> (da0:mpt0:0:0:0): Synchronize cache failed, status == 0x4a, scsi status
> == 0x0
> (da0:mpt0:0:0:0): removing device entry
> (da6:mpt0:0:6:0): lost device
> (da6:mpt0:0:6:0): Synchronize cache failed, status == 0x4a, scsi status
> == 0x0
> (da6:mpt0:0:6:0): removing device entry
> 
> -> Then replugging the drive "da0" :
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16

I know that the rescan after removing a device is a bit messy (lots of 
messages before daX actually goes away), but I don't recall it taking such a 
long time.

> Is there any documentation or hint as to what those mpt_cam_event are ?
> I could whip myself a quick patch to at least change the display so one
> would figure what these are.
> 
> It feels like the 0x12 and 0x16 have to be handled to invalidate the
> device that has been unplugged so the next request won't timeout but
> fail directly.

The documentation is not public.  The 0x12 and 0x16 messages are events that
I have seen.  You can try talking to scottl@ as he has access to the docs.
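
In the meantime, the display change could be a small lookup table consulted
where the event code is logged.  This is a rough sketch, not the driver's
code: the function name is hypothetical and the labels are placeholders,
since only the codes 0x12 and 0x16 are known from this thread.

```c
#include <stddef.h>

/* One entry per event code we want to print by name. */
struct ex_event_name {
	unsigned int	 code;
	const char	*label;
};

/* Placeholder labels; replace with real names once the docs are available. */
static const struct ex_event_name ex_event_names[] = {
	{ 0x12, "event 0x12 (seen around drive unplug/replug)" },
	{ 0x16, "event 0x16 (seen around drive unplug/replug)" },
};

/*
 * Hypothetical helper: return a label for a known event code, or NULL so
 * the caller can fall back to printing the raw hex value as it does today.
 */
const char *
ex_event_label(unsigned int code)
{
	size_t i;

	for (i = 0; i < sizeof(ex_event_names) / sizeof(ex_event_names[0]); i++) {
		if (ex_event_names[i].code == code)
			return (ex_event_names[i].label);
	}
	return (NULL);
}
```

Returning NULL for unknown codes keeps the existing hex output as the default,
so codes missing from the table are still reported.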

-- 
John Baldwin
