Date:      Wed, 20 Jan 2010 11:05:26 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        Stephane LAPIE <stephane.lapie@darkbsd.org>
Cc:        freebsd-hardware@freebsd.org
Subject:   Re: DELL SAS5/E Controller bug
Message-ID:  <201001201105.26367.jhb@freebsd.org>
In-Reply-To: <4B571CB7.3020303@darkbsd.org>
References:  <4B56CD4C.80503@darkbsd.org> <201001200848.16874.jhb@freebsd.org> <4B571CB7.3020303@darkbsd.org>

On Wednesday 20 January 2010 10:09:43 am Stephane LAPIE wrote:
> John Baldwin wrote:
> > On Wednesday 20 January 2010 4:30:52 am Stephane LAPIE wrote:
> >> Hello list,
> >>
> >> Basically I'm experiencing the same problem as described here :
> >> https://forums.freebsd.org/showthread.php?t=9407 (linking for reference)
> >>
> >> Drive disconnections are not recognized instantly, and instead I get
> >> the following dmesg entries :
> >> mpt0: mpt_cam_event: 0x16
> >> mpt0: mpt_cam_event: 0x16
> >>
> >> (Sometimes I also get "mpt0: mpt_cam_event: 0x12" events)
> >>
> >> This is really crippling as this literally paralyzes the ZFS pool until
> >> the controller finally comes to its senses (...or until a disk gets
> >> replugged in, which provokes a flush of all the buffered failed SCSI
> >> requests).
> >>
> >> Hardware is recognized as :
> >> mpt0@pci0:6:8:0:	class=0x010000 card=0x1f041028 chip=0x00541000 rev=0x01
> >> hdr=0x00
> >>     vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
> >>     device = 'SAS 3000 series, 8-port with 1068 -StorPort'
> >>     class = mass storage
> >>     subclass = SCSI
> >>
> >> Did anyone else experience this, or find a proper work-around ?
> > 
> > Invoke 'camcontrol rescan' after removing a drive.  mptutil(8) does the 
> > equivalent when adding and removing volumes to make up for the driver not 
> > automatically rescanning.
> 
> I already tried reset/rescan via camcontrol, but after removing a drive, 
> the process freezes (process status "D", Ctrl+T in terminal shows it's 
> in a "cbwait" state, it can't be bg'ed). I did not wait for a hardware 
> timeout, I tried replugging the drive, which released the ZFS and 
> camcontrol locks.
> 
> 
> Also, I tried poking around with mptutil and could obtain the following 
> information, if it can be of any help :
> 
> freebsd-r610# mptutil -u 0 show adapter
> mpt0 Adapter:
>         Board Name: SAS5e
>     Board Assembly:
>          Chip Name: C1068
>      Chip Revision: UNUSED
>        RAID Levels: none
> mptutil: Reading config page header failed: Invalid configuration page
> 
> (The above error message should be normal since this is not a RAID 
> controller, though a bit jarring)

This patch should fix that:

Index: mpt_show.c
===================================================================
--- mpt_show.c	(revision 202640)
+++ mpt_show.c	(working copy)
@@ -78,6 +78,7 @@
 	CONFIG_PAGE_MANUFACTURING_0 *man0;
 	CONFIG_PAGE_IOC_2 *ioc2;
 	CONFIG_PAGE_IOC_6 *ioc6;
+	U16 IOCStatus;
 	int fd, comma;
 
 	if (ac != 1) {
@@ -108,7 +109,7 @@
 
 	free(man0);
 
-	ioc2 = mpt_read_ioc_page(fd, 2, NULL);
+	ioc2 = mpt_read_ioc_page(fd, 2, &IOCStatus);
 	if (ioc2 != NULL) {
 		printf("      RAID Levels:");
 		comma = 0;
@@ -151,9 +152,10 @@
 			printf(" none");
 		printf("\n");
 		free(ioc2);
-	}
+	} else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
+		warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
 
-	ioc6 = mpt_read_ioc_page(fd, 6, NULL);
+	ioc6 = mpt_read_ioc_page(fd, 6, &IOCStatus);
 	if (ioc6 != NULL) {
 		display_stripe_map("    RAID0 Stripes",
 		    ioc6->SupportedStripeSizeMapIS);
@@ -172,7 +174,8 @@
 			printf("-%u", ioc6->MaxDrivesIME);
 		printf("\n");
 		free(ioc6);
-	}
+	} else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
+		warnx("mpt_read_ioc_page(6): %s", mpt_ioc_status(IOCStatus));
 
 	/* TODO: Add an ioctl to fetch IOC_FACTS and print firmware version. */
 

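The idea behind the patch is that a missing page is only worth a warning when
the status is something other than "invalid page", which is the expected
answer from a non-RAID controller.  A minimal standalone sketch of that
decision follows.  The names ex_should_warn and EX_IOCSTATUS_* are made up
for illustration, and the numeric values are placeholders rather than the
real MPI definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder status codes; the real constants live in the MPI headers. */
#define EX_IOCSTATUS_SUCCESS             0x0000
#define EX_IOCSTATUS_BUSY                0x0002
#define EX_IOCSTATUS_CONFIG_INVALID_PAGE 0x0022

/*
 * Decide whether a failed page read deserves a warning.
 * 'page' is the pointer returned by a (hypothetical) page reader and
 * 'status' is the status it stored on failure.
 * Returns 1 if the caller should warn, 0 if it should stay silent.
 */
static int
ex_should_warn(const void *page, uint16_t status)
{
	/* The read succeeded, so there is nothing to report. */
	if (page != NULL)
		return (0);
	/* The page does not exist on this controller; expected, stay quiet. */
	if (status == EX_IOCSTATUS_CONFIG_INVALID_PAGE)
		return (0);
	/* Any other failure is unexpected and should be surfaced. */
	return (1);
}
```

With this shape, a controller without RAID support prints nothing extra,
while a real failure (for example a busy status) still gets reported.
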
> However, the following is a bit disturbing :
> 
> freebsd-r610# mptutil -u 0 show drives
> mpt0 Physical Drives:
>   da0 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 0
>   da1 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 1
>   da2 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 2
>   da3 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 3
>   da4 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 4
>   da5 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 5
>   da6 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 6
>   da7 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 7
>   da8 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 8
>   da9 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 9
> da10 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 10
> da11 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 11
> da12 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 12
> da13 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 13
> da14 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 14
> da15 (  136G) ONLINE <Dell VIRTUAL DISK 1028> SAS bus 0 id 0
> 
> The above listing seems weird, as da15 should belong to mpt1.

Agreed.  I specifically ask CAM to return results only for devices on bus 0
of mptX.  When I debugged this before, I used gdb and set a breakpoint in
mpt_fetch_disks() so I could examine the structures that CAM returned.  This
is the code that selects mptX rather than mpt<any>:

		/* Match mptX bus 0. */
		ccb.cdm.patterns[0].type = DEV_MATCH_BUS;
		b = &ccb.cdm.patterns[0].pattern.bus_pattern;
		snprintf(b->dev_name, sizeof(b->dev_name), "mpt");
		b->unit_number = mpt_unit;
		b->bus_id = 0;
		b->flags = BUS_MATCH_NAME | BUS_MATCH_UNIT | BUS_MATCH_BUS_ID;

'mpt_unit' is a global variable that is set to the value of the '-u'
option.
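
To show what that pattern is meant to select, here is a small standalone
model of the filter.  It is only a sketch: the ex_* types and flags are
hypothetical stand-ins, not the real CAM structures, and it says nothing
about how CAM evaluates patterns internally.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the bus match flags. */
enum {
	EX_MATCH_NAME   = 1 << 0,
	EX_MATCH_UNIT   = 1 << 1,
	EX_MATCH_BUS_ID = 1 << 2
};

/* A bus as the model sees it: driver name, unit and bus number. */
struct ex_bus {
	char     name[16];
	unsigned unit;
	unsigned bus_id;
};

/* A pattern: the same fields plus flags saying which ones to compare. */
struct ex_bus_pattern {
	char     name[16];
	unsigned unit;
	unsigned bus_id;
	unsigned flags;
};

/* Build the "mptX bus 0" pattern for a given unit number. */
static struct ex_bus_pattern
ex_make_pattern(unsigned unit)
{
	struct ex_bus_pattern p;

	memset(&p, 0, sizeof(p));
	snprintf(p.name, sizeof(p.name), "mpt");
	p.unit = unit;
	p.bus_id = 0;
	p.flags = EX_MATCH_NAME | EX_MATCH_UNIT | EX_MATCH_BUS_ID;
	return (p);
}

/* Return 1 if every field selected by the flags agrees, else 0. */
static int
ex_bus_matches(const struct ex_bus_pattern *p, const struct ex_bus *b)
{
	assert(p != NULL && b != NULL);
	if ((p->flags & EX_MATCH_NAME) && strcmp(p->name, b->name) != 0)
		return (0);
	if ((p->flags & EX_MATCH_UNIT) && p->unit != b->unit)
		return (0);
	if ((p->flags & EX_MATCH_BUS_ID) && p->bus_id != b->bus_id)
		return (0);
	return (1);
}
```

In this model a pattern built for unit 0 accepts mpt0 bus 0 and rejects
mpt1 bus 0.  If the unit flag were ignored, mpt1 would be accepted as well,
which is the kind of thing worth checking for in gdb.
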

> freebsd-r610# mptutil -u 1 show drives
> mptutil: mpt_fetch_disks got wrong CAM matches
> mpt1 Physical Drives:
>     0 (  137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 1
>     1 (  137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 9

Similarly, I would use gdb to examine the reply from CAM here to see why
it got 'wrong CAM matches'.  The code expects the first match to be the
bus and the next N matches to be 'daX' devices.
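
The expectation described above can be written down as a small check.  This
is a sketch with hypothetical types (struct ex_result and friends), not the
actual code in mpt_fetch_disks(), but it may help when comparing against
what gdb shows:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result kinds standing in for CAM's match result types. */
enum ex_match_type { EX_RESULT_BUS, EX_RESULT_PERIPH };

struct ex_result {
	enum ex_match_type type;
	char periph_name[16];	/* only meaningful for EX_RESULT_PERIPH */
};

/*
 * Return 1 if the reply has the expected shape: entry 0 is a bus match
 * and every following entry is a peripheral named "da".  Return 0 for
 * anything else, including an empty reply.
 */
static int
ex_matches_well_formed(const struct ex_result *r, size_t n)
{
	size_t i;

	assert(r != NULL || n == 0);
	if (n == 0 || r[0].type != EX_RESULT_BUS)
		return (0);
	for (i = 1; i < n; i++) {
		if (r[i].type != EX_RESULT_PERIPH)
			return (0);
		if (strcmp(r[i].periph_name, "da") != 0)
			return (0);
	}
	return (1);
}
```

A reply that starts with a peripheral, or that contains a peripheral other
than "da" after the bus entry, fails this check.
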

-- 
John Baldwin


