Date: Thu, 21 Jan 2010 07:49:40 -0500
From: John Baldwin <jhb@freebsd.org>
To: Stephane LAPIE <stephane.lapie@darkbsd.org>
Cc: freebsd-hardware@freebsd.org
Subject: Re: DELL SAS5/E Controller bug
Message-ID: <201001210749.40575.jhb@freebsd.org>
In-Reply-To: <4B58008C.4050207@darkbsd.org>
References: <4B56CD4C.80503@darkbsd.org> <201001201105.26367.jhb@freebsd.org> <4B58008C.4050207@darkbsd.org>

On Thursday 21 January 2010 2:21:48 am Stephane LAPIE wrote:
> John Baldwin wrote:
> > On Wednesday 20 January 2010 10:09:43 am Stephane LAPIE wrote:
> >> John Baldwin wrote:
> >>> On Wednesday 20 January 2010 4:30:52 am Stephane LAPIE wrote:
> >>>> Hello list,
> >>>>
> >>>> Basically I'm experiencing the same problem as described here :
> >>>> https://forums.freebsd.org/showthread.php?t=9407 (linking for reference)
> >>>>
> >>>> Drive disconnections are not recognized instantly, and instead I get
> >>>> the following dmesg entries :
> >>>> mpt0: mpt_cam_event: 0x16
> >>>> mpt0: mpt_cam_event: 0x16
> >>>>
> >>>> (Sometimes I also get "mpt0: mpt_cam_event: 0x12" events)
> >>>>
> >>>> This is really crippling as this literally paralyzes the ZFS pool until
> >>>> the controller finally comes to its senses (...or until a disk gets
> >>>> replugged in, which provokes a flush of all the buffered failed SCSI
> >>>> requests).
> >>>>
> >>>> Hardware is recognized as :
> >>>> mpt0@pci0:6:8:0: class=0x010000 card=0x1f041028 chip=0x00541000 rev=0x01
> >>>> hdr=0x00
> >>>>     vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
> >>>>     device = 'SAS 3000 series, 8-port with 1068 -StorPort'
> >>>>     class = mass storage
> >>>>     subclass = SCSI
> >>>>
> >>>> Did anyone else experience this, or find a proper work-around ?
> >>> Invoke 'camcontrol rescan' after removing a drive. mptutil(8) does the
> >>> equivalent when adding and removing volumes to make up for the driver not
> >>> automatically rescanning.
> >> I already tried reset/rescan via camcontrol, but after removing a drive,
> >> the process freezes (process status "D", Ctrl+T in terminal shows it's
> >> in a "cbwait" state, it can't be bg'ed). I did not wait for a hardware
> >> timeout, I tried replugging the drive, which released the ZFS and
> >> camcontrol locks.
> >>
> >>
> >> Also, I tried poking around with mptutil and could obtain the following
> >> information, if it can be of any help :
> >>
> >> freebsd-r610# mptutil -u 0 show adapter
> >> mpt0 Adapter:
> >>     Board Name: SAS5e
> >>     Board Assembly:
> >>     Chip Name: C1068
> >>     Chip Revision: UNUSED
> >>     RAID Levels: none
> >> mptutil: Reading config page header failed: Invalid configuration page
> >>
> >> (The above error message should be normal since this is not a RAID
> >> controller, though a bit jarring)
> >
> > This patch should fix that:
> >
> > Index: mpt_show.c
> > ===================================================================
> > --- mpt_show.c (revision 202640)
> > +++ mpt_show.c (working copy)
> > @@ -78,6 +78,7 @@
> >      CONFIG_PAGE_MANUFACTURING_0 *man0;
> >      CONFIG_PAGE_IOC_2 *ioc2;
> >      CONFIG_PAGE_IOC_6 *ioc6;
> > +    U16 IOCStatus;
> >      int fd, comma;
> >
> >      if (ac != 1) {
> > @@ -108,7 +109,7 @@
> >
> >      free(man0);
> >
> > -    ioc2 = mpt_read_ioc_page(fd, 2, NULL);
> > +    ioc2 = mpt_read_ioc_page(fd, 2, &IOCStatus);
> >      if (ioc2 != NULL) {
> >          printf(" RAID Levels:");
> >          comma = 0;
> > @@ -151,9 +152,10 @@
> >              printf(" none");
> >          printf("\n");
> >          free(ioc2);
> > -    }
> > +    } else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
> > +        warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
> >
> > -    ioc6 = mpt_read_ioc_page(fd, 6, NULL);
> > +    ioc6 = mpt_read_ioc_page(fd, 6, &IOCStatus);
> >      if (ioc6 != NULL) {
> >          display_stripe_map(" RAID0 Stripes",
> >              ioc6->SupportedStripeSizeMapIS);
> > @@ -172,7 +174,8 @@
> >              printf("-%u", ioc6->MaxDrivesIME);
> >          printf("\n");
> >          free(ioc6);
> > -    }
> > +    } else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
> > +        warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
> >
> >      /* TODO: Add an ioctl to fetch IOC_FACTS and print firmware version. */
> >
> >
> >> However, the following is a bit disturbing :
> >>
> >> freebsd-r610# mptutil -u 0 show drives
> >> mpt0 Physical Drives:
> >> da0 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 0
> >> da1 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 1
> >> da2 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 2
> >> da3 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 3
> >> da4 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 4
> >> da5 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 5
> >> da6 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 6
> >> da7 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 7
> >> da8 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 8
> >> da9 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 9
> >> da10 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 10
> >> da11 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 11
> >> da12 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 12
> >> da13 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 13
> >> da14 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 14
> >> da15 ( 136G) ONLINE <Dell VIRTUAL DISK 1028> SAS bus 0 id 0
> >>
> >> The above listing seems weird, as da15 should belong to mpt1.
> >
> > Agreed. I specifically ask that CAM only return results for devices on bus 0
> > of mptX. Before when I debugged this I used gdb and set a breakpoint in
> > mpt_fetch_disks() so I could examine the structures that CAM returned. This
> > is the code that identifies mptX vs mpt<any>:
> >
> >     /* Match mptX bus 0. */
> >     ccb.cdm.patterns[0].type = DEV_MATCH_BUS;
> >     b = &ccb.cdm.patterns[0].pattern.bus_pattern;
> >     snprintf(b->dev_name, sizeof(b->dev_name), "mpt");
> >     b->unit_number = mpt_unit;
> >     b->bus_id = 0;
> >     b->flags = BUS_MATCH_NAME | BUS_MATCH_UNIT | BUS_MATCH_BUS_ID;
> >
> > 'mpt_unit' is a global variable that is set to the value of the 'u'
> > parameter.
> >
> >> freebsd-r610# mptutil -u 1 show drives
> >> mptutil: mpt_fetch_disks got wrong CAM matches
> >> mpt1 Physical Drives:
> >> 0 ( 137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 1
> >> 1 ( 137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 9
> >
> > Similarly I would use gdb to examine the reply from CAM here to see why
> > it got 'wrong CAM matches'. The code expects the first match to match
> > the bus and the next N matches should be 'daX' devices.
> >
>
> I just applied your patch to mptutil source, which now returns :
>
> freebsd-r610# mptutil show adapter
> mpt0 Adapter:
>     Board Name: SAS5e
>     Board Assembly:
>     Chip Name: C1068
>     Chip Revision: UNUSED
>     RAID Levels: none
> mptutil: mpt_read_ioc_page(2): Invalid configuration page

Gah, that should be the case that I ignore. Can you replace the second
warnx() call I added with this:

    warnx("mpt_read_ioc_page(6): %s (%x)", mpt_ioc_status(IOCStatus),
        IOCStatus);

> I will give a try on the gdb thing once I get a chance of installing the
> source tree on this test machine.
>
>
> Also, I pasted the dmesg trace of trying to remove da0 and da6 and
> trying to have the system register the removal via a "camcontrol rescan 0" :
>
> -> Unplugging "da0" and "da6" :
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
>
> -> Then running "camcontrol rescan 0" (which leaves "cbwait" state and
> finishes at 187s real time)
> mpt0: request 0xffffff80005bcea0:5936 timed out for ccb
> 0xffffff00032d4000 (req->ccb 0xffffff00032d4000)
> mpt0: attempting to abort req 0xffffff80005bcea0:5936 function 0
> mpt0: mpt_wait_req(1) timed out
> mpt0: mpt_recover_commands: abort timed-out. Resetting controller
> mpt0: mpt_cam_event: 0x0
> mpt0: completing timedout/aborted req 0xffffff80005bcea0:5936
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
> (da0:mpt0:0:0:0): lost device
> (da0:mpt0:0:0:0): Synchronize cache failed, status == 0x4a, scsi status
> == 0x0
> (da0:mpt0:0:0:0): removing device entry
> (da6:mpt0:0:6:0): lost device
> (da6:mpt0:0:6:0): Synchronize cache failed, status == 0x4a, scsi status
> == 0x0
> (da6:mpt0:0:6:0): removing device entry
>
> -> Then replugging the drive "da0" :
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16

I know that the rescan after removing a device is a bit messy (lots of
messages before daX actually goes away), but I don't recall it taking such
a long time.

> Is there any documentation or hint as to what those mpt_cam_event are ?
> I could whip myself a quick patch to at least change the display so one
> would figure what these are.
>
> It feels like the 0x12 and 0x16 have to be handled to invalidate the
> device that has been unplugged so the next request won't timeout but
> fail directly.

The documentation is not public. The 0x12 and 0x16 messages are events
that I have seen. You can try talking to scottl@ as he has access to the
docs.

--
John Baldwin
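
On the idea of a quick patch to make the mpt_cam_event lines readable, here
is a minimal sketch of the kind of lookup such a patch might use. The helper
names (mpt_event_name, format_event) are hypothetical and are not part of the
driver. The three code-to-name pairs are an assumption based on the
MPI_EVENT_* constants in the MPI header that ships with the driver
(mpi_ioc.h) and should be checked against that header before use. Unknown
codes fall back to plain hex, as the driver prints today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One entry per event code we want to show by name. */
struct event_name {
	uint32_t	 code;
	const char	*name;
};

/*
 * ASSUMPTION: these values mirror the MPI_EVENT_* constants in mpi_ioc.h.
 * Only the codes seen in the dmesg trace are listed; verify before use.
 */
static const struct event_name event_names[] = {
	{ 0x00, "NONE" },
	{ 0x12, "SAS_PHY_LINK_STATUS" },
	{ 0x16, "SAS_DISCOVERY" },
};

/* Hypothetical helper: return the name for a code, or NULL if unknown. */
static const char *
mpt_event_name(uint32_t code)
{
	size_t i;

	for (i = 0; i < sizeof(event_names) / sizeof(event_names[0]); i++)
		if (event_names[i].code == code)
			return (event_names[i].name);
	return (NULL);
}

/*
 * Hypothetical helper: build the log line, appending the name when known.
 * Returns what snprintf() returns (the length the full line needs).
 */
static int
format_event(char *buf, size_t len, const char *dev, uint32_t code)
{
	const char *name;

	assert(buf != NULL && len > 0);
	name = mpt_event_name(code);
	if (name != NULL)
		return (snprintf(buf, len, "%s: mpt_cam_event: 0x%x (%s)",
		    dev, (unsigned int)code, name));
	return (snprintf(buf, len, "%s: mpt_cam_event: 0x%x",
	    dev, (unsigned int)code));
}
```

Keeping the hex value in the output even when a name is known means existing
log searches for "mpt_cam_event: 0x16" would still match.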