Date: Tue, 7 Jul 2015 17:42:44 +0100
From: Steven Hartland <killing@multiplay.co.uk>
To: freebsd-scsi@freebsd.org
Subject: Re: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-ID: <559C0184.4050102@multiplay.co.uk>
In-Reply-To: <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org>
References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org>
 <9426ced85d7def424e106fdefd7448ae@mail.gmail.com>
 <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org>

Have you ruled out the midplane / cabling as the cause? That's a very
common culprit.

On 07/07/2015 17:31, Yamagi Burmeister wrote:
> Hello Stephen,
> I'm seeing those errors on all 3 servers and on all 16 devices. The 2
> dmesg entries were just an example. It seems to be random where they
> occur. Maybe the second controller mpr1 has a higher chance than
> mpr0, but I'm not sure.
>
> My co-worker suspected FreeBSD's power management. On one of the servers
> I forced C-states to C1 and deactivated powerd. In the last 2 hours no
> new errors arose, but it's far too early to draw conclusions.
>
> Regards,
> Yamagi
>
> On Tue, 7 Jul 2015 09:37:22 -0600
> Stephen Mcconnell <stephen.mcconnell@avagotech.com> wrote:
>
>> Hi Yamagi,
>>
>> I see two drives that are having problems. Are there others? Can you try
>> removing those drives and let me know what happens? To me, it actually
>> looks like those drives could be faulty.
>>
>> Steve
>>
>>> -----Original Message-----
>>> From: owner-freebsd-scsi@freebsd.org
>>> [mailto:owner-freebsd-scsi@freebsd.org] On Behalf Of Yamagi Burmeister
>>> Sent: Tuesday, July 07, 2015 5:24 AM
>>> To: freebsd-scsi@freebsd.org
>>> Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
>>>
>>> Hello,
>>> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
>>> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
>>> adapter serves 8 Intel DC S3700 SSDs. The operating system is
>>> 10.1-STABLE as of r283938 on 2 servers and r285196 on the last one.
>>>
>>> The controllers identify themselves as:
>>>
>>> ----
>>>
>>> mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem
>>> 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
>>> mpr0: IOCFacts :
>>>     MsgVersion: 0x205
>>>     HeaderVersion: 0x2300
>>>     IOCNumber: 0
>>>     IOCExceptions: 0x0
>>>     MaxChainDepth: 128
>>>     NumberOfPorts: 1
>>>     RequestCredit: 10240
>>>     ProductID: 0x2221
>>>     IOCRequestFrameSize: 32
>>>     MaxInitiators: 32
>>>     MaxTargets: 1024
>>>     MaxSasExpanders: 42
>>>     MaxEnclosures: 43
>>>     HighPriorityCredit: 128
>>>     MaxReplyDescriptorPostQueueDepth: 65504
>>>     ReplyFrameSize: 32
>>>     MaxVolumes: 0
>>>     MaxDevHandle: 1106
>>>     MaxPersistentEntries: 128
>>> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
>>> mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>>>
>>> ----
>>>
>>> 08.00.00.00 is the latest available firmware.
>>>
>>>
>>> Since day one 'dmesg' is cluttered with CAM errors:
>>>
>>> ----
>>>
>>> mpr1: Sending reset from mprsas_send_abort for target ID 5
>>> (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
>>> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0 state c xfer 0
>>> (da11:mpr1:0:5:0): CAM status: Command timeout
>>> mpr1: (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
>>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
>>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
>>> (da11:mpr1:0:5:0): SCSI status: Check Condition
>>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> (da11:mpr1:0:5:0): Retrying command (per sense data)
>>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
>>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
>>> (da11:mpr1:0:5:0): SCSI status: Check Condition
>>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> (da11:mpr1:0:5:0): Retrying command (per sense data)
>>> (noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
>>>
>>> mpr1: Sending reset from mprsas_send_abort for target ID 2
>>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
>>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
>>> mpr1: Unfreezing devq for target ID 2
>>> (da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
>>> (da8:mpr1:0:2:0): CAM status: Command timeout
>>> (da8:mpr1:0:2:0): Retrying command
>>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
>>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
>>> (da8:mpr1:0:2:0): SCSI status: Check Condition
>>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> (da8:mpr1:0:2:0): Retrying command (per sense data)
>>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
>>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
>>> (da8:mpr1:0:2:0): SCSI status: Check Condition
>>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> (da8:mpr1:0:2:0): Retrying command (per sense data)
>>> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
>>>
>>> ----
>>>
>>> ZFS doesn't like this and sees read errors or even write errors. In
>>> extreme cases the device is marked as FAULTED:
>>>
>>> ----
>>>
>>>   pool: examplepool
>>>  state: DEGRADED
>>> status: One or more devices are faulted in response to persistent errors.
>>>         Sufficient replicas exist for the pool to continue functioning in a
>>>         degraded state.
>>> action: Replace the faulted device, or use 'zpool clear' to mark the device
>>>         repaired.
>>>   scan: none requested
>>> config:
>>>
>>>         NAME          STATE     READ WRITE CKSUM
>>>         examplepool   DEGRADED     0     0     0
>>>           raidz1-0    ONLINE       0     0     0
>>>             da3p1     ONLINE       0     0     0
>>>             da4p1     ONLINE       0     0     0
>>>             da5p1     ONLINE       0     0     0
>>>         logs
>>>           da1p1       FAULTED      3     0     0  too many errors
>>>         cache
>>>           da1p2       FAULTED      3     0     0  too many errors
>>>         spares
>>>           da2p1       AVAIL
>>>
>>> errors: No known data errors
>>>
>>> ----
>>>
>>> The problems arise on all 3 machines and all SSDs nearly daily, so I
>>> strongly suspect a software issue. Does anyone have an idea what's going
>>> on and what I can do to solve these problems? More information can be
>>> provided if necessary.
>>>
>>> Regards,
>>> Yamagi
>>>
>>> --
>>> Homepage:  www.yamagi.org
>>> XMPP:      yamagi@yamagi.org
>>> GnuPG/GPG: 0xEFBCCBCB
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
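
To judge whether the resets and timeouts really cluster on one controller
or one target, it can help to tally them from saved dmesg output rather
than eyeballing the log. What follows is only a minimal sketch in Python
and is not something posted in this thread: the function name
tally_mpr_events and the SAMPLE text are illustrative, and the two regular
expressions assume the message formats seen in the dmesg excerpt quoted
above. Interleaved kernel output can split a message in two, so the counts
should be read as a lower bound.

```python
import re
from collections import Counter

# Patterns assume the message formats seen in the quoted dmesg excerpt.
RESET_RE = re.compile(
    r"(mpr\d+): Sending reset from mprsas_send_abort for target ID (\d+)"
)
TIMEOUT_RE = re.compile(
    r"\((da\d+):(mpr\d+):\d+:\d+:\d+\): CAM status: Command timeout"
)


def tally_mpr_events(log_text):
    """Count resets per (controller, target) and timeouts per (controller, disk)."""
    resets = Counter(
        (m.group(1), int(m.group(2))) for m in RESET_RE.finditer(log_text)
    )
    timeouts = Counter(
        (m.group(2), m.group(1)) for m in TIMEOUT_RE.finditer(log_text)
    )
    return resets, timeouts


# Illustrative input: four messages copied from the excerpt above.
SAMPLE = """\
mpr1: Sending reset from mprsas_send_abort for target ID 5
(da11:mpr1:0:5:0): CAM status: Command timeout
mpr1: Sending reset from mprsas_send_abort for target ID 2
(da8:mpr1:0:2:0): CAM status: Command timeout
"""

if __name__ == "__main__":
    resets, timeouts = tally_mpr_events(SAMPLE)
    for (ctrl, target), count in sorted(resets.items()):
        print(f"{ctrl} target {target}: {count} reset(s)")
    for (ctrl, disk), count in sorted(timeouts.items()):
        print(f"{ctrl} {disk}: {count} command timeout(s)")
```

Because the patterns are applied with finditer over the whole text, the
script does not depend on where the log happens to be line-wrapped. Feeding
it the contents of /var/log/messages from each server would give per-target
counts that can be compared against the cabling and drive-fault theories.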