FreeBSD Mail Archives

Date:      Tue, 7 Jul 2015 17:42:44 +0100
From:      Steven Hartland <killing@multiplay.co.uk>
To:        freebsd-scsi@freebsd.org
Subject:   Re: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-ID:  <559C0184.4050102@multiplay.co.uk>
In-Reply-To: <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org>
References:  <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org>

Have you eliminated the midplane / cabling as the cause? That's a very
common source of errors like these.
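
One rough way to check this from the logs is to tally the reset messages per controller and target. The sketch below is only an illustration, not part of mpr(4) or any FreeBSD tool: the function name count_resets is hypothetical, and the pattern assumes the "Sending reset from mprsas_send_abort" message format shown in the dmesg output quoted further down. Resets piling up on a few targets or on one controller would hint at particular slots or cables, while an even spread across everything would point elsewhere.

```python
import re
from collections import Counter

# Assumed message format, taken from the quoted dmesg output, e.g.
# "mpr1: Sending reset from mprsas_send_abort for target ID 5"
RESET_RE = re.compile(
    r"(mpr\d+): Sending reset from mprsas_send_abort for target ID (\d+)"
)


def count_resets(dmesg_text):
    """Count abort-triggered resets per (controller, target ID) pair.

    dmesg_text is the raw log as one string. The result is a Counter
    keyed by tuples such as ("mpr1", 5).
    """
    return Counter(
        (m.group(1), int(m.group(2))) for m in RESET_RE.finditer(dmesg_text)
    )
```

To use it, save the output of dmesg to a file and pass the contents in, for example count_resets(open("dmesg.txt").read()), where dmesg.txt is an assumed capture file name.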

On 07/07/2015 17:31, Yamagi Burmeister wrote:
> Hello Stephen,
> I'm seeing those errors on all 3 servers and on all 16 devices. The 2
> dmesg entries were just an example. It seems to be random where they
> occur. Maybe the second controller mpr1 has a higher chance than
> mpr0, but I'm not sure.
>
> My co-worker suspected FreeBSD's power management. On one of the servers
> I forced c-states to C1 and deactivated powerd. In the last 2 hours no
> new errors arose but it's far too early to draw conclusions.
>
> Regards,
> Yamagi
>
> On Tue, 7 Jul 2015 09:37:22 -0600
> Stephen Mcconnell <stephen.mcconnell@avagotech.com> wrote:
>
>> Hi Yamagi,
>>
>> I see two drives that are having problems.  Are there others?  Can you try
>> to remove those drives and let me know what happens.  To me, it actually
>> looks like those drives could be faulty.
>>
>> Steve
>>
>>> -----Original Message-----
>>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
>>> scsi@freebsd.org] On Behalf Of Yamagi Burmeister
>>> Sent: Tuesday, July 07, 2015 5:24 AM
>>> To: freebsd-scsi@freebsd.org
>>> Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
>>>
>>> Hello,
>>> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
>>> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each adapter
>>> serves 8 Intel DC S3700 SSDs. Operating system is 10.1-STABLE as of r283938 on
>>> 2 servers and r285196 on the last one.
>>>
>>> The controllers identify themselves as:
>>>
>>> ----
>>>
>>> mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
>>> mpr0: IOCFacts  : MsgVersion: 0x205
>>>          HeaderVersion: 0x2300
>>>          IOCNumber: 0
>>>          IOCExceptions: 0x0
>>>          MaxChainDepth: 128
>>>          NumberOfPorts: 1
>>>          RequestCredit: 10240
>>>          ProductID: 0x2221
>>>          IOCRequestFrameSize: 32
>>>          MaxInitiators: 32
>>>          MaxTargets: 1024
>>>          MaxSasExpanders: 42
>>>          MaxEnclosures: 43
>>>          HighPriorityCredit: 128
>>>          MaxReplyDescriptorPostQueueDepth: 65504
>>>          ReplyFrameSize: 32
>>>          MaxVolumes: 0
>>>          MaxDevHandle: 1106
>>>          MaxPersistentEntries: 128
>>> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
>>> mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>>>
>>> ----
>>>
>>> 08.00.00.00 is the last available firmware.
>>>
>>>
>>> Since day one 'dmesg' is cluttered with CAM errors:
>>>
>>> ----
>>>
>>> mpr1: Sending reset from mprsas_send_abort for target ID 5
>>>          (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
>>> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0 state c xfer 0
>>> (da11:mpr1:0:5:0): CAM status: Command timeout
>>> mpr1: (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
>>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
>>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
>>> (da11:mpr1:0:5:0): SCSI status: Check Condition
>>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> (da11:mpr1:0:5:0): Retrying command (per sense data)
>>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
>>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
>>> (da11:mpr1:0:5:0): SCSI status: Check Condition
>>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> (da11:mpr1:0:5:0): Retrying command (per sense data)
>>> (noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
>>>
>>> mpr1: Sending reset from mprsas_send_abort for target ID 2
>>>          (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
>>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
>>> mpr1: Unfreezing devq for target ID 2
>>> (da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
>>> (da8:mpr1:0:2:0): CAM status: Command timeout
>>> (da8:mpr1:0:2:0): Retrying command
>>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
>>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
>>> (da8:mpr1:0:2:0): SCSI status: Check Condition
>>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> (da8:mpr1:0:2:0): Retrying command (per sense data)
>>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
>>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
>>> (da8:mpr1:0:2:0): SCSI status: Check Condition
>>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> (da8:mpr1:0:2:0): Retrying command (per sense data)
>>> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
>>>
>>> ----
>>>
>>> ZFS doesn't like this and sees read errors or even write errors. In extreme cases
>>> the device is marked as FAULTED:
>>>
>>> ----
>>>
>>>    pool: examplepool
>>>   state: DEGRADED
>>> status: One or more devices are faulted in response to persistent errors.
>>> Sufficient replicas exist for the pool to continue functioning in a degraded state.
>>> action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
>>>    scan: none requested
>>> config:
>>>
>>> 	NAME        STATE     READ WRITE CKSUM
>>> 	examplepool DEGRADED     0     0     0
>>> 	  raidz1-0  ONLINE       0     0     0
>>> 	    da3p1   ONLINE       0     0     0
>>> 	    da4p1   ONLINE       0     0     0
>>> 	    da5p1   ONLINE       0     0     0
>>> 	logs
>>> 	  da1p1     FAULTED      3     0     0  too many errors
>>> 	cache
>>> 	  da1p2     FAULTED      3     0     0  too many errors
>>> 	spares
>>> 	  da2p1     AVAIL
>>>
>>> errors: No known data errors
>>>
>>> ----
>>>
>>> The problems arise on all 3 machines and all SSDs nearly daily. So I highly suspect
>>> a software issue. Does anyone have an idea what's going on and what I can do to solve
>>> these problems? More information can be provided if necessary.
>>>
>>> Regards,
>>> Yamagi
>>>
>>> --
>>> Homepage:  www.yamagi.org
>>> XMPP:      yamagi@yamagi.org
>>> GnuPG/GPG: 0xEFBCCBCB
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
>

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?559C0184.4050102>
