Date: Wed, 8 Jul 2015 07:46:52 +0200
From: Yamagi Burmeister <lists@yamagi.org>
To: killing@multiplay.co.uk
Cc: freebsd-scsi@freebsd.org
Subject: Re: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-Id: <20150708074652.07a815e6aa08526d569f3077@yamagi.org>
In-Reply-To: <559C0184.4050102@multiplay.co.uk>
References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org>
 <9426ced85d7def424e106fdefd7448ae@mail.gmail.com>
 <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org>
 <559C0184.4050102@multiplay.co.uk>
X-Mailer: Sylpheed 3.4.2 (GTK+ 2.24.28; x86_64-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Hello Steven,
since the issue occurs on all 3 servers, a midplane or cabling fault is
at least unlikely. But I'll see what I can do.

Regards,
Yamagi

On Tue, 7 Jul 2015 17:42:44 +0100
Steven Hartland <killing@multiplay.co.uk> wrote:

> Have you eliminated the midplane / cabling as the issue as that's very 
> common.
> 
> On 07/07/2015 17:31, Yamagi Burmeister wrote:
> > Hello Stephen,
> > I'm seeing those errors on all 3 servers and on all 16 devices. The 2
> > dmesg entries were just an example. It seems to be random where they
> > occur. Maybe the second controller mpr1 has a higher chance than
> > mpr0, but I'm not sure.
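
Whether the second controller really is hit more often than the first can
be checked by counting the timeouts per device and per controller. Below is
a minimal Python sketch of such a tally. The function name count_timeouts
and the regular expression are my own assumptions, modelled on the
"CAM status: Command timeout" lines in the dmesg excerpt quoted further
down; it would have to be fed the real log contents.

```python
import re
from collections import Counter

# Assumed line shape, taken from the dmesg excerpt in this thread:
#   (da11:mpr1:0:5:0): CAM status: Command timeout
TIMEOUT_RE = re.compile(
    r"\((da\d+):(mpr\d+):\d+:\d+:\d+\): CAM status: Command timeout"
)


def count_timeouts(log_text):
    """Return (per_device, per_controller) Counters of command timeouts."""
    per_device = Counter()
    per_controller = Counter()
    for device, controller in TIMEOUT_RE.findall(log_text):
        per_device[device] += 1
        per_controller[controller] += 1
    return per_device, per_controller


if __name__ == "__main__":
    # Shortened sample; the last line is not a timeout and is ignored.
    sample = (
        "(da11:mpr1:0:5:0): CAM status: Command timeout\n"
        "(da8:mpr1:0:2:0): CAM status: Command timeout\n"
        "(da8:mpr1:0:2:0): CAM status: SCSI Status Error\n"
    )
    devices, controllers = count_timeouts(sample)
    print(dict(devices))
    print(dict(controllers))
```

Only the timeout lines are counted. The UNIT ATTENTION retries that follow
each reset are a consequence of the reset and would inflate the numbers.
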
> >
> > My co-worker suspected FreeBSD's power management. On one of the servers
> > I forced C-states to C1 and deactivated powerd. In the last 2 hours no
> > new errors arose, but it's far too early to draw conclusions.
> >
> > Regards,
> > Yamagi
> >
> > On Tue, 7 Jul 2015 09:37:22 -0600
> > Stephen Mcconnell <stephen.mcconnell@avagotech.com> wrote:
> >
> >> Hi Yamagi,
> >>
> >> I see two drives that are having problems.  Are there others?  Can you try
> >> removing those drives and let me know what happens?  To me, it actually
> >> looks like those drives could be faulty.
> >>
> >> Steve
> >>
> >>> -----Original Message-----
> >>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
> >>> scsi@freebsd.org] On Behalf Of Yamagi Burmeister
> >>> Sent: Tuesday, July 07, 2015 5:24 AM
> >>> To: freebsd-scsi@freebsd.org
> >>> Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
> >>>
> >>> Hello,
> >>> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
> >>> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
> >>> adapter serves 8 Intel DC S3700 SSDs. The operating system is
> >>> 10.1-STABLE as of r283938 on 2 servers and r285196 on the last one.
> >>>
> >>> The controllers identify themselves as:
> >>>
> >>> ----
> >>>
> >>> mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
> >>> mpr0: IOCFacts  : MsgVersion: 0x205
> >>>          HeaderVersion: 0x2300
> >>>          IOCNumber: 0
> >>>          IOCExceptions: 0x0
> >>>          MaxChainDepth: 128
> >>>          NumberOfPorts: 1
> >>>          RequestCredit: 10240
> >>>          ProductID: 0x2221
> >>>          IOCRequestFrameSize: 32
> >>>          MaxInitiators: 32
> >>>          MaxTargets: 1024
> >>>          MaxSasExpanders: 42
> >>>          MaxEnclosures: 43
> >>>          HighPriorityCredit: 128
> >>>          MaxReplyDescriptorPostQueueDepth: 65504
> >>>          ReplyFrameSize: 32
> >>>          MaxVolumes: 0
> >>>          MaxDevHandle: 1106
> >>>          MaxPersistentEntries: 128
> >>> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
> >>> mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
> >>>
> >>> ----
> >>>
> >>> 08.00.00.00 is the latest available firmware.
> >>>
> >>>
> >>> Since day one, 'dmesg' has been cluttered with CAM errors:
> >>>
> >>> ----
> >>>
> >>> mpr1: Sending reset from mprsas_send_abort for target ID 5
> >>>          (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
> >>> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 terminated ioc 804b scsi 0 state c xfer 0
> >>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> >>> (da11:mpr1:0:5:0): CAM status: Command timeout
> >>> mpr1: Unfreezing devq for target ID 5
> >>> (da11:mpr1:0:5:0): Retrying command
> >>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> >>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> >>> (da11:mpr1:0:5:0): SCSI status: Check Condition
> >>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> >>> (da11:mpr1:0:5:0): Retrying command (per sense data)
> >>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
> >>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> >>> (da11:mpr1:0:5:0): SCSI status: Check Condition
> >>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> >>> (da11:mpr1:0:5:0): Retrying command (per sense data)
> >>> (noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
> >>>
> >>> mpr1: Sending reset from mprsas_send_abort for target ID 2
> >>>          (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
> >>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
> >>> mpr1: Unfreezing devq for target ID 2
> >>> (da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> >>> (da8:mpr1:0:2:0): CAM status: Command timeout
> >>> (da8:mpr1:0:2:0): Retrying command
> >>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
> >>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> >>> (da8:mpr1:0:2:0): SCSI status: Check Condition
> >>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> >>> (da8:mpr1:0:2:0): Retrying command (per sense data)
> >>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
> >>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> >>> (da8:mpr1:0:2:0): SCSI status: Check Condition
> >>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> >>> (da8:mpr1:0:2:0): Retrying command (per sense data)
> >>> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
> >>>
> >>> ----
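
As a sanity check, the READ(10)/WRITE(10) CDBs in the log above can be
decoded by hand: byte 0 is the opcode (0x28 read, 0x2a write), bytes 2-5
carry the big-endian LBA and bytes 7-8 the transfer length in blocks. A
small Python sketch of that decoding follows. decode_rw10 is a hypothetical
helper of mine, not part of any FreeBSD tool, and the 512 byte logical
block size is an assumption.

```python
BLOCK_SIZE = 512  # assumed logical block size in bytes


def decode_rw10(cdb_hex):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return {
        "op": "READ(10)" if cdb[0] == 0x28 else "WRITE(10)",
        "lba": int.from_bytes(cdb[2:6], "big"),  # starting logical block
        "blocks": blocks,
        "bytes": blocks * BLOCK_SIZE,
    }


if __name__ == "__main__":
    # First WRITE(10) from the da11 excerpt.
    print(decode_rw10("2a 00 4c 15 1f 88 00 00 08 00"))
```

For the first WRITE(10) this gives 8 blocks, i.e. 8 * 512 = 4096 bytes,
which matches the "length 4096" reported by the driver. The aborted
commands are therefore small, ordinary I/O and nothing exotic.
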
> >>>
> >>> ZFS doesn't like this and sees read errors or even write errors. In
> >>> extreme cases the device is marked as FAULTED:
> >>>
> >>> ----
> >>>
> >>>    pool: examplepool
> >>>   state: DEGRADED
> >>> status: One or more devices are faulted in response to persistent errors.
> >>> 	Sufficient replicas exist for the pool to continue functioning in a
> >>> 	degraded state.
> >>> action: Replace the faulted device, or use 'zpool clear' to mark the device
> >>> 	repaired.
> >>>    scan: none requested
> >>> config:
> >>>
> >>> 	NAME        STATE     READ WRITE CKSUM
> >>> 	examplepool DEGRADED     0     0     0
> >>> 	  raidz1-0  ONLINE       0     0     0
> >>> 	    da3p1   ONLINE       0     0     0
> >>> 	    da4p1   ONLINE       0     0     0
> >>> 	    da5p1   ONLINE       0     0     0
> >>> 	logs
> >>> 	  da1p1     FAULTED      3     0     0  too many errors
> >>> 	cache
> >>> 	  da1p2     FAULTED      3     0     0  too many errors
> >>> 	spares
> >>> 	  da2p1     AVAIL
> >>>
> >>> errors: No known data errors
> >>>
> >>> ----
> >>>
> >>> The problems arise on all 3 machines and on all SSDs nearly daily, so I
> >>> highly suspect a software issue. Does anyone have an idea what's going
> >>> on and what I can do to solve these problems? More information can be
> >>> provided if necessary.
> >>>
> >>> Regards,
> >>> Yamagi
> >>>
> >>> --
> >>> Homepage:  www.yamagi.org
> >>> XMPP:      yamagi@yamagi.org
> >>> GnuPG/GPG: 0xEFBCCBCB
> >>> _______________________________________________
> >>> freebsd-scsi@freebsd.org mailing list
> >>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> >>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
> >
> 


-- 
Homepage:  www.yamagi.org
XMPP:      yamagi@yamagi.org
GnuPG/GPG: 0xEFBCCBCB