Date: Wed, 8 Jul 2015 07:46:52 +0200
From: Yamagi Burmeister
To: killing@multiplay.co.uk
Cc: freebsd-scsi@freebsd.org
Subject: Re: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-Id: <20150708074652.07a815e6aa08526d569f3077@yamagi.org>
In-Reply-To: <559C0184.4050102@multiplay.co.uk>
References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org> <559C0184.4050102@multiplay.co.uk>

Hello Steven,

since the issue occurs on all 3 servers, a midplane or cabling fault is at least unlikely. But I'll see what I can do.

Regards,
Yamagi

On Tue, 7 Jul 2015 17:42:44 +0100
Steven Hartland wrote:

> Have you eliminated the midplane / cabling as the issue? That's very
> common.
>
> On 07/07/2015 17:31, Yamagi Burmeister wrote:
> > Hello Stephen,
> > I'm seeing those errors on all 3 servers and on all 16 devices. The 2
> > dmesg entries were just an example. Where they occur seems to be
> > random. Maybe the second controller, mpr1, has a higher chance than
> > mpr0, but I'm not sure.
> >
> > My co-worker suspected FreeBSD's power management. On one of the servers
> > I forced C-states to C1 and deactivated powerd. In the last 2 hours no
> > new errors arose, but it's far too early to draw conclusions.
> >
> > Regards,
> > Yamagi
> >
> > On Tue, 7 Jul 2015 09:37:22 -0600
> > Stephen Mcconnell wrote:
> >
> >> Hi Yamagi,
> >>
> >> I see two drives that are having problems. Are there others? Can you try
> >> to remove those drives and let me know what happens. To me, it actually
> >> looks like those drives could be faulty.
> >>
> >> Steve
> >>
> >>> -----Original Message-----
> >>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-scsi@freebsd.org] On Behalf Of Yamagi Burmeister
> >>> Sent: Tuesday, July 07, 2015 5:24 AM
> >>> To: freebsd-scsi@freebsd.org
> >>> Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
> >>>
> >>> Hello,
> >>> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
> >>> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each adapter
> >>> serves 8 Intel DC S3700 SSDs. Operating system is 10.1-STABLE as of r283938 on
> >>> 2 servers and r285196 on the last one.
> >>>
> >>> The controllers identify themselves as:
> >>>
> >>> ----
> >>>
> >>> mpr0: port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
> >>> mpr0: IOCFacts :
> >>>     MsgVersion: 0x205
> >>>     HeaderVersion: 0x2300
> >>>     IOCNumber: 0
> >>>     IOCExceptions: 0x0
> >>>     MaxChainDepth: 128
> >>>     NumberOfPorts: 1
> >>>     RequestCredit: 10240
> >>>     ProductID: 0x2221
> >>>     IOCRequestFrameSize: 32
> >>>     MaxInitiators: 32
> >>>     MaxTargets: 1024
> >>>     MaxSasExpanders: 42
> >>>     MaxEnclosures: 43
> >>>     HighPriorityCredit: 128
> >>>     MaxReplyDescriptorPostQueueDepth: 65504
> >>>     ReplyFrameSize: 32
> >>>     MaxVolumes: 0
> >>>     MaxDevHandle: 1106
> >>>     MaxPersistentEntries: 128
> >>> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
> >>> mpr0: IOCCapabilities: 7a85c<...,HostDisc>
> >>>
> >>> ----
> >>>
> >>> 08.00.00.00 is the latest available firmware.
> >>>
> >>>
> >>> Since day one 'dmesg' is cluttered with CAM errors:
> >>>
> >>> ----
> >>>
> >>> mpr1: Sending reset from mprsas_send_abort for target ID 5
> >>> (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
> >>> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0 state c xfer 0
> >>> (da11:mpr1:0:5:0): CAM status: Command timeout
> >>> mpr1: (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
> >>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> >>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> >>> (da11:mpr1:0:5:0): SCSI status: Check Condition
> >>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> >>> (da11:mpr1:0:5:0): Retrying command (per sense data)
> >>> (da11:mpr1:0:5:0): READ(10).
> >>> CDB: 28 00 4c 22 b5 b8 00 00 18 00
> >>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> >>> (da11:mpr1:0:5:0): SCSI status: Check Condition
> >>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> >>> (da11:mpr1:0:5:0): Retrying command (per sense data)
> >>> (noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
> >>>
> >>> mpr1: Sending reset from mprsas_send_abort for target ID 2
> >>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
> >>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
> >>> mpr1: Unfreezing devq for target ID 2
> >>> (da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> >>> (da8:mpr1:0:2:0): CAM status: Command timeout
> >>> (da8:mpr1:0:2:0): Retrying command
> >>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
> >>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> >>> (da8:mpr1:0:2:0): SCSI status: Check Condition
> >>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> >>> (da8:mpr1:0:2:0): Retrying command (per sense data)
> >>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
> >>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> >>> (da8:mpr1:0:2:0): SCSI status: Check Condition
> >>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> >>> (da8:mpr1:0:2:0): Retrying command (per sense data)
> >>> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
> >>>
> >>> ----
> >>>
> >>> ZFS doesn't like this and sees read errors or even write errors.
> >>> In extreme cases the device is marked as FAULTED:
> >>>
> >>> ----
> >>>
> >>>   pool: examplepool
> >>>  state: DEGRADED
> >>> status: One or more devices are faulted in response to persistent errors.
> >>>         Sufficient replicas exist for the pool to continue functioning in a
> >>>         degraded state.
> >>> action: Replace the faulted device, or use 'zpool clear' to mark the device
> >>>         repaired.
> >>>   scan: none requested
> >>> config:
> >>>
> >>>         NAME          STATE     READ WRITE CKSUM
> >>>         examplepool   DEGRADED     0     0     0
> >>>           raidz1-0    ONLINE       0     0     0
> >>>             da3p1     ONLINE       0     0     0
> >>>             da4p1     ONLINE       0     0     0
> >>>             da5p1     ONLINE       0     0     0
> >>>         logs
> >>>           da1p1       FAULTED      3     0     0  too many errors
> >>>         cache
> >>>           da1p2       FAULTED      3     0     0  too many errors
> >>>         spares
> >>>           da2p1       AVAIL
> >>>
> >>> errors: No known data errors
> >>>
> >>> ----
> >>>
> >>> The problems arise on all 3 machines and all SSDs nearly daily. So I highly suspect
> >>> a software issue. Does anyone have an idea what's going on and what I can do to solve
> >>> these problems? More information can be provided if necessary.
> >>>
> >>> Regards,
> >>> Yamagi
> >>>
> >>> --
> >>> Homepage: www.yamagi.org
> >>> XMPP: yamagi@yamagi.org
> >>> GnuPG/GPG: 0xEFBCCBCB
> >>> _______________________________________________
> >>> freebsd-scsi@freebsd.org mailing list
> >>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> >>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

--
Homepage: www.yamagi.org
XMPP: yamagi@yamagi.org
GnuPG/GPG: 0xEFBCCBCB
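
P.S. To check whether the resets really cluster on one controller rather than being spread evenly, the abort messages in dmesg could be tallied per controller and target. What follows is only a minimal sketch in Python: the helper name count_resets and the regular expression are assumptions for illustration, matched solely against the "Sending reset from mprsas_send_abort" lines quoted in this thread, and the sample text is an abbreviated excerpt of those lines.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the messages quoted in this thread.
RESET_RE = re.compile(
    r"(?P<ctrl>mpr\d+): Sending reset from mprsas_send_abort "
    r"for target ID (?P<target>\d+)"
)


def count_resets(log_text):
    """Hypothetical helper: count resets keyed by (controller, target ID)."""
    counts = Counter()
    for match in RESET_RE.finditer(log_text):
        counts[(match.group("ctrl"), int(match.group("target")))] += 1
    return counts


# Abbreviated excerpt of the dmesg output quoted in this thread.
SAMPLE = """\
mpr1: Sending reset from mprsas_send_abort for target ID 5
(da11:mpr1:0:5:0): CAM status: Command timeout
mpr1: Sending reset from mprsas_send_abort for target ID 2
(da8:mpr1:0:2:0): CAM status: Command timeout
"""

if __name__ == "__main__":
    for (ctrl, target), n in sorted(count_resets(SAMPLE).items()):
        print(f"{ctrl} target {target}: {n} reset(s)")
```

Running the same function over the complete dmesg output of each server, instead of the short excerpt, would give a per-controller count to compare.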