Date:      Wed, 8 Jul 2015 07:45:12 +0200
From:      Yamagi Burmeister <lists@yamagi.org>
To:        stephen.mcconnell@avagotech.com
Cc:        freebsd-scsi@freebsd.org
Subject:   Re: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-ID:  <20150708074512.e676c8a9a5b7c6d56d357a02@yamagi.org>
In-Reply-To: <9426ced85d7def424e106fdefd7448ae@mail.gmail.com>
References:  <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <9426ced85d7def424e106fdefd7448ae@mail.gmail.com>

Good morning,
it wasn't the power management. Last night the errors occurred on da6,
da7 and da9. This is the same machine as yesterday:

Jul  8 05:06:21 mars kernel: (noperiph:mpr1:0:4294967295:0): SMID 83 Aborting command 0xfffffe0001a684e0
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 0a 44 98 00 00 08 00 length 4096 SMID 556 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 10 bb a8 00 00 20 00 length 16384 SMID 745 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 680 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): WRITE(10). CDB: 2a 00 56 1b 1c 38 00 00 08 00 
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): CAM status: Command timeout
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): Retrying command
Jul  8 05:06:21 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 0a 44 98 00 00 08 00 length 4096 SMID 696 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 05:06:21 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 10 bb a8 00 00 20 00 length 16384 SMID 517 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 05:06:21 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): WRITE(10). CDB: 2a 00 56 1b 1c 38 00 00 08 00 length 4096 SMID 905 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 05:06:21 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 05:06:21 mars kernel: (da7:mpr1:0:1:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 290 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 05:06:22 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 0a 44 98 00 00 08 00 
Jul  8 05:06:22 mars kernel: (da7:mpr1:0:1:0): CAM status: SCSI Status Error
Jul  8 05:06:22 mars kernel: (da7:mpr1:0:1:0): SCSI status: Check Condition
Jul  8 05:06:22 mars kernel: (da7:mpr1:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  8 05:06:22 mars kernel: (da7:mpr1:0:1:0): Retrying command (per sense data)
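
For readers who want to map the logged CDBs back to disk offsets, here is a minimal sketch of decoding the READ(10)/WRITE(10) bytes shown above. It assumes the standard 10-byte layout (opcode in byte 0, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8) and 512-byte logical blocks. The function name `decode_rw10` and the `block_size` default are illustrative, not part of any FreeBSD tool.

```python
def decode_rw10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    block_size is an assumption (512-byte logical blocks).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return {
        "op": "READ(10)" if cdb[0] == 0x28 else "WRITE(10)",
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


if __name__ == "__main__":
    info = decode_rw10("28 00 48 0a 44 98 00 00 08 00")
    print(info["op"], hex(info["lba"]), info["bytes"])  # READ(10) 0x480a4498 4096
```

With 512-byte blocks the decoded sizes agree with the "length" field the driver prints (8 blocks for the 4096-byte read, 32 blocks for the 16384-byte read), which suggests the assumption holds for these drives.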

Jul  8 06:33:26 mars kernel: (noperiph:mpr1:0:4294967295:0): SMID 84 Aborting command 0xfffffe0001a32fc0
Jul  8 06:33:27 mars kernel: (da9:mpr1:0:3:0): READ(10). CDB: 28 00 48 0f bc 90 00 00 20 00 length 16384 SMID 703 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:33:27 mars kernel: (da9:mpr1:0:3:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 719 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:33:27 mars kernel: (da9:mpr1:0:3:0): WRITE(10). CDB: 2a 00 48 3c d0 58 00 00 10 00 
Jul  8 06:33:27 mars kernel: (da9:mpr1:0:3:0): CAM status: Command timeout
Jul  8 06:33:27 mars kernel: (da9:mpr1:0:3:0): Retrying command
Jul  8 06:33:27 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 06:33:27 mars kernel: (da9:mpr1:0:3:0): READ(10). CDB: 28 00 48 0f bc 90 00 00 20 00 length 16384 SMID 851 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:33:27 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 06:33:27 mars kernel: (da9:mpr1:0:3:0): WRITE(10). CDB: 2a 00 48 3c d0 58 00 00 10 00 length 8192 SMID 576 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:33:27 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 06:33:27 mars kernel: (da9:mpr1:0:3:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 854 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:33:28 mars kernel: (da9:mpr1:0:3:0): READ(10). CDB: 28 00 48 0f bc 90 00 00 20 00 
Jul  8 06:33:28 mars kernel: (da9:mpr1:0:3:0): CAM status: SCSI Status Error
Jul  8 06:33:28 mars kernel: (da9:mpr1:0:3:0): SCSI status: Check Condition
Jul  8 06:33:28 mars kernel: (da9:mpr1:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  8 06:33:28 mars kernel: (da9:mpr1:0:3:0): Retrying command (per sense data)
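
The repeated log_info(0x31110e00) value can be split into the fields the driver prints next to it. The sketch below assumes the usual 32-bit layout (bus type in the top 4 bits, originator in the next 4, an 8-bit code, and a 16-bit sub-code), which matches the originator(PL), code(0x11), sub_code(0x0e00) shown in the log. The function name and the originator table are illustrative.

```python
def decode_log_info(value: int) -> dict:
    """Split a 32-bit SAS log_info word into its bit fields.

    The layout is an assumption that matches the driver output in the log above.
    """
    originators = {0: "IOP", 1: "PL", 2: "IR"}
    return {
        "bus_type": (value >> 28) & 0xF,                         # 3 is expected for SAS
        "originator": originators.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    f = decode_log_info(0x31110E00)
    print(f["originator"], hex(f["code"]), hex(f["sub_code"]))  # PL 0x11 0xe00
```

This only reproduces what the driver already decodes. The meaning of code 0x11 / sub_code 0x0e00 itself is defined by the firmware and is not derived here.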

Jul  8 06:35:10 mars kernel: (noperiph:mpr1:0:4294967295:0): SMID 85 Aborting command 0xfffffe0001a70c10
Jul  8 06:35:10 mars kernel: (da6:mpr1:0:0:0): READ(10). CDB: 28 00 48 30 4a 40 00 00 18 00 length 12288 SMID 541 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:35:10 mars kernel: (da6:mpr1:0:0:0): WRITE(10). CDB: 2a 00 48 59 82 e8 00 00 10 00 length 8192 SMID 467 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:35:10 mars kernel: (da6:mpr1:0:0:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
Jul  8 06:35:10 mars kernel: (da6:mpr1:0:0:0): CAM status: Command timeout
Jul  8 06:35:10 mars kernel: (da6:mpr1:0:0:0): Retrying command
Jul  8 06:35:10 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 06:35:10 mars kernel: (da6:mpr1:0:0:0): READ(10). CDB: 28 00 48 30 4a 40 00 00 18 00 length 12288 SMID 870 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:35:10 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 06:35:10 mars kernel: (da6:mpr1:0:0:0): WRITE(10). CDB: 2a 00 48 59 82 e8 00 00 10 00 length 8192 SMID 478 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:35:10 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul  8 06:35:10 mars kernel: (da6:mpr1:0:0:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 764 terminated ioc 804b scsi 0 state c xfer 0
Jul  8 06:35:11 mars kernel: (da6:mpr1:0:0:0): READ(10). CDB: 28 00 48 30 4a 40 00 00 18 00 
Jul  8 06:35:11 mars kernel: (da6:mpr1:0:0:0): CAM status: SCSI Status Error
Jul  8 06:35:11 mars kernel: (da6:mpr1:0:0:0): SCSI status: Check Condition
Jul  8 06:35:11 mars kernel: (da6:mpr1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  8 06:35:11 mars kernel: (da6:mpr1:0:0:0): Retrying command (per sense data)
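
Every incident above includes the same ATA COMMAND PASS THROUGH(16) CDB. The following sketch unpacks it, assuming the 16-byte ATA PASS-THROUGH layout from the SCSI/ATA Translation standard (protocol and extend bit in byte 1, features in bytes 3-4, sector count in bytes 5-6, device in byte 13, ATA command in byte 14). The function name is illustrative.

```python
def decode_ata_pt16(cdb_hex: str) -> dict:
    """Unpack selected fields of an ATA PASS-THROUGH(16) CDB (opcode 0x85)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0xF,            # 6 means DMA
        "extend": cdb[1] & 0x1,                     # 1 means 48-bit command
        "features": (cdb[3] << 8) | cdb[4],
        "sector_count": (cdb[5] << 8) | cdb[6],
        "device": cdb[13],
        "command": cdb[14],                         # ATA command opcode
    }


if __name__ == "__main__":
    f = decode_ata_pt16("85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00")
    print(hex(f["command"]), f["features"], f["sector_count"])  # 0x6 1 1
```

ATA command 0x06 with feature bit 0 set corresponds to DATA SET MANAGEMENT with TRIM, so the passed-through command in each incident appears to be a TRIM request covering one 512-byte block of ranges. Whether TRIM is the trigger or just happens to be in flight when the timeout hits cannot be told from the log alone.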

Regards,
Yamagi


On Tue, 7 Jul 2015 09:37:22 -0600
Stephen Mcconnell <stephen.mcconnell@avagotech.com> wrote:

> Hi Yamagi,
> 
> I see two drives that are having problems.  Are there others?  Can you try
> removing those drives and let me know what happens?  To me, it actually
> looks like those drives could be faulty.
> 
> Steve
> 
> > -----Original Message-----
> > From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-scsi@freebsd.org]
> > On Behalf Of Yamagi Burmeister
> > Sent: Tuesday, July 07, 2015 5:24 AM
> > To: freebsd-scsi@freebsd.org
> > Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
> >
> > Hello,
> > I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
> > Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
> > adapter serves 8 Intel DC S3700 SSDs. The operating system is 10.1-STABLE
> > as of r283938 on 2 servers and r285196 on the last one.
> >
> > The controllers identify themselves as:
> >
> > ----
> >
> > mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
> > mpr0: IOCFacts  : MsgVersion: 0x205
> >         HeaderVersion: 0x2300
> >         IOCNumber: 0
> >         IOCExceptions: 0x0
> >         MaxChainDepth: 128
> >         NumberOfPorts: 1
> >         RequestCredit: 10240
> >         ProductID: 0x2221
> >         IOCRequestFrameSize: 32
> >         MaxInitiators: 32
> >         MaxTargets: 1024
> >         MaxSasExpanders: 42
> >         MaxEnclosures: 43
> >         HighPriorityCredit: 128
> >         MaxReplyDescriptorPostQueueDepth: 65504
> >         ReplyFrameSize: 32
> >         MaxVolumes: 0
> >         MaxDevHandle: 1106
> >         MaxPersistentEntries: 128
> > mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
> > mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
> >
> > ----
> >
> > 08.00.00.00 is the latest available firmware.
> >
> >
> > Since day one, 'dmesg' has been cluttered with CAM errors:
> >
> > ----
> >
> > mpr1: Sending reset from mprsas_send_abort for target ID 5
> > (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
> > (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 terminated ioc 804b scsi 0 state c xfer 0
> > (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> > (da11:mpr1:0:5:0): CAM status: Command timeout
> > mpr1: Unfreezing devq for target ID 5
> > (da11:mpr1:0:5:0): Retrying command
> > (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> > (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> > (da11:mpr1:0:5:0): SCSI status: Check Condition
> > (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> > (da11:mpr1:0:5:0): Retrying command (per sense data)
> > (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
> > (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> > (da11:mpr1:0:5:0): SCSI status: Check Condition
> > (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> > (da11:mpr1:0:5:0): Retrying command (per sense data)
> > (noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
> >
> > mpr1: Sending reset from mprsas_send_abort for target ID 2
> > (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
> > (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
> > mpr1: Unfreezing devq for target ID 2
> > (da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> > (da8:mpr1:0:2:0): CAM status: Command timeout
> > (da8:mpr1:0:2:0): Retrying command
> > (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
> > (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> > (da8:mpr1:0:2:0): SCSI status: Check Condition
> > (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> > (da8:mpr1:0:2:0): Retrying command (per sense data)
> > (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
> > (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> > (da8:mpr1:0:2:0): SCSI status: Check Condition
> > (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> > (da8:mpr1:0:2:0): Retrying command (per sense data)
> > (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
> >
> > ----
> >
> > ZFS doesn't like this and sees read errors or even write errors. In
> > extreme cases the device is marked as FAULTED:
> >
> > ----
> >
> >   pool: examplepool
> >  state: DEGRADED
> > status: One or more devices are faulted in response to persistent errors.
> > 	Sufficient replicas exist for the pool to continue functioning in a
> > 	degraded state.
> > action: Replace the faulted device, or use 'zpool clear' to mark the device
> > 	repaired.
> >   scan: none requested
> > config:
> >
> > 	NAME        STATE     READ WRITE CKSUM
> > 	examplepool DEGRADED     0     0     0
> > 	  raidz1-0  ONLINE       0     0     0
> > 	    da3p1   ONLINE       0     0     0
> > 	    da4p1   ONLINE       0     0     0
> > 	    da5p1   ONLINE       0     0     0
> > 	logs
> > 	  da1p1     FAULTED      3     0     0  too many errors
> > 	cache
> > 	  da1p2     FAULTED      3     0     0  too many errors
> > 	spares
> > 	  da2p1     AVAIL
> >
> > errors: No known data errors
> >
> > ----
> >
> > The problems arise on all 3 machines and all SSDs nearly daily, so I
> > strongly suspect a software issue. Does anyone have an idea what's going
> > on and what I can do to solve these problems? More information can be
> > provided if necessary.
> >
> > Regards,
> > Yamagi
> >
> > --
> > Homepage:  www.yamagi.org
> > XMPP:      yamagi@yamagi.org
> > GnuPG/GPG: 0xEFBCCBCB
> > _______________________________________________
> > freebsd-scsi@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"


-- 
Homepage:  www.yamagi.org
XMPP:      yamagi@yamagi.org
GnuPG/GPG: 0xEFBCCBCB


