From owner-freebsd-scsi@freebsd.org Tue Jul 7 12:02:24 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 83606995AAC for ; Tue, 7 Jul 2015 12:02:24 +0000 (UTC) (envelope-from lists@yamagi.org) Received: from mail1.yamagi.org (yugo.yamagi.org [212.48.122.103]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4AC59104F for ; Tue, 7 Jul 2015 12:02:23 +0000 (UTC) (envelope-from lists@yamagi.org) Received: from [192.168.100.101] (helo=aka) by mail1.yamagi.org with esmtpsa (TLSv1:DHE-RSA-AES256-SHA:256) (Exim 4.85 (FreeBSD)) (envelope-from ) id 1ZCQz7-0000LK-GC; Tue, 07 Jul 2015 13:24:22 +0200 Date: Tue, 7 Jul 2015 13:24:16 +0200 From: Yamagi Burmeister To: freebsd-scsi@freebsd.org Subject: Device timeouts(?) with LSI SAS3008 on mpr(4) Message-Id: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> X-Mailer: Sylpheed 3.4.2 (GTK+ 2.24.27; amd64-portbld-freebsd10.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2015 12:02:24 -0000

Hello,

I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform. Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each adapter serves 8 Intel DC S3700 SSDs. The operating system is 10.1-STABLE as of r283938 on 2 servers and r285196 on the last one.

The controllers identify themselves as:

----
mpr0: port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
mpr0: IOCFacts :
    MsgVersion: 0x205
    HeaderVersion: 0x2300
    IOCNumber: 0
    IOCExceptions: 0x0
    MaxChainDepth: 128
    NumberOfPorts: 1
    RequestCredit: 10240
    ProductID: 0x2221
    IOCRequestFrameSize: 32
    MaxInitiators: 32
    MaxTargets: 1024
    MaxSasExpanders: 42
    MaxEnclosures: 43
    HighPriorityCredit: 128
    MaxReplyDescriptorPostQueueDepth: 65504
    ReplyFrameSize: 32
    MaxVolumes: 0
    MaxDevHandle: 1106
    MaxPersistentEntries: 128
mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
mpr0: IOCCapabilities: 7a85c
----

08.00.00.00 is the latest available firmware.

Since day one, 'dmesg' has been cluttered with CAM errors:

----
mpr1: Sending reset from mprsas_send_abort for target ID 5
(da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
(da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0 state c xfer 0
(da11:mpr1:0:5:0): CAM status: Command timeout
mpr1: (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
(da11:mpr1:0:5:0): CAM status: SCSI Status Error
(da11:mpr1:0:5:0): SCSI status: Check Condition
(da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da11:mpr1:0:5:0): Retrying command (per sense data)
(da11:mpr1:0:5:0): READ(10). 
CDB: 28 00 4c 22 b5 b8 00 00 18 00
(da11:mpr1:0:5:0): CAM status: SCSI Status Error
(da11:mpr1:0:5:0): SCSI status: Check Condition
(da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da11:mpr1:0:5:0): Retrying command (per sense data)
(noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
mpr1: Sending reset from mprsas_send_abort for target ID 2
(da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
(da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
mpr1: Unfreezing devq for target ID 2
(da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
(da8:mpr1:0:2:0): CAM status: Command timeout
(da8:mpr1:0:2:0): Retrying command
(da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
(da8:mpr1:0:2:0): CAM status: SCSI Status Error
(da8:mpr1:0:2:0): SCSI status: Check Condition
(da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da8:mpr1:0:2:0): Retrying command (per sense data)
(da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
(da8:mpr1:0:2:0): CAM status: SCSI Status Error
(da8:mpr1:0:2:0): SCSI status: Check Condition
(da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da8:mpr1:0:2:0): Retrying command (per sense data)
(noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
----

ZFS doesn't like this and sees read errors or even write errors. In extreme cases the device is marked as FAULTED:

----
  pool: examplepool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        examplepool   DEGRADED     0     0     0
          raidz1-0    ONLINE       0     0     0
            da3p1     ONLINE       0     0     0
            da4p1     ONLINE       0     0     0
            da5p1     ONLINE       0     0     0
        logs
          da1p1       FAULTED      3     0     0  too many errors
        cache
          da1p2       FAULTED      3     0     0  too many errors
        spares
          da2p1       AVAIL

errors: No known data errors
----

The problems arise on all 3 machines and on all SSDs nearly daily, so I strongly suspect a software issue. Does anyone have an idea what's going on and what I can do to solve this problem? More information can be provided if necessary.

Regards,
Yamagi

--
Homepage: www.yamagi.org
XMPP: yamagi@yamagi.org
GnuPG/GPG: 0xEFBCCBCB

From owner-freebsd-scsi@freebsd.org Tue Jul 7 15:37:25 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id F0A36995F34 for ; Tue, 7 Jul 2015 15:37:25 +0000 (UTC) (envelope-from stephen.mcconnell@avagotech.com) Received: from mail-vn0-x233.google.com (mail-vn0-x233.google.com [IPv6:2607:f8b0:400c:c0f::233]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id AE1451D7E for ; Tue, 7 Jul 2015 15:37:25 +0000 (UTC) (envelope-from stephen.mcconnell@avagotech.com) Received: by vnbf7 with SMTP id f7so16381826vnb.0 for ; Tue, 07 Jul 2015 08:37:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=avagotech.com; s=google; h=from:references:in-reply-to:mime-version:thread-index:date :message-id:subject:to:content-type; bh=bbcIyltYojc0gLxutuNbAofzT86Z78VJ6ws17oqPbPU=; b=B4xUhOhTvOxWa33Gsv69eLnhzzIv3uuAlWHOafq1Mw2gpD3fleLheXaa5iVXZm/B3p 9VvnfyZlGQ1gPrBHRr4+160pjCy5pmbjOEsBI7X0vyBirBGvg7ccG79PLAnZcnXNIxdO xcpWd7v9JGf/lEi0PG9TzR6r+dV7rh6reqY3c= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; 
h=x-gm-message-state:from:references:in-reply-to:mime-version :thread-index:date:message-id:subject:to:content-type; bh=bbcIyltYojc0gLxutuNbAofzT86Z78VJ6ws17oqPbPU=; b=HHD1AV/cn6oD5fqL23ECpPaROMCyUtwwkUmJAuP3/QuEDxfpoTW/O7uA8hFr7zoaVW DRW5g6XOsUT8ltx96rX2qrNhsE7W9b7PlVC4UxcVPCjEoc16qUSu60OJoUrlN5sTQR+u 0ILiuiOG7MdHdE1kNyUuXlUyKDdXNmxmP9osRCVSfKCHs5Gaur19F2CJ0IP3wI085TEF j1I74ADABgyjt+XuNxUlhV5Bu70Q0FYaFU58CgQkVQDMo8yzfPhoE2gZa/wTznyeKF/4 p+4mbNCCACTIG5yKUPyNwgS2KxmGEU7jAWVRoWoxjr4RbqIoUYWX1Twtn3i9XYGiFWGc Tklg== X-Gm-Message-State: ALoCoQkrTMkMVjHo5jsGmJNUAQiAji4yWMYv1PzS7doSgku44xAKFmCMpEOYYNt93kry9b3W/NRk X-Received: by 10.52.114.230 with SMTP id jj6mr4871589vdb.66.1436283444321; Tue, 07 Jul 2015 08:37:24 -0700 (PDT) From: Stephen Mcconnell References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> In-Reply-To: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> MIME-Version: 1.0 X-Mailer: Microsoft Outlook 14.0 Thread-Index: AQGYg4GhL/MCbKjUmWvd1rpHhP9ph55AjWnQ Date: Tue, 7 Jul 2015 09:37:22 -0600 Message-ID: <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> Subject: RE: Device timeouts(?) with LSI SAS3008 on mpr(4) To: Yamagi Burmeister , freebsd-scsi@freebsd.org Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2015 15:37:26 -0000 Hi Yamagi, I see two drives that are having problems. Are there others? Can you try to remove those drives and let me know what happens. To me, it actually looks like those drives could be faulty. Steve > -----Original Message----- > From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- > scsi@freebsd.org] On Behalf Of Yamagi Burmeister > Sent: Tuesday, July 07, 2015 5:24 AM > To: freebsd-scsi@freebsd.org > Subject: Device timeouts(?) 
with LSI SAS3008 on mpr(4)

From owner-freebsd-scsi@freebsd.org Tue Jul 7 16:31:48 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E0317996D42 for ; Tue, 7 Jul 2015 16:31:47 +0000 (UTC) (envelope-from lists@yamagi.org) Received: from mail1.yamagi.org (yugo.yamagi.org [212.48.122.103]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id A3ED21543 for ; Tue, 7 Jul 2015 16:31:47 +0000 (UTC) (envelope-from lists@yamagi.org) Received: from p4fed1304.dip0.t-ipconnect.de ([79.237.19.4] helo=kosei.home.yamagi.org.dhcp.yamagi.org) by mail1.yamagi.org with esmtpsa (TLSv1:DHE-RSA-AES256-SHA:256) (Exim 4.85 (FreeBSD)) (envelope-from ) id 1ZCVmX-0004Nc-7L; Tue, 07 Jul 2015 18:31:42 +0200 Date: Tue, 7 Jul 2015 18:31:35 +0200 From: Yamagi Burmeister To: stephen.mcconnell@avagotech.com Cc: freebsd-scsi@freebsd.org Subject: Re: Device timeouts(?) 
with LSI SAS3008 on mpr(4) Message-Id: <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org> In-Reply-To: <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> X-Mailer: Sylpheed 3.4.2 (GTK+ 2.24.28; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2015 16:31:48 -0000

Hello Stephen,

I'm seeing those errors on all 3 servers and on all 16 devices. The 2 dmesg entries were just an example. It seems to be random where they occur. Maybe the second controller, mpr1, has a higher chance than mpr0, but I'm not sure.

My co-worker suspected FreeBSD's power management. On one of the servers I forced C-states to C1 and deactivated powerd. In the last 2 hours no new errors arose, but it's far too early to draw conclusions.

Regards,
Yamagi

On Tue, 7 Jul 2015 09:37:22 -0600
Stephen Mcconnell wrote:

> Hi Yamagi,
>
> I see two drives that are having problems. Are there others? Can you try
> to remove those drives and let me know what happens. To me, it actually
> looks like those drives could be faulty.
>
> Steve
>
> > -----Original Message-----
> > From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-scsi@freebsd.org]
> > On Behalf Of Yamagi Burmeister
> > Sent: Tuesday, July 07, 2015 5:24 AM
> > To: freebsd-scsi@freebsd.org
> > Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
> >
> > Hello,
> > I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
> > Each server is equiped with 2 LSI SAS9300-8i-SQL SAS adapters. Each adapter
> > serves 8 Intel DC S3700 SSDs. Operating system is 10.1-STABLE as of r283938 on
> > 2 servers and r285196 on the last one.

--
Homepage: www.yamagi.org
XMPP: yamagi@yamagi.org
GnuPG/GPG: 0xEFBCCBCB

From owner-freebsd-scsi@freebsd.org Tue Jul 7 16:42:49 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DFA8A996EE0 for ; Tue, 7 Jul 2015 16:42:49 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wg0-f44.google.com (mail-wg0-f44.google.com [74.125.82.44]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 7782C1BD7 for ; Tue, 7 Jul 2015 16:42:48 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by wgjx7 with SMTP id 
x7so173066878wgj.2 for ; Tue, 07 Jul 2015 09:42:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-type :content-transfer-encoding; bh=OPfJK/aoQr5oNpjlvg/f/SNciqznpgHA1PLpn3DvNGQ=; b=Z2AJ3hTnm/09A04soxjuMRc/BpFp0OWGBwY0xZHqzOFp7HekCcvD29fn32WivwJxL5 CYW7EEu5u9hMDrTxcGTuaNPEAfcRFisF7SDiu3kFZFNLJze9W0a1+CqcKU9wtxCacz9a 631nEDf78X2UobNPPusJ1Sg7qUJ85ZcP3gmxgzS2Mui6vf/HHc5gzG192q4BewQ0if8I zb5UAscjD/dyyleU2VQBv4eNl462G/N18KKoXjWJSLNzirLZlFMjVHKKnMhaVQ+TbTmC FsJQG0hW0ocBcquoIzlWBbO3Z+3tH3fHHzuYncAC9JhCPA1gMg2qYewhfuDDXiQp1kvQ c7PA== X-Gm-Message-State: ALoCoQlLqkQD9JZ0WSQtEwwF2jQ1VJIUz3ZDF3VNNWEZkmfe2LHVhdP1EZ0HWcr9UxKR46EEDPbm X-Received: by 10.194.185.8 with SMTP id ey8mr10351763wjc.118.1436287367064; Tue, 07 Jul 2015 09:42:47 -0700 (PDT) Received: from [10.10.1.68] (82-69-141-170.dsl.in-addr.zen.co.uk. [82.69.141.170]) by mx.google.com with ESMTPSA id pd7sm34212434wjb.27.2015.07.07.09.42.46 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 07 Jul 2015 09:42:46 -0700 (PDT) Subject: Re: Device timeouts(?) 
with LSI SAS3008 on mpr(4) To: freebsd-scsi@freebsd.org References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org> From: Steven Hartland Message-ID: <559C0184.4050102@multiplay.co.uk> Date: Tue, 7 Jul 2015 17:42:44 +0100 User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Thunderbird/38.0.1 MIME-Version: 1.0 In-Reply-To: <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2015 16:42:50 -0000

Have you eliminated the midplane / cabling as the issue? That's a very common cause.

On 07/07/2015 17:31, Yamagi Burmeister wrote:
> Hello Stephen,
> I'm seeing those errors on all 3 servers and on all 16 devices. The 2
> dmesg entries were just an example. It seems to be random where they
> occur. Maybe the second controller, mpr1, has a higher chance than
> mpr0, but I'm not sure.
>
> My co-worker suspected FreeBSD's power management. On one of the servers
> I forced C-states to C1 and deactivated powerd. In the last 2 hours no
> new errors arose, but it's far too early to draw conclusions.
>
> Regards,
> Yamagi
>
> On Tue, 7 Jul 2015 09:37:22 -0600
> Stephen Mcconnell wrote:
>
>> Hi Yamagi,
>>
>> I see two drives that are having problems. Are there others? Can you try
>> to remove those drives and let me know what happens. To me, it actually
>> looks like those drives could be faulty.
>>
>> Steve
>>
>>> -----Original Message-----
>>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-scsi@freebsd.org]
>>> On Behalf Of Yamagi Burmeister
>>> Sent: Tuesday, July 07, 2015 5:24 AM
>>> To: freebsd-scsi@freebsd.org
>>> Subject: Device timeouts(?) 
with LSI SAS3008 on mpr(4)

From owner-freebsd-scsi@freebsd.org Tue Jul 7 18:30:50 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 90B42994428; Tue, 7 Jul 2015 18:30:50 +0000 (UTC) (envelope-from rdarbha@juniper.net) Received: from na01-bl2-obe.outbound.protection.outlook.com (mail-bl2on0105.outbound.protection.outlook.com [65.55.169.105]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (Client CN "mail.protection.outlook.com", Issuer "MSIT Machine Auth CA 2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 010AB1F28; Tue, 7 Jul 2015 18:30:49 +0000 (UTC) (envelope-from rdarbha@juniper.net) Received: from DM2PR0501MB1150.namprd05.prod.outlook.com (10.160.245.152) by DM2PR0501MB1151.namprd05.prod.outlook.com (10.160.245.153) with Microsoft SMTP Server (TLS) id 15.1.201.16; Tue, 7 Jul 2015 18:30:41 +0000 Received: from DM2PR0501MB1150.namprd05.prod.outlook.com ([10.160.245.152]) by DM2PR0501MB1150.namprd05.prod.outlook.com ([10.160.245.152]) with mapi id 15.01.0201.000; Tue, 7 Jul 2015 18:30:41 +0000 From: Raviprakash Darbha To: "freebsd-scsi@freebsd.org" , "freebsd-geom@freebsd.org" CC: Raviprakash Darbha Subject: questions about camcontrol eject Thread-Topic: questions about camcontrol eject Thread-Index: AQHQuOMHih5nUrGIREWDx+FWI6Z8KQ== Date: Tue, 7 Jul 2015 18:30:41 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: authentication-results: freebsd.org; dkim=none (message not signed) header.d=none; 
List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2015 18:30:50 -0000

Hello All,

I have been trying for some time to get camcontrol eject working on my router with 2 drives, and I have some observations from the code.

While allocating memory for the ccb we either have a malloc option or a memory pool. In the eject case we choose the memory pool, as it's low priority.

After getting the ccb and setting the relevant fields, it is submitted to the ata_action routine, but it fails there and returns an error code.

//Code snippets from sys/cam/scsi/scsi_pass.c

        /*
         * Non-immediate CCBs need a CCB from the per-device pool
         * of CCBs, which is scheduled by the transport layer.
         * Immediate CCBs and user-supplied CCBs should just be
         * malloced.
         */
        if ((inccb->ccb_h.func_code & XPT_FC_QUEUED)
         && ((inccb->ccb_h.func_code & XPT_FC_USER_CCB) == 0)) {
                ccb = cam_periph_getccb(periph, priority);
                ccb_malloced = 0;
        } else {
                ccb = xpt_alloc_ccb_nowait();

                if (ccb != NULL)
                        xpt_setup_ccb(&ccb->ccb_h, periph->path,
                                      priority);
                ccb_malloced = 1;
        }

        if (ccb == NULL) {
                xpt_print(periph->path, "unable to allocate CCB\n");
                error = ENOMEM;
                break;
        }

        error = passsendccb(periph, ccb, inccb);

from sys/cam/ata/ata_xpt.c

        {
                struct cam_ed *device;
                u_int maxlen = 0;

                device = start_ccb->ccb_h.path->device;
                if (device->protocol == PROTO_SCSI &&
                    (device->flags & CAM_DEV_IDENTIFY_DATA_VALID)) {
                        uint16_t p =
                            device->ident_data.config & ATA_PROTO_MASK;

                        maxlen =
                            (device->ident_data.config == ATA_PROTO_CFA) ? 0 :
                            (p == ATA_PROTO_ATAPI_16) ? 16 :
                            (p == ATA_PROTO_ATAPI_12) ? 12 : 0;    ///// maxlen is still set to 0.
                }
                if (start_ccb->csio.cdb_len > maxlen) {
                        start_ccb->ccb_h.status = CAM_REQ_INVALID;
                        xpt_done(start_ccb);
                        break;    ///// hence returning from here.
                }
                xpt_action_default(start_ccb);
                break;
        }

My question is whether this code path is expected to run this way, in which case I am missing something, or whether this is a bug.

In the latter case I am assuming the ccb_hdr is not set correctly when we get the ccb from the pool, so I am considering setting it by calling xpt_setup_ccb in that case too, to get the right values in the device structure.

Any help is greatly appreciated here. Please let me know if more information is needed.

Thanks
Ravi

From owner-freebsd-scsi@freebsd.org Wed Jul 8 05:45:25 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 6BE07995651 for ; Wed, 8 Jul 2015 05:45:25 +0000 (UTC) (envelope-from lists@yamagi.org) Received: from mail1.yamagi.org (yugo.yamagi.org [212.48.122.103]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 2EE151F48 for ; Wed, 8 Jul 2015 05:45:24 +0000 (UTC) (envelope-from lists@yamagi.org) Received: from p4fed1304.dip0.t-ipconnect.de ([79.237.19.4] helo=kosei.home.yamagi.org.dhcp.yamagi.org) by mail1.yamagi.org with esmtpsa (TLSv1:DHE-RSA-AES256-SHA:256) (Exim 4.85 (FreeBSD)) (envelope-from ) id 1ZCiAY-000GCZ-6Y; Wed, 08 Jul 2015 07:45:19 +0200 Date: Wed, 8 Jul 2015 07:45:12 +0200 From: Yamagi Burmeister To: stephen.mcconnell@avagotech.com Cc: freebsd-scsi@freebsd.org Subject: Re: Device timeouts(?)
with LSI SAS3008 on mpr(4) Message-Id: <20150708074512.e676c8a9a5b7c6d56d357a02@yamagi.org> In-Reply-To: <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> X-Mailer: Sylpheed 3.4.2 (GTK+ 2.24.28; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 08 Jul 2015 05:45:25 -0000

Good morning,

it wasn't the power management. Last night the errors occurred on da6, da7 and da9. This is the same machine as yesterday:

Jul 8 05:06:21 mars kernel: (noperiph:mpr1:0:4294967295:0): SMID 83 Aborting command 0xfffffe0001a684e0
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 0a 44 98 00 00 08 00 length 4096 SMID 556 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 10 bb a8 00 00 20 00 length 16384 SMID 745 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 680 term(da7:mpr1:0:1:0): WRITE(10). CDB: 2a 00 56 1b 1c 38 00 00 08 00
Jul 8 05:06:21 mars kernel: inated ioc 804b scsi 0 state c xfer 0
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): CAM status: Command timeout
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): Retrying command
Jul 8 05:06:21 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 0a 44 98 00 00 08 00 length 4096 SMID 696 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 05:06:21 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 10 bb a8 00 00 20 00 length 16384 SMID 517 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 05:06:21 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): WRITE(10). CDB: 2a 00 56 1b 1c 38 00 00 08 00 length 4096 SMID 905 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 05:06:21 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 05:06:21 mars kernel: (da7:mpr1:0:1:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 290 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 05:06:22 mars kernel: (da7:mpr1:0:1:0): READ(10). CDB: 28 00 48 0a 44 98 00 00 08 00
Jul 8 05:06:22 mars kernel: (da7:mpr1:0:1:0): CAM status: SCSI Status Error
Jul 8 05:06:22 mars kernel: (da7:mpr1:0:1:0): SCSI status: Check Condition
Jul 8 05:06:22 mars kernel: (da7:mpr1:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 8 05:06:22 mars kernel: (da7:mpr1:0:1:0): Retrying command (per sense data)
Jul 8 06:33:26 mars kernel: (noperiph:mpr1:0:4294967295:0): SMID 84 Aborting command 0xfffffe0001a32fc0
Jul 8 06:33:27 mars kernel: (da9:mpr1:0:3:0): READ(10). CDB: 28 00 48 0f bc 90 00 00 20 00 length 16384 SMID 703 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:33:27 mars kernel: (da9:mpr1:0:3:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 719 term(da9:mpr1:0:3:0): WRITE(10). CDB: 2a 00 48 3c d0 58 00 00 10 00
Jul 8 06:33:27 mars kernel: inated ioc 804b scsi 0 state c xfer 0
Jul 8 06:33:27 mars kernel: (da9:mpr1:0:3:0): CAM status: Command timeout
Jul 8 06:33:27 mars kernel: (da9:mpr1:0:3:0): Retrying command
Jul 8 06:33:27 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 06:33:27 mars kernel: (da9:mpr1:0:3:0): READ(10). CDB: 28 00 48 0f bc 90 00 00 20 00 length 16384 SMID 851 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:33:27 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 06:33:27 mars kernel: (da9:mpr1:0:3:0): WRITE(10). CDB: 2a 00 48 3c d0 58 00 00 10 00 length 8192 SMID 576 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:33:27 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 06:33:27 mars kernel: (da9:mpr1:0:3:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 854 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:33:28 mars kernel: (da9:mpr1:0:3:0): READ(10). CDB: 28 00 48 0f bc 90 00 00 20 00
Jul 8 06:33:28 mars kernel: (da9:mpr1:0:3:0): CAM status: SCSI Status Error
Jul 8 06:33:28 mars kernel: (da9:mpr1:0:3:0): SCSI status: Check Condition
Jul 8 06:33:28 mars kernel: (da9:mpr1:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 8 06:33:28 mars kernel: (da9:mpr1:0:3:0): Retrying command (per sense data)
Jul 8 06:35:10 mars kernel: (noperiph:mpr1:0:4294967295:0): SMID 85 Aborting command 0xfffffe0001a70c10
Jul 8 06:35:10 mars kernel: (da6:mpr1:0:0:0): READ(10). CDB: 28 00 48 30 4a 40 00 00 18 00 length 12288 SMID 541 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:35:10 mars kernel: (da6:mpr1:0:0:0): WRITE(10). CDB: 2a 00 48 59 82 e8 00 00 10 00 length 8192 SMID 467 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:35:10 mars kernel: (da6:mpr1:0:0:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
Jul 8 06:35:10 mars kernel: (da6:mpr1:0:0:0): CAM status: Command timeout
Jul 8 06:35:10 mars kernel: (da6:mpr1:0:0:0): Retrying command
Jul 8 06:35:10 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 06:35:10 mars kernel: (da6:mpr1:0:0:0): READ(10). CDB: 28 00 48 30 4a 40 00 00 18 00 length 12288 SMID 870 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:35:10 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 06:35:10 mars kernel: (da6:mpr1:0:0:0): WRITE(10). CDB: 2a 00 48 59 82 e8 00 00 10 00 length 8192 SMID 478 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:35:10 mars kernel: mpr1: log_info(0x31110e00): originator(PL), code(0x11), sub_code(0x0e00)
Jul 8 06:35:10 mars kernel: (da6:mpr1:0:0:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 764 terminated ioc 804b scsi 0 state c xfer 0
Jul 8 06:35:11 mars kernel: (da6:mpr1:0:0:0): READ(10). CDB: 28 00 48 30 4a 40 00 00 18 00
Jul 8 06:35:11 mars kernel: (da6:mpr1:0:0:0): CAM status: SCSI Status Error
Jul 8 06:35:11 mars kernel: (da6:mpr1:0:0:0): SCSI status: Check Condition
Jul 8 06:35:11 mars kernel: (da6:mpr1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 8 06:35:11 mars kernel: (da6:mpr1:0:0:0): Retrying command (per sense data)

Regards,
Yamagi

On Tue, 7 Jul 2015 09:37:22 -0600
Stephen Mcconnell wrote:

> Hi Yamagi,
>
> I see two drives that are having problems. Are there others? Can you try
> to remove those drives and let me know what happens. To me, it actually
> looks like those drives could be faulty.
>
> Steve
>
> > -----Original Message-----
> > From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
> > scsi@freebsd.org] On Behalf Of Yamagi Burmeister
> > Sent: Tuesday, July 07, 2015 5:24 AM
> > To: freebsd-scsi@freebsd.org
> > Subject: Device timeouts(?)
with LSI SAS3008 on mpr(4)
> > > > Regards, > > Yamagi > > > > -- > > Homepage: www.yamagi.org > > XMPP: yamagi@yamagi.org > > GnuPG/GPG: 0xEFBCCBCB > > _______________________________________________ > > freebsd-scsi@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-scsi > > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" -- Homepage: www.yamagi.org XMPP: yamagi@yamagi.org GnuPG/GPG: 0xEFBCCBCB From owner-freebsd-scsi@freebsd.org Wed Jul 8 05:47:02 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D503399570A for ; Wed, 8 Jul 2015 05:47:02 +0000 (UTC) (envelope-from lists@yamagi.org) Received: from mail1.yamagi.org (yugo.yamagi.org [212.48.122.103]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 976F21FBF for ; Wed, 8 Jul 2015 05:47:01 +0000 (UTC) (envelope-from lists@yamagi.org) Received: from p4fed1304.dip0.t-ipconnect.de ([79.237.19.4] helo=kosei.home.yamagi.org.dhcp.yamagi.org) by mail1.yamagi.org with esmtpsa (TLSv1:DHE-RSA-AES256-SHA:256) (Exim 4.85 (FreeBSD)) (envelope-from ) id 1ZCiC9-000GEB-N7; Wed, 08 Jul 2015 07:46:59 +0200 Date: Wed, 8 Jul 2015 07:46:52 +0200 From: Yamagi Burmeister To: killing@multiplay.co.uk Cc: freebsd-scsi@freebsd.org Subject: Re: Device timeouts(?) 
with LSI SAS3008 on mpr(4) Message-Id: <20150708074652.07a815e6aa08526d569f3077@yamagi.org> In-Reply-To: <559C0184.4050102@multiplay.co.uk> References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org> <559C0184.4050102@multiplay.co.uk> X-Mailer: Sylpheed 3.4.2 (GTK+ 2.24.28; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 08 Jul 2015 05:47:02 -0000

Hello Steven,

since the issue occurs on all 3 servers it's at least unlikely. But I'll see what I can do.

Regards,
Yamagi

On Tue, 7 Jul 2015 17:42:44 +0100
Steven Hartland wrote:

> Have you eliminated the midplane / cabling as the issue as that's very
> common.
>
> On 07/07/2015 17:31, Yamagi Burmeister wrote:
> > Hello Stephen,
> > I'm seeing those errors on all 3 servers and on all 16 devices. The 2
> > dmesg entries were just an example. It seems to be random were they
> > occure. Maybe the second controller mps1 has a higher chance then
> > mps0, but I'm not sure.
> >
> > My co-worker suspected FreeBSDs power management. On on of the servers
> > I forced c-states to C1 and deactivated powerd. In the last 2 hours no
> > new errors arose but it's far too early to draw conclusions.
> >
> > Regards,
> > Yamagi
> >
> > On Tue, 7 Jul 2015 09:37:22 -0600
> > Stephen Mcconnell wrote:
> >
> >> Hi Yamagi,
> >>
> >> I see two drives that are having problems. Are there others? Can you try
> >> to remove those drives and let me know what happens. To me, it actually
> >> looks like those drives could be faulty.
> >>
> >> Steve
> >>
> >>> > >>> Regards, > >>> Yamagi > >>> > >>> -- > >>> Homepage: www.yamagi.org > >>> XMPP: yamagi@yamagi.org > >>> GnuPG/GPG: 0xEFBCCBCB > >>> _______________________________________________ > >>> freebsd-scsi@freebsd.org mailing list > >>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi > >>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" > > > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" -- Homepage: www.yamagi.org XMPP: yamagi@yamagi.org GnuPG/GPG: 0xEFBCCBCB From owner-freebsd-scsi@freebsd.org Wed Jul 8 07:35:21 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 88B5899698F for ; Wed, 8 Jul 2015 07:35:21 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wg0-f49.google.com (mail-wg0-f49.google.com [74.125.82.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 1ED371D25 for ; Wed, 8 Jul 2015 07:35:20 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by wgck11 with SMTP id k11so187741550wgc.0 for ; Wed, 08 Jul 2015 00:35:18 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:cc:from:message-id:date :user-agent:mime-version:in-reply-to:content-type :content-transfer-encoding; bh=Bv1GYo55/ZOiCYSpeqLXrv+L7UWzkM6EcxbbJKWYUbc=; b=IqMzuvy3zvf1BRy47qiyOzSKI7RwDgIvsk5dE4QF6Slu1W/Eo7mfiBSsseD10hJCiu b8cNlsQCB5ftD8vW57wBp+yYFOyP6HYknGfodI2FgedtjYbgzjobUl8uFo8u7Nv1JfT/ ovMzpvHm8n+I0egWj8LfQlo5nPBtavBcbvu1MU7peqNgoXVyg87MIQE+9mgvAlIF6ZWh 
khP3LMjE04gHV0awFz57pc2kNqPH+E3Fl5YEoZicFK2AalEvzAsmlMH/ysnxBwsciC+M abv8ZJfsOu+ux+mTbS9U5GLlZG2de0qTeUE+aZJXX1o3U/oUN7TZmCCrZTSSH2XGFMx2 Nwnw== X-Gm-Message-State: ALoCoQkQ6fHpz42Ah2N/Ga1XVaqOf7IK4Wn3Sh6jDJJZn7VImFfXaMPtsKHUOpetAGZovidwmSgX X-Received: by 10.180.188.48 with SMTP id fx16mr71531067wic.35.1436340918422; Wed, 08 Jul 2015 00:35:18 -0700 (PDT) Received: from [10.10.1.68] (82-69-141-170.dsl.in-addr.zen.co.uk. [82.69.141.170]) by smtp.gmail.com with ESMTPSA id c2sm1945437wjf.18.2015.07.08.00.35.17 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 08 Jul 2015 00:35:17 -0700 (PDT) Subject: Re: Device timeouts(?) with LSI SAS3008 on mpr(4) To: Yamagi Burmeister References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <9426ced85d7def424e106fdefd7448ae@mail.gmail.com> <20150707183135.2c3f5aa45696b55a17e2f87f@yamagi.org> <559C0184.4050102@multiplay.co.uk> <20150708074652.07a815e6aa08526d569f3077@yamagi.org> Cc: freebsd-scsi@freebsd.org From: Steven Hartland Message-ID: <559CD2B3.7000404@multiplay.co.uk> Date: Wed, 8 Jul 2015 08:35:15 +0100 User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Thunderbird/38.0.1 MIME-Version: 1.0 In-Reply-To: <20150708074652.07a815e6aa08526d569f3077@yamagi.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 08 Jul 2015 07:35:21 -0000

Actually not: it could indicate that a design problem with the midplane / backplane is the cause of the issue.

We've had a number of Supermicro and Dell chassis that, when used in combination with 6Gbps+ devices, particularly SSDs, exhibited timeouts like the ones you describe; all turned out to be backplane issues.

We proved this by connecting the drives directly to the controller with high quality cables, eliminating the hotswap backplane, after which the timeouts stopped.

This is a PITA to test, as power is supplied by the hotswap backplane, but I wouldn't recommend you look anywhere else till you've eliminated this as a potential cause.

Regards
Steve

On 08/07/2015 06:46, Yamagi Burmeister wrote:
> Hello Steven,
> since the issue occurs on all 3 servers it's at least unlikely. But
> I'll see what I can do.
>
> Regards,
> Yamagi
>
> On Tue, 7 Jul 2015 17:42:44 +0100
> Steven Hartland wrote:
>
>> Have you eliminated the midplane / cabling as the issue as that's very
>> common.
>>

From owner-freebsd-scsi@freebsd.org Wed Jul 8 13:55:34 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 66E50995D9B for ; Wed, 8 Jul 2015 13:55:34 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 52D0C1BD8 for ; Wed, 8 Jul 2015 13:55:34 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.14.9/8.14.9) with ESMTP id t68DtYLH028983 for ; Wed, 8 Jul 2015 13:55:34 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-scsi@FreeBSD.org Subject: [Bug 200883] Installing FreeBSD 10.1-RELEASE-amd64-{disk1|dvd1}.iso fails to install on Dell C6220, bootonly.iso works Date: Wed, 08 Jul 2015 13:55:34 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: misc X-Bugzilla-Version: 10.1-RELEASE X-Bugzilla-Keywords:
X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: bcr@FreeBSD.org X-Bugzilla-Status: In Progress X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 08 Jul 2015 13:55:34 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200883 --- Comment #4 from Benedict Reuschling --- I just tested it with FreeBSD-10.2-PRERELEASE-amd64-20150625-r284813-disc1.iso . The same issue as before: the viewer connection gets terminated during the installation process, ejecting the media in the process. I can reliably reproduce the issue each time. Note: I did install the same machine a couple of times with the FreeBSD-11.0-CURRENT-amd64-r283577-20150526.disc1.iso . In one of these instances, the viewer crashed as well. But this was only one instance and next time, the installer completed just fine and I couldn't reproduce the error like in 10.X. We should try to identify which MFC is missing that makes a difference between 10.X and 11-CURRENT and the behaviour I'm experiencing. -- You are receiving this mail because: You are on the CC list for the bug.