Date:      Sun, 30 Oct 2016 07:52:00 -0400
From:      Jeremy Beker <gothmog@confusticate.com>
To:        freebsd-stable@freebsd.org
Subject:   FreeBSD 11.0 and LSI SAS3081E losing all devices
Message-ID:  <FF400F3A-350A-4133-BED1-78087F1657F3@confusticate.com>

Good Morning!

Since upgrading my home server from 10.3 to 11.0-RELEASE-p1 about a week ago, I have twice hit a serious problem where my LSI adapter reports errors and drops all the drives from my ZFS pool.

Hardware:
- LSI SAS3081E-R PCI-E card with the IT firmware loaded 
- 6x2TB WD Black drives
- 1 SSD
- Supermicro X10SLL-F MB (not sure that is relevant) 

This system has been running with this exact hardware for about a year with no problems under the 10.X versions of FreeBSD. Last weekend, I upgraded the system to 11.0-RELEASE-p1. Since then, twice, all of the drives have been marked as unavailable to ZFS after generating a stream of errors.

The problems start with a number of errors like this:

Oct 26 03:28:29 rivendell kernel: mpt0: request 0xfffffe0000f73058:57643 timed out for ccb 0xfffff803456ea000 (req->ccb 0xfffff803456ea000) 
Oct 26 03:28:29 rivendell kernel: mpt0: attempting to abort req 0xfffffe0000f73058:57643 function 0 
Oct 26 03:28:29 rivendell kernel: mpt0: completing timedout/aborted req 0xfffffe0000f73058:57643 
Oct 26 03:28:29 rivendell kernel: (da0:mpt0:0:10:0): READ(10). CDB: 28 00 04 c4 91 c0 00 00 08 00 
Oct 26 03:28:29 rivendell kernel: (da0:mpt0:0:10:0): CAM status: CCB request terminated by the host 
Oct 26 03:28:29 rivendell kernel: (da0:mpt0:0:10:0): mpt0: Retrying command 
Oct 26 03:28:29 rivendell kernel: abort of req 0xfffffe0000f73058:0 completed 
Oct 26 03:28:49 rivendell kernel: mpt0: request 0xfffffe0000f6c3b0:57658 timed out for ccb 0xfffff803456ea000 (req->ccb 0xfffff803456ea000) 
Oct 26 03:28:49 rivendell kernel: mpt0: attempting to abort req 0xfffffe0000f6c3b0:57658 function 0 
Oct 26 03:28:49 rivendell kernel: mpt0: completing timedout/aborted req 0xfffffe0000f6c3b0:57658 
Oct 26 03:28:49 rivendell kernel: (da0:mpt0:0:10:0): READ(10). CDB: 28 00 04 c4 91 c0 00 00 08 00 
Oct 26 03:28:49 rivendell kernel: (da0:mpt0:0:10:0): CAM status: CCB request terminated by the host 
Oct 26 03:28:49 rivendell kernel: (da0:mpt0:0:10:0): Retrying command 
Oct 26 03:28:49 rivendell kernel: mpt0: abort of req 0xfffffe0000f6c3b0:0 completed 
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): READ(10). CDB: 28 00 04 c4 91 c0 00 00 08 00 
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): CAM status: SCSI Status Error 
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): SCSI status: Check Condition 
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) 
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): Retrying command (per sense data) 

Also these:

Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): CAM status: SCSI Status Error
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): SCSI status: Check Condition
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): Error 6, Retries exhausted
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): Invalidating pack

After a bunch of rounds of the errors above, I get this:

Oct 26 03:35:17 rivendell kernel: mpt0: request 0xfffffe0000f73350:62027 timed out for ccb 0xfffff800160ce000 (req->ccb 0xfffff800160ce000)
Oct 26 03:35:17 rivendell kernel: mpt0: attempting to abort req 0xfffffe0000f73350:62027 function 0
Oct 26 03:35:18 rivendell kernel: mpt0: mpt_wait_req(1) timed out
Oct 26 03:35:18 rivendell kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
Oct 26 03:35:18 rivendell kernel: mpt0: mpt_cam_event: 0x0
Oct 26 03:35:18 rivendell kernel: mpt0: mpt_cam_event: 0x0
Oct 26 03:35:18 rivendell kernel: mpt0: completing timedout/aborted req 0xfffffe0000f73350:62027

After which all the drives seem to disappear and the system detaches all of them:

Oct 26 03:35:33 rivendell kernel: da1 at mpt0 bus 0 scbus0 target 14 lun 0
Oct 26 03:35:33 rivendell kernel: da1: <ATA WDC WD2002FAEX-0 1D05> s/n WD-WMAY01559141 detached
Oct 26 03:35:33 rivendell kernel: da2 at mpt0 bus 0 scbus0 target 15 lun 0
Oct 26 03:35:33 rivendell kernel: da2: <ATA WDC WD2002FAEX-0 1D05> s/n WD-WMAY01603430 detached
Oct 26 03:35:33 rivendell kernel: da5 at mpt0 bus 0 scbus0 target 18 lun 0
Oct 26 03:35:33 rivendell kernel: da5: <ATA WDC WD2002FAEX-0 1D05> s/n WD-WMAY01159727 detached
Oct 26 03:35:33 rivendell kernel: da6 at mpt0 bus 0 scbus0 target 19 lun 0
Oct 26 03:35:33 rivendell kernel: da6: <ATA WDC WD2002FAEX-0 1D05> s/n WD-WMAY02971691 detached
Oct 26 03:35:33 rivendell kernel: da4 at mpt0 bus 0 scbus0 target 17 lun 0
Oct 26 03:35:33 rivendell kernel: da4: <ATA WDC WD2002FAEX-0 1D05> s/n WD-WMAY01470856 detached
Oct 26 03:35:33 rivendell kernel: da3 at mpt0 bus 0 scbus0 target 16 lun 0
Oct 26 03:35:33 rivendell kernel: da3: <ATA WDC WD2002FAEX-0 1D05> s/n WD-WMAY01602648 detached

Each time, I have had to reboot the server, after which all the drives magically reappear.
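
For anyone comparing a similar log, here is a minimal sketch that tallies the three kinds of events quoted above: controller request timeouts, per-disk CAM errors, and detached disks. It assumes the syslog line format shown in these excerpts and nothing more. The regular expressions, the `summarize` function and the `SAMPLE` list are illustrative names, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# All three patterns are assumptions derived only from the excerpts above.
TIMEOUT_RE = re.compile(r"kernel: (mpt\d+): request \S+ timed out")
CAM_ERR_RE = re.compile(r"kernel: \((da\d+):mpt\d+:[\d:]+\): CAM status: (.+?)\s*$")
DETACH_RE = re.compile(r"kernel: (da\d+): <.*> s/n (\S+) detached")


def summarize(lines):
    """Return (timeouts per controller, CAM errors per (disk, status), detached disks)."""
    timeouts = Counter()
    cam_errors = Counter()
    detached = {}
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = CAM_ERR_RE.search(line)
        if m:
            cam_errors[(m.group(1), m.group(2))] += 1
            continue
        m = DETACH_RE.search(line)
        if m:
            # Map device name to the serial number reported at detach time.
            detached[m.group(1)] = m.group(2)
    return timeouts, cam_errors, detached


# Three lines copied from the log excerpts in this message.
SAMPLE = [
    "Oct 26 03:28:29 rivendell kernel: mpt0: request 0xfffffe0000f73058:57643 "
    "timed out for ccb 0xfffff803456ea000 (req->ccb 0xfffff803456ea000) ",
    "Oct 26 03:28:29 rivendell kernel: (da0:mpt0:0:10:0): CAM status: "
    "CCB request terminated by the host ",
    "Oct 26 03:35:33 rivendell kernel: da1: <ATA WDC WD2002FAEX-0 1D05> "
    "s/n WD-WMAY01559141 detached",
]

if __name__ == "__main__":
    for name, result in zip(("timeouts", "cam_errors", "detached"), summarize(SAMPLE)):
        print(name, dict(result))
```

To run it against a full log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/log/messages"))`; that path is where these messages would normally land on a default install, but that too is an assumption about the local syslog configuration.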

Any help would be greatly appreciated.

-Jeremy

-- 
Jeremy Beker - @gothmog 
http://www.confusticate.com
Condensing fact from the vapor of nuance.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?FF400F3A-350A-4133-BED1-78087F1657F3>