Date: Tue, 5 Feb 2019 09:22:20 -0600
From: Karl Denninger <karl@denninger.net>
To: freebsd-stable@freebsd.org
Subject: Re: 9211 (LSI/SAS) issues on 11.2-STABLE
Message-ID: <b50c527c-e7f7-3e64-af3a-e597ec77c021@denninger.net>
In-Reply-To: <7bb25f55-fa77-f67e-11f3-b2240b01e25a@denninger.net>
References: <7bb25f55-fa77-f67e-11f3-b2240b01e25a@denninger.net>
On 2/2/2019 12:02, Karl Denninger wrote:
> I recently started having some really oddball things happening under
> stress. This coincided with the machine being updated to 11.2-STABLE
> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>
> Specifically, I get "errors" like this:
>
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> length 131072 SMID 269 Aborting command 0xfffffe0001179110
> mps0: Sending reset from mpssas_send_abort for target ID 37
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state
> c xfer 0
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state
> c xfer 0
> mps0: Unfreezing devq for target ID 37
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: Command timeout
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: SCSI Status Error
> (da12:mps0:0:37:0): SCSI status: Check Condition
> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
> reset, or bus device reset occurred)
> (da12:mps0:0:37:0): Retrying command (per sense data)
>
> The "Unit Attention" implies the drive reset. It only occurs on certain
> drives under very heavy load (e.g. a scrub.) I've managed to provoke it
> on two different brands of disk across multiple firmware and capacities,
> however, which tends to point away from a drive firmware problem.
>
> A look at the pool data shows /no/ errors (e.g. no checksum problems,
> etc) and a look at the disk itself (using smartctl) shows no problems
> either -- whatever is going on here the adapter is recovering from it
> without any data corruption or loss registered on *either end*!
>
> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>
> mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
After considerable additional work, this looks increasingly like either a
missed interrupt or a command getting lost between the host adapter
and the expander.

I'm going to turn the driver debug level up and see if I can capture
more information. Whatever is behind this, however, it is almost
certainly related to something that changed between 11.1 and 11.2, as I
never saw these errors on the 11.1-STABLE build.
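
For anyone following the quoted log lines, a rough sketch of how the
READ(10) CDBs decode may help. It assumes the standard READ(10) layout
(opcode 0x28, 32-bit big-endian LBA in bytes 2-5, 16-bit transfer
length in bytes 7-8) and a 512-byte logical block size. The block size
is an assumption, though it is consistent with the "length 131072"
reported by the driver. The function name decode_read10 is purely
illustrative and not part of any driver or tool.

```python
def decode_read10(cdb_hex, block_size=512):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    block_size is an assumption (512-byte logical blocks).
    Returns (lba, blocks, byte_count).
    """
    cdb = bytes.fromhex(cdb_hex)  # fromhex ignores the spaces
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks, blocks * block_size


# The three CDBs that appear in the quoted log.
for cdb in ("28 00 af 82 ba 08 00 01 00 00",
            "28 00 af 82 bb 08 00 01 00 00",
            "28 00 af 82 bc 08 00 01 00 00"):
    lba, blocks, nbytes = decode_read10(cdb)
    print(f"LBA {lba:#010x}  blocks {blocks}  bytes {nbytes}")
```

Under those assumptions each command is a 256-block (128 KiB) read, and
the three LBAs are 0x100 blocks apart, i.e. three adjacent reads from
the same region of the disk.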
--
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
