Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 5 Feb 2019 10:26:24 -0600
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-stable@freebsd.org
Subject:   Re: 9211 (LSI/SAS) issues on 11.2-STABLE
Message-ID:  <9ea70420-0c06-ad9d-e8b7-f9d92fed20d8@denninger.net>
In-Reply-To: <b50c527c-e7f7-3e64-af3a-e597ec77c021@denninger.net>
References:  <7bb25f55-fa77-f67e-11f3-b2240b01e25a@denninger.net> <b50c527c-e7f7-3e64-af3a-e597ec77c021@denninger.net>

next in thread | previous in thread | raw e-mail | index | archive | help

[-- Attachment #1 --]
On 2/5/2019 09:22, Karl Denninger wrote:
> On 2/2/2019 12:02, Karl Denninger wrote:
>> I recently started having some really oddball things  happening under
>> stress.  This coincided with the machine being updated to 11.2-STABLE
>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>>
>> Specifically, I get "errors" like this:
>>
>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>> length 131072 SMID 269 Aborting command 0xfffffe0001179110
>> mps0: Sending reset from mpssas_send_abort for target ID 37
>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>> length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state
>> c xfer 0
>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state
>> c xfer 0
>> mps0: Unfreezing devq for target ID 37
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: Command timeout
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: SCSI Status Error
>> (da12:mps0:0:37:0): SCSI status: Check Condition
>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
>> reset, or bus device reset occurred)
>> (da12:mps0:0:37:0): Retrying command (per sense data)
>>
>> The "Unit Attention" implies the drive reset.  It only occurs on certain
>> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
>> on two different brands of disk across multiple firmware and capacities,
>> however, which tends to point away from a drive firmware problem.
>>
>> A look at the pool data shows /no /errors (e.g. no checksum problems,
>> etc) and a look at the disk itself (using smartctl) shows no problems
>> either -- whatever is going on here the adapter is recovering from it
>> without any data corruption or loss registered on *either end*!
>>
>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>>
>> mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem
>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
> After considerable additional work this looks increasingly like either a
> missed interrupt or a command is getting lost between the host adapter
> and the expander.
>
> I'm going to turn the driver debug level up and see if I can capture
> more information..... whatever is behind this, however, it is
> almost-certainly related to something that changed between 11.1 and
> 11.2, as I never saw these on the 11.1-STABLE build.
>
> --
> Karl Denninger
> karl@denninger.net <mailto:karl@denninger.net>
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
Pretty decent trace here -- any ideas?

mps0: timedout cm 0xfffffe00011b5020 allocated tm 0xfffffe00011812a0
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 Aborting command 0xfffffe00011b5020
mps0: Sending reset from mpssas_send_abort for target ID 37
mps0: queued timedout cm 0xfffffe00011c2760 for processing by tm
0xfffffe00011812a0
mps0: queued timedout cm 0xfffffe00011a74f0 for processing by tm
0xfffffe00011812a0
mps0: queued timedout cm 0xfffffe00011cfd50 for processing by tm
0xfffffe00011812a0
mps0: EventReply        :
        EventDataLength: 2
        AckRequired: 0
        Event: SasDiscovery (0x16)
        EventContext: 0x0
        Flags: 1<InProgress>
        ReasonCode: Discovery Started
        PhysicalPort: 0
        DiscoveryStatus: 0
mps0: (0)->(mpssas_fw_work) Working on  Event: [16]
mps0: (1)->(mpssas_fw_work) Event Free: [16]
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00
length 131072 SMID 961 completed timedout cm 0xfffffe00011cfd50 ccb
0xfffff8019458e000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 length
131072 SMID 961 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 completed timedout cm
0xfffffe00011b5(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00
01 00 00
020 ccb 0xfffff80056fb5000 during recovery ioc 8048 scsi 0 state
c(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
length 131072 SMID 798 completed timedout cm 0xfffffe00011c2760 ccb
0xfffff80054e86000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00 length
131072 SMID 798 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
length 131072 SMID 467 completed timedout cm 0xfffffe00011a74f0 ccb
0xfffff8011efe3000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 length
131072 SMID 467 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
        (xpt0:mps0:0:37:0): SMID 2 abort TaskMID 634 status 0x0 code 0x0
count 4
        (xpt0:mps0:0:37:0): SMID 2 finished recovery after aborting
TaskMID 634
mps0: Unfreezing devq for target ID 37
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: Command timeout
(da11:mps0:0:37:0): Retrying command
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
mps0: EventReply        :
        EventDataLength: 4
        AckRequired: 0
        Event: SasTopologyChangeList (0x1c)
        EventContext: 0x0
        EnclosureHandle: 0x2
        ExpanderDevHandle: 0x9
        NumPhys: 25
        NumEntries: 1
        StartPhyNum: 12
        ExpStatus: Responding (0x3)
        PhysicalPort: 0
        PHY[12].AttachedDevHandle: 0x0013
        PHY[12].LinkRate: 6.0Gbps (0xaa)
        PHY[12].PhyStatus: PHYLinkStatusChange
mps0: (1)->(mpssas_fw_work) Working on  Event: [1c]
mps0: (2)->(mpssas_fw_work) Event Free: [1c]
mps0: EventReply        :
        EventDataLength: 2
        AckRequired: 0
        Event: SasDiscovery (0x16)
        EventContext: 0x0
        Flags: 0
        ReasonCode: Discovery Complete
        PhysicalPort: 0
        DiscoveryStatus: 0
mps0: (2)->(mpssas_fw_work) Working on  Event: [16]
mps0: (3)->(mpssas_fw_work) Event Free: [16]
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: SCSI Status Error
(da11:mps0:0:37:0): SCSI status: Check Condition
(da11:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
reset, or bus device reset occurred)
(da11:mps0:0:37:0): Retrying command (per sense data)

This is what's running right now that produced that -- note no errors
reported on the two providers (the degraded state is intentional; the
third unit is off-site in a bank vault):

[karl@NewFS /var/log]$ zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub in progress since Tue Feb  5 06:57:14 2019
        840G scanned at 200M/s, 517G issued at 123M/s, 3.36T total
        0 repaired, 15.04% done, 0 days 06:44:17 to go
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0     0
            13282812295755460479  OFFLINE      0     0     0  was
/dev/gpt/backup62-2.eli

errors: No known data errors

But, here's the SMART data on the disk -- it DOES show the reset:

Before....

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              19  ---  Lifetime Power-On Resets
0x01  0x018  6     32002138534  ---  Logical Sectors Written
0x01  0x020  6       155895325  ---  Number of Write Commands
0x01  0x028  6    265457016500  ---  Logical Sectors Read
0x01  0x030  6      1132601490  ---  Number of Read Commands
0x01  0x038  6     61537515300  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            6114  ---  Spindle Motor Power-on Hours
0x03  0x010  4            6114  ---  Head Flying Hours
0x03  0x018  4           30602  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4          283735  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              17  ---  Resets Between Cmd Acceptance and
Completio
n
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              38  ---  Current Temperature
0x05  0x010  1              40  N--  Average Short Term Temperature
0x05  0x018  1              32  N--  Average Long Term Temperature
0x05  0x020  1              52  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              48  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              44  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               6  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

And after....

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              19  ---  Lifetime Power-On Resets
0x01  0x018  6     32004195070  ---  Logical Sectors Written
0x01  0x020  6       155946732  ---  Number of Write Commands
0x01  0x028  6    266340052468  ---  Logical Sectors Read
0x01  0x030  6      1136347463  ---  Number of Read Commands
0x01  0x038  6     61540814700  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            6114  ---  Spindle Motor Power-on Hours
0x03  0x010  4            6114  ---  Head Flying Hours
0x03  0x018  4           30602  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4          283848  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              17  ---  Resets Between Cmd Acceptance and
Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              41  ---  Current Temperature
0x05  0x010  1              40  N--  Average Short Term Temperature
0x05  0x018  1              32  N--  Average Long Term Temperature
0x05  0x020  1              52  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              48  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              44  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               7  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Number of Hardware Resets has incremented.  There are no other errors shown:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   137   137   054    -    104
  3 Spin_Up_Time            POS---   132   132   024    -    497
(Average 497)
  4 Start_Stop_Count        -O--C-   099   099   000    -    7074
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   098   098   000    -    17094
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    19
192 Power-Off_Retract_Count -O--CK   075   075   000    -    30602
193 Load_Cycle_Count        -O--C-   075   075   000    -    30602
194 Temperature_Celsius     -O----   146   146   000    -    41 (Min/Max
23/52)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

I'd throw possible shade at the backplane or cable /but I have already
swapped both out for spares without any change in behavior./

-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/

[-- Attachment #2 --]
0	*H
010
	`He0	*H

00H^Ōc!5
H0
	*H
010	UUS10UFlorida10U	Niceville10U
Cuda Systems LLC10UCuda Systems CA1!0UCuda Systems LLC 2017 CA0
170817164217Z
270815164217Z0{10	UUS10UFlorida10U
Cuda Systems LLC10UCuda Systems CA1%0#UCuda Systems LLC 2017 Int CA0"0
	*H
0
h-5B>[;olӴ0~͎O9}9Ye*$g!ukvʶLzN`jL>MD'7U45CB+kY`bd~b*c3Ny-78ju]9HeuέsӬDؽmgwER?&UURj'}9nWD i`XcbGz\gG=u%\Oi13ߝ4
K44pYQr]Ie/r0+eEޝݖ0C15Mݚ@JSZ(zȏNTa(25DD5.l<g[[ZarQQ%Buȴ~~`IohRbʳڟu2MS8EdFUClCMaѳ!}ș+2k/bųE,n当ꖛ\(8WV8	d]b	yXw	܊:I39
00U]^§Q\ӎ0U#0T039N0b010	UUS10UFlorida10U	Niceville10U
Cuda Systems LLC10UCuda Systems CA1!0UCuda Systems LLC 2017 CA	@Ui0U00U0
	*H
:P U!>vJnio-#ן]WyujǑR̀Q
nƇ!GѦFg\yLxgw=OPycehf[}ܷ['4ڝ\[p6\o.B&JF"ZC{;*o*mcCcLY߾`
t*S!񫶭(`]DHP5A~/NPp6=mhk밣'doA$86hm5ӚS@jެEgl
)0JG`%k35PaC?σ
׳HEt}!P㏏%*BxbQwaKG$6h¦Mve;[o-Iی&
I,Tcߎ#t wPA@l0P+KXBպT	zGv;NcI3&JĬUPNa?/%W6G۟N000k#Xd\=0
	*H
0{10	UUS10UFlorida10U
Cuda Systems LLC10UCuda Systems CA1%0#UCuda Systems LLC 2017 Int CA0
170817212120Z
220816212120Z0W10	UUS10UFlorida10U
Cuda Systems LLC10Ukarl@denninger.net0"0
	*H
0
T[I-ΆϏdn;Å@שy.us~_ZG%<MYd\gvfnsa1'6Egyjs"C [{~_KPn+<*pv#Q+H/7[-vqDV^U>f%GX)H.|l`M(Cr>е͇6#odc"YljҦln8@5SA0&ۖ"OGj?UDWZ5	dDB7k-)9Izs-JAv
J6L$Ն1SmY.Lqw*SH;EF'DĦH]MOgQQ|Mٙג2Z9y@y]}6ٽeY9Y2xˆ$T=eCǺǵbn֛{j|@LLt1[Dk5:$=	`	M00<+00.0,+0 http://ocsp.cudasystems.net:88880	U00	`HB0U0U%0++03	`HB
&$OpenSSL Generated Client Certificate0U%՞V=؁;bzQ0U#0]^§Q\ӎϡ010	UUS10UFlorida10U	Niceville10U
Cuda Systems LLC10UCuda Systems CA1!0UCuda Systems LLC 2017 CAH^Ōc!5
H0U0karl@denninger.net0
	*H
۠A0-j%--$%g2#ޡ1^>{K+uGEv1ş7Af&b&O;.;A5*U)ND2bF|\=]<sˋL!wrw٧>YMÄ3\mWR hSv!_zvl? 3_ xU%\^#O*Gk̍YI_&Fꊛ@&1n”} ͬ:{hTP3B.;bU8:Z=^Gw8!k-@xE@i,+'Iᐚ:fhztX7/(hY` O.1}a`%RW^akǂpCAufgDixUTЩ/7}%=jnVZvcF<M=
2^GKH5魉
_O4ެByʈySkw=5@h.0z>
W1000{10	UUS10UFlorida10U
Cuda Systems LLC10UCuda Systems CA1%0#UCuda Systems LLC 2017 Int CAk#Xd\=0
	`HeE0	*H
	1	*H
0	*H
	1
190205162624Z0O	*H
	1B@uNiP{	 aFC SJŮH
;]t9ErF,%AQ]0l	*H
	1_0]0	`He*0	`He0
*H
0*H
0
*H
@0+0
*H
(0	+7100{10	UUS10UFlorida10U
Cuda Systems LLC10UCuda Systems CA1%0#UCuda Systems LLC 2017 Int CAk#Xd\=0*H
	10{10	UUS10UFlorida10U
Cuda Systems LLC10UCuda Systems CA1%0#UCuda Systems LLC 2017 Int CAk#Xd\=0
	*H
&@k{D:f
0(ސԉ£s쳧PSKC-lW<qGf$4Ȝ2m^ӜC]}8!4v:^%.0&/1˗-Εd8%5]KIXs%
mA%S*e٨M+:ELcϦ=>^Uc~-u8.Kƒsh8#2AX[XAAkR֎{0ӒP1
"Bf"vϙRP	-CpWXbV-M	s*Z>]=M?ɐ	((84sqrPPb8T0#5yƃb+()H/S^}\j4J!\UumSQy8n1WFv!AW"dm]R*Ef.<DXO;o
ӇŞ-q}GqVwE
<@Aߺ2j8[p9s'f

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?9ea70420-0c06-ad9d-e8b7-f9d92fed20d8>