Date:      Thu, 11 Apr 2019 13:57:58 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-stable@freebsd.org
Subject:   Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Message-ID:  <2c23c0de-1802-37be-323e-d390037c6a84@denninger.net>
In-Reply-To: <CACpH0MfmPzEO5BO2kFk8-F1hP9TsXEiXbfa1qxcvB8YkvAjWWw@mail.gmail.com>
References:  <f87f32f2-b8c5-75d3-4105-856d9f4752ef@denninger.net> <c96e31ad-6731-332e-5d2d-7be4889716e1@FreeBSD.org> <9a96b1b5-9337-fcae-1a2a-69d7bb24a5b3@denninger.net> <CACpH0MdLNQ_dqH%2Bto=amJbUuWprx3LYrOLO0rQi7eKw-ZcqWJw@mail.gmail.com> <1866e238-e2a1-ef4e-bee5-5a2f14e35b22@denninger.net> <3d2ad225-b223-e9db-cce8-8250571b92c9@FreeBSD.org> <2bc8a172-6168-5ba9-056c-80455eabc82b@denninger.net> <CACpH0MfmPzEO5BO2kFk8-F1hP9TsXEiXbfa1qxcvB8YkvAjWWw@mail.gmail.com>


On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger <karl@denninger.net> wrote:
>
>
>> In this specific case the adapter in question is...
>>
>> mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem
>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
>> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>>
>> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
>> his drives via dumb on-MoBo direct SATA connections.
>>
> Maybe I'm in good company.  My current setup has 8 of the disks connected
> to:
>
> mps0: <Avago Technologies (LSI) SAS2308> port 0xb000-0xb0ff mem
> 0xfe240000-0xfe24ffff,0xfe200000-0xfe23ffff irq 32 at device 0.0 on pci6
> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>
> ... just with a cable that breaks out each of the 2 connectors into 4
> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
> cache/log) connected to ports on...
>
> - ahci0: <ASMedia ASM1062 AHCI SATA controller> port
> 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
> 0xfe900000-0xfe9001ff irq 44 at device 0.0 on pci2
> - ahci2: <Marvell 88SE9230 AHCI SATA controller> port
> 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
> 0xfe610000-0xfe6107ff irq 40 at device 0.0 on pci7
> - ahci3: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port
> 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
> 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>
> ... each drive connected to a single port.
>
> I can actually reproduce this at will.  Because I have 16 drives, when one
> fails, I need to find it.  I pull the SATA cable for a drive, determine if
> it's the drive in question and, if not, reconnect it, "ONLINE" it and wait
> for the resilver to finish... usually only a minute or two.
>
> ... if I do this 4 to 6 times to find a drive (I can tell, in general,
> whether a drive is on the SAS controller or the SATA controllers... so
> I'm only ever looking among 8) ... then I "REPLACE" the problem drive.
> More often than not, a subsequent scrub will find a few problems.  In fact,
> it appears that the most recent scrub is an example:
>
> [1:7:306]dgilbert@vr:~> zpool status
>   pool: vr1
>  state: ONLINE
>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03 2019
> config:
>
>         NAME            STATE     READ WRITE CKSUM
>         vr1             ONLINE       0     0     0
>           raidz2-0      ONLINE       0     0     0
>             gpt/v1-d0   ONLINE       0     0     0
>             gpt/v1-d1   ONLINE       0     0     0
>             gpt/v1-d2   ONLINE       0     0     0
>             gpt/v1-d3   ONLINE       0     0     0
>             gpt/v1-d4   ONLINE       0     0     0
>             gpt/v1-d5   ONLINE       0     0     0
>             gpt/v1-d6   ONLINE       0     0     0
>             gpt/v1-d7   ONLINE       0     0     0
>           raidz2-2      ONLINE       0     0     0
>             gpt/v1-e0c  ONLINE       0     0     0
>             gpt/v1-e1b  ONLINE       0     0     0
>             gpt/v1-e2b  ONLINE       0     0     0
>             gpt/v1-e3b  ONLINE       0     0     0
>             gpt/v1-e4b  ONLINE       0     0     0
>             gpt/v1-e5a  ONLINE       0     0     0
>             gpt/v1-e6a  ONLINE       0     0     0
>             gpt/v1-e7c  ONLINE       0     0     0
>         logs
>           gpt/vr1log    ONLINE       0     0     0
>         cache
>           gpt/vr1cache  ONLINE       0     0     0
>
> errors: No known data errors
>
> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
> drives that I had trial-removed (and not on the one replaced).

That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
after a scrub, comes up with the checksum errors.  It does *not* flag
any errors during the resilver and the drives *not* taken offline do not
(ever) show checksum errors either.
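Since the symptom is CKSUM errors appearing only on the drive that was taken offline, one way to catch them after each scrub is to scan `zpool status` output for rows with nonzero counters. Below is a minimal Python sketch. It assumes the plain-integer column layout shown in the quoted output above. The function name `devices_with_errors` is hypothetical, and counts that `zpool` abbreviates (for example `1.2K`) are not handled.

```python
import re
from typing import Dict, Tuple

# One config row: leading indent, device name, state, then READ WRITE CKSUM.
# Assumption: the three counters are plain integers (no "K"/"M" suffixes).
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def devices_with_errors(status_text: str) -> Dict[str, Tuple[int, int, int]]:
    """Return {device: (read, write, cksum)} for every row with a nonzero counter.

    Pool, vdev and leaf rows are all checked; header, "logs"/"cache" and
    "state:" lines do not match the pattern and are skipped.
    """
    flagged: Dict[str, Tuple[int, int, int]] = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        counts = (int(m.group(3)), int(m.group(4)), int(m.group(5)))
        if any(counts):
            flagged[m.group(1)] = counts
    return flagged
```

Feed it the text of `zpool status vr1` after a scrub; an empty dict means every row reported zero errors.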

Interestingly enough you have 19.00.00.00 firmware on your card as well
-- which is what was on mine.

I have flashed my card forward to 20.00.07.00 -- we'll see if it still
does it when I do the next swap of the backup set.
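To confirm which firmware each mps(4) controller is actually running before and after a flash, the `mpsN: Firmware:` line from the boot messages can be checked. Below is a sketch in Python. It assumes the dmesg line format quoted above. The helper names are hypothetical. The 20.00.07.00 threshold is simply the version flashed here, not a confirmed fix for the checksum errors.

```python
import re
from typing import Dict, Tuple

# Assumed format, as in the quoted boot messages:
#   mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
FW = re.compile(r"^(mps\d+): Firmware: (\d+(?:\.\d+){3}), Driver: (\S+)")


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "20.00.07.00" into (20, 0, 7, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def mps_firmware(dmesg_text: str) -> Dict[str, str]:
    """Map each mps unit (e.g. "mps0") to the firmware version it reported."""
    found: Dict[str, str] = {}
    for line in dmesg_text.splitlines():
        m = FW.match(line.strip())
        if m:
            found[m.group(1)] = m.group(2)
    return found


def below(version: str, minimum: str = "20.00.07.00") -> bool:
    """True if `version` is older than `minimum` (hypothetical threshold)."""
    return parse_version(version) < parse_version(minimum)
```

Comparing tuples of integers avoids string-ordering surprises if a field ever has a different number of digits.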

-- 
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
