From owner-freebsd-current@freebsd.org  Tue Dec 12 22:19:21 2017
Return-Path: <owner-freebsd-current@freebsd.org>
Delivered-To: freebsd-current@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0AFEDE86E89
 for <freebsd-current@mailman.ysv.freebsd.org>;
 Tue, 12 Dec 2017 22:19:21 +0000 (UTC)
 (envelope-from ohartmann@walstatt.org)
Received: from mout.gmx.net (mout.gmx.net [212.227.15.15])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "mout.gmx.net", Issuer "TeleSec ServerPass DE-2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 8162675CEB;
 Tue, 12 Dec 2017 22:19:19 +0000 (UTC)
 (envelope-from ohartmann@walstatt.org)
Received: from thor.intern.walstatt.dynvpn.de ([85.182.112.82]) by
 mail.gmx.com (mrgmx002 [212.227.17.190]) with ESMTPSA (Nemesis) id
 0M4nYT-1fEuTX3iPt-00z0Ah; Tue, 12 Dec 2017 23:19:07 +0100
Date: Tue, 12 Dec 2017 23:18:31 +0100
From: "O. Hartmann" <ohartmann@walstatt.org>
To: "Rodney W. Grimes" <freebsd-rwg@pdx.rh.CN85.dnsmgr.net>
Cc: "O. Hartmann" <ohartmann@walstatt.org>, FreeBSD CURRENT
 <freebsd-current@freebsd.org>, Freddie Cash <fjwcash@gmail.com>, Alan
 Somers <asomers@freebsd.org>
Subject: Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM
 status: ATA Status Error
Message-ID: <20171212231858.294a2cb5@thor.intern.walstatt.dynvpn.de>
In-Reply-To: <201712121852.vBCIqRuZ087701@pdx.rh.CN85.dnsmgr.net>
References: <20171212192220.119ca2d3@thor.intern.walstatt.dynvpn.de>
 <201712121852.vBCIqRuZ087701@pdx.rh.CN85.dnsmgr.net>
Organization: WALSTATT
User-Agent: OutScare 3.1415926
X-Operating-System: ImNotAnOperatingSystem 3.141592527
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
 boundary="Sig_/QYT9Wra0yqX6hyEFFfRyrQZ"; protocol="application/pgp-signature"
X-Provags-ID: V03:K0:myZB9N9JH20ktU2ZApBhXaPRneJSvtUIg7Yd9z104q/XOonLeMW
 7H1eN+d9xFexlwvihp0EHldiD6VcT/abgMkDnr9NENB2Jg/7NzI8mPYVC/wpq0rApHZ7XeG
 zQ9eAf7+50yjUbMoh8KlGjCSxyHj8RQOEj2Gm9TbcduEQGhjFWlQ6B+db4SyxJqGZkDCD0/
 3tcFk6A3oBF9VM5jfP69g==
X-UI-Out-Filterresults: notjunk:1;V01:K0:MH2zaT0Ottg=:Enqb5Wvr8CV8Sx2X5vDXRI
 YpipOXa/gMqCo9tzQeC0RZH6BJOMYV8Dnri9VzrGnXnnk8DEziS0jp+u/9mfyXhkmSo8Qvth2
 oZMJth04QZxjbAFe5T3tIJBVsaC+wLhIT6HBF6P1yytSC2r/rs7KXiBJEPKpt71qv29/4NaJP
 CHn8y00PIOhM86awgSIwUwdjKpUQw9OcqiVUmvnezME8VPeKOyMlO6xOi3HXgjk0692T8NWvU
 7B/a4wZb7tk/boolVlu/tY6+aConXomITz6Vslbdbfdzu8zgs6n8FGCeEUH1pSRVdPtk+LX7A
 dFn0GH8ohiOMaO+5vwKJ2IFXwsJoi5f4nheEjzYqLRUeHWLvUAHvB83MovK8Pkq+K2qWQP0+U
 EH1KNSfCPH5A4znSFeIvLhcwd69dzmsTOLt5AGvsiX0UcvQ7C1YONhyhgp5LE5lRxCtqyfekx
 jn2YqIun+pRFG2oBXQ5qMnF6EglWyTAsxEeRsEyG5uQlmFvJvaem1MWLCqmFxKIv2/MTdT24R
 maoG1rAsUlSGPxuJtaUsdC0Q7WYj4aJTtcjcT+81NfGXAZDHuAjAPud3PIBRyMuIn8JQ9zqtF
 R9E7XmnimZBQQMnnSbfZIBiWTTXnb1xNMI0ouoxejEfapyGUlIk0xlPcpDJu3mzF73C9CBHZd
 CKABTkzsMCrbc1zT2ImiAMGAPcH8oczmigy5Y74gsPBcE/6rqLEaUj7AXZlySMWjq5xkcfx50
 sIjQXK5hs22ZI0COa8gv7u+EOCCzvu/VZ9FSCom+NSbrDn8TwlsyLzuqGrUXv6Kbeo+D2Lhal
 X3+YfWHa6DSUevXJVvPaDq07sm/0w==
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.25
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Dec 2017 22:19:21 -0000

--Sig_/QYT9Wra0yqX6hyEFFfRyrQZ
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Tue, 12 Dec 2017 10:52:27 -0800 (PST),
"Rodney W. Grimes" <freebsd-rwg@pdx.rh.CN85.dnsmgr.net> wrote:


Thank you for answering so quickly!

> > Hello,
> >
> > running CURRENT (recent r326769), I realised that smartd sends out some console
> > messages when booting the box:
> >
> > [...]
> > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently unreadable (pending) sectors
> > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Offline uncorrectable sectors
> > [...]
> >
> > Checking the drive's SMART log with smartctl (it is one of four 3TB disk drives), I
> > gather this information:
> >
> > [... smartctl -x /dev/ada6 ...]
> > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
> >   When the command that caused the error occurred, the device was active or idle.
> >
> >   After command completion occurred, registers were:
> >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >   -- -- -- == -- == == == -- -- -- -- --
> >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
> >
> >   Commands leading to the command that caused the error were:
> >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
> >   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
> >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08     23:38:12.195  READ FPDMA QUEUED
> >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08     23:38:12.195  READ FPDMA QUEUED
> >   2f 00 00 00 01 00 00 00 00 00 10 40 08     23:38:12.195  READ LOG EXT
> >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08     23:38:09.343  READ FPDMA QUEUED
> >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08     23:38:09.343  READ FPDMA QUEUED
> > [...]
> >
> > and
> >
> > [...]
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> >   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
> >   3 Spin_Up_Time            POS--K   178   170   021    -    6075
> >   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
> >   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> >   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> >   9 Power_On_Hours          -O--CK   066   066   000    -    25339
> >  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> >  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> >  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> > 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> > 194 Temperature_Celsius     -O---K   122   109   000    -    28
> > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> > 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> > 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> > 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
> >                             ||||||_ K auto-keep
> >                             |||||__ C event count
> >                             ||||___ R error rate
> >                             |||____ S speed/performance
> >                             ||_____ O updated online
> >                             |______ P prefailure warning
> >
> > [...]
>
> The data up to this point informs us that you have 1 bad sector
> on a 3TB drive; that is actually an expected event, given that the data
> error rate on this stuff is such that you're going to have these now
> and again.
>
> Given you have 1 single event I would not suspect that this drive
> is dying, but it would be prudent to prepare for that possibility.

Hello.

Well, I simply copied just "one single event" out of those logged so far.

As you (and I) can see, it is error #42. After I posted here, a reboot took place
because the estimated time of the "repair" process on the pool suddenly increased, and
now I am at error #47. Interestingly, it is a new block that is damaged, and the SMART
attribute fields show this for now:

[...]
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    69
  3 Spin_Up_Time            POS--K   178   170   021    -    6075
  4 Start_Stop_Count        -O--CK   098   098   000    -    2406
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   066   066   000    -    25343
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
194 Temperature_Celsius     -O---K   122   109   000    -    28
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   200   200   000    -    1
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
[...]


Attribute 197 (Current_Pending_Sector) has dropped back to zero for now, but with every
reboot the error count seems to increase:


[...]
Error 47 [22] occurred at disk power-on lifetime: 25343 hours (1055 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 b0 00 d0 00 00 c2 19 da 28 40 08  1d+07:12:34.336  READ FPDMA QUEUED
  60 00 b0 00 c8 00 00 c2 19 d9 78 40 08  1d+07:12:34.336  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08  1d+07:12:34.336  READ LOG EXT
  60 00 b0 00 b8 00 00 c2 19 da 28 40 08  1d+07:12:31.484  READ FPDMA QUEUED
  60 00 b0 00 b0 00 00 c2 19 d9 78 40 08  1d+07:12:31.483  READ FPDMA QUEUED
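
As a cross-check on the two UNC records, here is a minimal sketch that converts the
reported hex LBAs to decimal and to byte offsets. The 512-byte logical sector size is an
assumption (the smartctl excerpts do not state it), and the helper names are made up for
illustration only.

```sh
#!/bin/sh
# LBAs reported by the two UNC errors quoted in this thread.
lba_err42=0xc27a7298
lba_err47=0xc219d988

# Assumption: 512-byte logical sectors (not stated in the smartctl excerpts).
sector_bytes=512

# Convert a hex LBA to decimal using POSIX shell arithmetic.
lba_to_dec() {
    echo "$(( $1 ))"
}

# Byte offset of an LBA on the raw device, under the sector-size assumption.
lba_to_bytes() {
    echo "$(( $1 * $sector_bytes ))"
}

echo "error 42: LBA $(lba_to_dec "$lba_err42"), byte offset $(lba_to_bytes "$lba_err42")"
echo "error 47: LBA $(lba_to_dec "$lba_err47"), byte offset $(lba_to_bytes "$lba_err47")"
echo "distance: $(( $lba_err42 - $lba_err47 )) sectors"
```

The decimal values agree with what smartctl printed (3262804632 and 3256473992). The two
sectors lie 6330640 sectors apart, which is about 3 GiB under the 512-byte assumption.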


I think this is watching an HDD die, isn't it?

I'd say broken cabling would produce different errors, wouldn't it?

The Western Digital Green series HDD is a useful fellow when used as a single drive.
I think there might be an issue with combining 4 HDDs, 3 of them "GREEN", in a RAIDZ
and having them physically sit next to each other. Maybe it is time to replace them
one by one ...
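
To make the change between the two attribute tables easier to see, here is a rough sketch
that diffs the raw values of a few attributes. The numbers are copied by hand from the two
tables in this thread; the function name smart_diff and the three-column input layout are
my own invention, not smartctl output. Attribute 199 (UDMA_CRC_Error_Count) is included
because, as far as I know, it is the one usually read as a hint at cabling trouble.

```sh
#!/bin/sh
# Raw values copied from the two attribute tables in this thread
# (ID, name, raw value); only a hand-picked subset is listed.
before='1 Raw_Read_Error_Rate 64
5 Reallocated_Sector_Ct 0
197 Current_Pending_Sector 1
198 Offline_Uncorrectable 1
199 UDMA_CRC_Error_Count 0'

after='1 Raw_Read_Error_Rate 69
5 Reallocated_Sector_Ct 0
197 Current_Pending_Sector 0
198 Offline_Uncorrectable 1
199 UDMA_CRC_Error_Count 0'

# Print every attribute whose raw value changed between the two snapshots.
# The snapshots are fed to awk separated by a "---" marker line.
smart_diff() {
    printf '%s\n---\n%s\n' "$1" "$2" | awk '
        $1 == "---" { second = 1; next }
        !second     { old[$1] = $3; next }
        ($1 in old) && old[$1] != $3 {
            printf "%s %s: %s -> %s\n", $1, $2, old[$1], $3
        }'
}

smart_diff "$before" "$after"
```

With these values only attributes 1 and 197 change; attribute 199 stays at 0 in both
snapshots, which would fit my guess that the cabling is not the culprit.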

>
>
> >
> > The ZFS pool is RAIDZ1, comprised of 3 WD Green 3TB HDD and one WD RED 3 TB HDD. The
> > failure occurred on one of the WD Green 3 TB HDD.
> Ok, so the data is redundantly protected.  This helps a lot.
>
> > The pool is marked as "resilvered" - I do scrubbing on a regular basis and the
> > "resilvering" message has now appeared for the second time in a row. Searching the net
> > on SMART attribute 197 errors (in my case the count is one) suggests that, in
> > combination with the problems that occurred, I should replace the disk.
>
> It is probably putting the RAIDZ in that state as the scrub is finding a block
> it can not read.
>
> >
> > Well, here comes the problem. The box is built from "electronic waste" made by
> > ASRock - it is a Socket 1150/IvyBridge board, which got its last firmware/BIOS update
> > in 2013, and since then UEFI booting FreeBSD from a HDD isn't possible (just to
> > indicate that I'm aware of having issues with crap, but that is some other issue
> > right now). The board's SATA connectors are all populated.
> >
> > So: Due to the lack of adequate backup space I can only selectively back up portions,
> > most of the space is occupied by scientific modelling data, which I had worked on. So
> > backup exists! In one way or the other. My concern is how to replace the faulty HDD!
> > Most HowTo's indicate a replacement disk being prepared and then "replaced" via ZFS's
> > replace command. This isn't applicable here.
> >
> > Question: is it possible to simply pull the faulty disk (implies I know exactly which
> > one to pull!) and then prepare and add the replacement HDD and let the system do its
> > job resilvering the pool?
>
> That may work, but I think I have a simpler solution.
>
> >
> > Next question is: I'm about to replace the 3 TB HDD with a more recent and modern 4 TB
> > HDD (WD RED 4TB). I'm aware of the fact that I can only use 3 TB as the other disks
> > are 3 TB, but I'd like to know whether FreeBSD's ZFS is capable of handling it?
>
> Someone else?
>
> >
> > This is the first time I have issues with ZFS and a faulty drive, so if some of my
> > questions sound naive, please forgive me.
>
> One thing to try is to see if we can get the drive to fix itself; first order
> of business is: can you take this server out of service?  If so I would
> simply try to do a
> repeat 100 dd if=/dev/whicheverhdisbad of=/dev/null conv=noerror,sync iseek=3262804632
>
> That is trying to read that block 100 times; if it is successful even 1 time,
> SMART should remap the block and you are all done.

Given that the erroneous block is something of a moving target, is this solution
still the preferred one? I'll try it, but I already have the replacement 4 TB HDD at hand.
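
Since the failing LBA has moved, a read test would have to cover both sectors seen so far.
Below is a minimal dry-run sketch that only prints the dd commands instead of running
them. It assumes 512-byte logical sectors and /dev/ada6 as the affected disk (the device
named in the smartd messages); read_sector_cmd is a made-up helper name. I use the POSIX
spelling skip=, for which FreeBSD's dd documents iseek= as a synonym, and add count=1,
because without a count dd keeps reading from that block to the end of the device.

```sh
#!/bin/sh
# Build (but do not run) a dd command that reads exactly one sector.
# Assumptions: 512-byte logical sectors; the device path is passed in by the caller.
read_sector_cmd() {
    dev=$1
    lba=$2
    echo "dd if=$dev of=/dev/null bs=512 skip=$lba count=1"
}

# Both LBAs that have shown up in the SMART error log so far.
for lba in 3262804632 3256473992; do
    read_sector_cmd /dev/ada6 "$lba"
done
```

Printing the commands first lets me double-check device name and offsets before anything
touches the disk.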

>
> If that fails we can try to zero the block. There is a risk here, but raidz should just
> handle this as a data corruption of a block.  This could possibly lead to data loss,
> so USE AT YOUR OWN RISK ASSESSMENT.
> dd if=/dev/zero of=/dev/whateverdrivehasissues bs=512 count=1 oseek=3262804632

It would then be oseek=3256473992, too.

>
> That should forcibly overwrite the bad block with 0's; the SMART firmware
> will see this in the pending list, write the data, read it back, if successful
> remove it from the pending list, if failed reallocate the block and write
> the 0's to the reallocation and add 1 to the remapped block count.
>
> You might google for "how to fix a pending reallocation"
>
> > Thanks in advance,
> > Oliver
> > --
> > O. Hartmann
>

Kind regards,

Oliver


--
O. Hartmann

I object to the use or transmission of my data for advertising purposes
or for market or opinion research (§ 28 Abs. 4 BDSG).

--Sig_/QYT9Wra0yqX6hyEFFfRyrQZ
Content-Type: application/pgp-signature
Content-Description: OpenPGP digital signature

-----BEGIN PGP SIGNATURE-----

iLUEARMKAB0WIQQZVZMzAtwC2T/86TrS528fyFhYlAUCWjBV0gAKCRDS528fyFhY
lOZDAf0fajjJMeGcTKvmMoTlc8AoxCH1Sh8FOWqMdhgMplaEctPUYNd0KeXfoXX3
Pah6/Rs1N1Ypzj7BM/wybCGSbrf4Af9asahtCJ66Qnr+HydE4hby6aJcLLehGpqc
URxnHDu0kanXce95f2+/1q7smr0Mdf28dVj0qccN5vIkGEn4Nkh3
=Irs3
-----END PGP SIGNATURE-----

--Sig_/QYT9Wra0yqX6hyEFFfRyrQZ--