Date:      Wed, 8 May 2019 12:31:54 -0400 (EDT)
From:      Walter Cramer <wfc@mintsol.com>
To:        Paul Mather <paul@gromit.dlib.vt.edu>
Cc:        Michelle Sullivan <michelle@sorbs.net>, freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: ZFS...
Message-ID:  <20190508104026.C58567@mulder.mintsol.com>
In-Reply-To: <453BCBAC-A992-4E7D-B2F8-959B5C33510E@gromit.dlib.vt.edu>
References:  <30506b3d-64fb-b327-94ae-d9da522f3a48@sorbs.net> <70fac2fe3f23f85dd442d93ffea368e1@ultra-secure.de> <70C87D93-D1F9-458E-9723-19F9777E6F12@sorbs.net> <CAGMYy3tYqvrKgk2c==WTwrH03uTN1xQifPRNxXccMsRE1spaRA@mail.gmail.com> <5ED8BADE-7B2C-4B73-93BC-70739911C5E3@sorbs.net> <d0118f7e-7cfc-8bf1-308c-823bce088039@denninger.net> <2e4941bf-999a-7f16-f4fe-1a520f2187c0@sorbs.net> <20190430102024.E84286@mulder.mintsol.com> <41FA461B-40AE-4D34-B280-214B5C5868B5@punkt.de> <20190506080804.Y87441@mulder.mintsol.com> <08E46EBF-154F-4670-B411-482DCE6F395D@sorbs.net> <33D7EFC4-5C15-4FE0-970B-E6034EF80BEF@gromit.dlib.vt.edu> <A535026E-F9F6-4BBA-8287-87EFD02CF207@sorbs.net> <26B407D8-3EED-47CA-81F6-A706CF424567@gromit.dlib.vt.edu> <42ba468a-2f87-453c-0c54-32edc98e83b8@sorbs.net> <4A485B46-1C3F-4EE0-8193-ADEB88F322E8@gromit.dlib.vt.edu> <14ed4197-7af7-f049-2834-1ae6aa3b2ae3@sorbs.net> <453BCBAC-A992-4E7D-B2F8-959B5C33510E@gromit.dlib.vt.edu>

On Wed, 8 May 2019, Paul Mather wrote:

> On May 8, 2019, at 9:59 AM, Michelle Sullivan <michelle@sorbs.net> wrote:
>
>> Paul Mather wrote:
>>>> due to lack of space.  Interestingly, I have had another drive die in the 
>>>> array - and it doesn't just have one or two bad sectors, it has a *lot* - 
>>>> which was not noticed by the original machine.  I moved the drive to a 
>>>> byte copier, which is where it's reporting hundreds of sectors damaged... 
>>>> could this be compounded by the zfs/mfi driver/HBA not picking up errors 
>>>> like it should?
>>> 
>>> 
>>> Did you have regular pool scrubs enabled?  It would have picked up silent 
>>> data corruption like this.  It does for me.
>> Yes, every month (once a month because (1) the data doesn't change much 
>> (new data is added, old data is not touched), and (2) a complete scrub 
>> took 2 weeks.)
>
>
> Do you also run sysutils/smartmontools to monitor S.M.A.R.T. attributes? 
> Although imperfect, it can sometimes signal trouble brewing with a drive 
> (e.g., increasing Reallocated_Sector_Ct and Current_Pending_Sector counts) 
> that can lead to proactive remediation before catastrophe strikes.
>
> Unless you have been gathering periodic drive metrics, you have no way of 
> knowing whether these hundreds of bad sectors have happened suddenly or 
> slowly over a period of time.
>

+1
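
The counters Paul mentions are easy to spot-check by hand (the device name 
below is just an example; drives behind a RAID controller may need 
smartctl's -d passthrough option):

    smartctl -A /dev/ada0 | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'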

Use `smartctl` from a cron script to do regular (say, weekly) *long* 
self-tests of hard drives, and also log (say, daily) all the SMART 
information from each drive.  Then if a drive fails, you can at least check 
the logs to see whether SMART noticed symptoms beforehand, and (if so) 
whether any other drives are showing symptoms too.  Or go further with a 
slightly longer script that watches the logs for symptoms and alerts you.
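
Roughly, as /etc/crontab entries (the device list, schedule, and log paths 
are just placeholders to adjust for your setup):

    # Weekly long self-test, Sundays at 03:00
    0  3  *  *  0  root  for d in ada0 ada1 ada2; do smartctl -t long /dev/$d > /dev/null; done
    # Daily snapshot of all SMART info from each drive, 02:00
    0  2  *  *  *  root  for d in ada0 ada1 ada2; do smartctl -a /dev/$d >> /var/log/smart-$d.log; done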

(My experience is that SMART's *long* self-test checks the entire disk for 
read errors, without either downside of `zpool scrub`: it does a fast, 
sequential read of the whole drive, including free space.  That makes it a 
nice test for failing disk hardware, though not a replacement for `zpool 
scrub`.)
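
For reference, starting and checking a long test looks like this (device and 
pool names are placeholders):

    smartctl -t long /dev/ada0       # drive runs the test internally; takes hours
    smartctl -l selftest /dev/ada0   # self-test log: completion status, first failing LBA if any
    zpool scrub tank                 # a scrub, by contrast, verifies checksums of allocated blocks only
    zpool status -v tank             # scrub progress and any errors found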

> Cheers,
>
> Paul.
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"


