Date:      Fri, 04 Apr 2025 18:59:35 +0000
From:      "Dave Cottlehuber" <dch@skunkwerks.at>
To:        "Andrea Venturoli" <ml@netfence.it>
Cc:        freebsd-questions <freebsd-questions@freebsd.org>
Subject:   Re: Sudden zpool checksums errors
Message-ID:  <3ddfecf7-2cb3-472c-bfce-93356e57b898@app.fastmail.com>
In-Reply-To: <6aeb488d-b3c3-4393-80ca-0b89c1ebc446@netfence.it>


On Fri, 4 Apr 2025, at 15:42, Andrea Venturoli wrote:
> Hello.
> I'm finding it hard to believe that 7 disks out of 12 are failing or 
> just happened to misbehave all on the same day.
> BTW, SMART says they are OK.

Not saying it's not ZFS, but it's probably not ZFS... fingers crossed!

> I'm reluctant to blame RAM (since it's ECC) and power supply (as it's 
> redundant 2x800W).

If it's memory, and your mainboard supports it, you'll see failures in dmesg,
reported as MCA (Machine Check Architecture) errors. Some good examples:

https://lists.freebsd.org/pipermail/freebsd-hackers/2015-January/046878.html
https://forums.freebsd.org/threads/mca-errors.88909/
https://forums.freebsd.org/threads/solved-weird-mca-errors.94800/

> Disks are 16TB TOSHIBA MG09ACA1 connected to a MegaRAID SAS-3 3108 (of 
> course not operating as RAID and with mrsas driver).

Also look for SCSI or CAM errors in your logs, and for device disconnects.
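
A minimal sketch of that log check in Python, in case it helps. It assumes
kernel messages use the usual "MCA:" and "CAM status:" prefixes and that
disconnects show up as "detached" or "lost device"; these patterns and the
function name scan_log are illustrative assumptions, so adjust them to what
your own logs actually contain.

```python
import re
from collections import Counter

# Assumed patterns (not taken from the affected machine's logs):
# tune these to the messages your kernel and driver really emit.
PATTERNS = {
    "mca": re.compile(r"\bMCA:"),
    "cam": re.compile(r"CAM status:"),
    "disconnect": re.compile(r"\b(detached|lost device)\b", re.IGNORECASE),
}

def scan_log(lines):
    """Count how many lines match each category in PATTERNS.

    `lines` is any iterable of strings, for example an open log file.
    Returns a Counter keyed by category name.
    """
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feed it an open file such as /var/log/messages (the path is an assumption,
your syslog setup may differ) or the captured output of dmesg, then print
the resulting counts.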
 
I have seen storms of checksum errors in at least these situations:

- faulty or failing storage / SCSI controller
- insufficient power (or failing power supplies) under load
- overclocking
- overheating on mainboard, or controller, or drives
- actually really bad ECC memory
- drive cables that have worked loose over time
- over 50 disks failing within 2 days in a 200+ disk array
- all disks failing within 20 days of deployment in 24 disk chassis

Sometimes, vendors produce batches of Bad Disks - firmware bugs, physical
defects, unexpected dust inside the sealed drives. Failures are far more
correlated than you'd want to believe. External vibrations can cause
problems.

A slow process of upgrading firmware, checking each component, and reseating
all cables is the best way to deal with this.
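
To tell whether a given step actually helped, it can be useful to compare
the CKSUM column of `zpool status` before and after each change. A rough
sketch in Python, assuming the usual NAME / STATE / READ / WRITE / CKSUM
table layout; the function names are hypothetical, and counts are kept as
the strings zpool prints rather than converted to numbers.

```python
HEADER = ["NAME", "STATE", "READ", "WRITE", "CKSUM"]

def cksum_by_device(status_text):
    """Map each name in the config table of `zpool status` output
    to the raw text of its CKSUM column."""
    result = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == HEADER:
            in_table = True          # table rows start after the header
        elif not fields:
            in_table = False         # a blank line ends the table
        elif in_table and len(fields) >= 5:
            result[fields[0]] = fields[4]
    return result

def changed(before, after):
    """Return the sorted names whose CKSUM text differs between two snapshots."""
    return sorted(name for name in after if before.get(name) != after[name])
```

Save the output of `zpool status` before touching a component, run
`zpool clear` and a scrub if you like, capture it again, and pass both
texts through cksum_by_device() and changed() to see which devices are
still accumulating errors.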

A+
Dave


