Date:      Mon, 7 Apr 2025 09:07:11 -0400
From:      mike tancsa <mike@sentex.net>
To:        Andrea Venturoli <ml@netfence.it>, freebsd-questions <freebsd-questions@freebsd.org>
Subject:   Re: Sudden zpool checksums errors
Message-ID:  <4c6b64ec-0e59-4f64-8faf-117c7686a87d@sentex.net>
In-Reply-To: <032776db-a8a1-4134-a395-a59effbc4c30@netfence.it>
References:  <6aeb488d-b3c3-4393-80ca-0b89c1ebc446@netfence.it> <3ddfecf7-2cb3-472c-bfce-93356e57b898@app.fastmail.com> <032776db-a8a1-4134-a395-a59effbc4c30@netfence.it>


On 4/5/2025 5:01 AM, Andrea Venturoli wrote:
> On 4/4/25 20:59, Dave Cottlehuber wrote:
>
>
> Thanks to all.
> I'll answer here collectively.
>
>
>
>
>
>> I have had marginal power supplies, backplane issues or break out 
>> cables from the controller manifest errors like that.  I would check 
>> the power supply first, backplane next, controller 3rd.
>
> How would I go about this? How do I check these components?
> Does IPMI provide something useful?
>
ipmitool sensor shows the current sensor readings, and ipmitool sel list 
will tell you the actual errors logged.  What does smartctl -a /dev/da# 
show for the temperatures of the hard drives?  Does smartctl -x show any 
interesting log entries for the drives that threw errors vs. the ones 
that did not?
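
To compare drive temperatures quickly, something along these lines might 
help.  It is only a sketch: drive_temp is a made-up helper name, and it 
assumes smartctl -a prints either a SCSI/SAS style "Current Drive 
Temperature:" line or a SATA attribute 194 row with the raw value in 
column 10.  Other drives may report temperature differently.

```shell
# Hypothetical helper: read `smartctl -a` output on stdin and print the
# drive temperature in degrees C. Exits non-zero if no known line is found.
drive_temp() {
    awk '
        # SCSI/SAS style: "Current Drive Temperature:     34 C"
        /^Current Drive Temperature:/ { print $4; found = 1; exit }
        # SATA style: attribute 194, raw value assumed to be in column 10
        $1 == 194 && $2 ~ /^Temperature/ { print $10; found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Possible usage on FreeBSD (needs root; kern.disks lists the disk names):
#   for d in $(sysctl -n kern.disks); do
#       printf '%s: ' "$d"
#       smartctl -a "/dev/$d" | drive_temp || echo unknown
#   done
```

A drive that runs much hotter than its neighbours, or only the drives that 
threw errors running hot, would point towards cooling or that slot.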

>> - actually really bad ECC memory
>
> Any way to test?
>
memtest will help a bit.  But ECC errors typically do get logged by the 
BMC, and ipmitool sel list will typically show those.
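
If the SEL is long, a filter like the one below can pull out the 
memory-related entries.  This is a sketch under assumptions: 
sel_memory_events is a made-up name, and the match patterns (memory, ecc, 
dimm) are a guess at how a BMC labels these events, so adjust them to what 
yours actually logs.

```shell
# Hypothetical helper: read `ipmitool sel list` output on stdin and keep
# only lines that look memory-related (case-insensitive match).
# The patterns are assumptions, not a fixed ipmitool format.
sel_memory_events() {
    # `|| true` keeps the exit status at 0 when nothing matches
    grep -i -E 'memory|ecc|dimm' || true
}

# Possible usage (needs the ipmi kernel module loaded and root):
#   ipmitool sel list | sel_memory_events
```

No output just means nothing matching was logged; it does not prove the 
memory is good.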

>
>
>> does ipmitool sel list show anything btw? (kldload ipmi and pkg 
>> install ipmitool if you don't have it already) 
>
>> # ipmitool sel list
>>    1 | 05/06/24 | 18:16:23 CEST | Temperature #0xcc | Upper 
>> Non-critical going high | Asserted
>>    2 | 05/06/24 | 21:25:42 CEST | Temperature #0xcc | Upper Critical 
>> going high | Asserted
>>    3 | 05/07/24 | 15:49:00 CEST | Temperature #0xcc | Upper Critical 
>> going high | Deasserted
>>    4 | 05/07/24 | 16:00:43 CEST | Temperature #0xcc | Upper 
>> Non-critical going high | Deasserted
>>    5 | 06/13/24 | 11:54:52 CEST | Drive Slot / Bay #0x77 | Drive 
>> Present | Asserted
>>    6 | 06/13/24 | 11:55:24 CEST | Drive Slot / Bay #0x73 | Drive 
>> Present | Asserted
>>    7 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x73 | Drive 
>> Present | Deasserted
>>    8 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x77 | Drive 
>> Present | Deasserted
>
>
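
For a listing like the one quoted above, the entries that matter most are 
the critical thresholds that were actually asserted.  A rough way to pick 
those out is sketched below.  sel_critical_asserted is a made-up name, and 
it assumes the pipe-separated layout shown in the quoted output, with the 
event text in the second-to-last column and the state in the last.

```shell
# Hypothetical helper: read `ipmitool sel list` output on stdin and print
# only entries whose event text contains "Critical" and whose state is
# "Asserted". The match is case-sensitive, so "Non-critical" is skipped.
sel_critical_asserted() {
    awk -F'|' '
        NF < 3 { next }                      # skip blank or malformed lines
        {
            event = $(NF - 1); state = $NF
            gsub(/^[ \t]+|[ \t]+$/, "", event)   # trim surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", state)
        }
        state == "Asserted" && event ~ /Critical/ { print }
    '
}

# Possible usage:
#   ipmitool sel list | sel_critical_asserted
```

The timestamps of those entries can then be compared against when the 
checksum errors started showing up in zpool status.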


