Date: Sat, 5 Apr 2025 11:01:15 +0200
From: Andrea Venturoli <ml@netfence.it>
To: freebsd-questions <freebsd-questions@freebsd.org>
Subject: Re: Sudden zpool checksums errors
Message-ID: <032776db-a8a1-4134-a395-a59effbc4c30@netfence.it>
In-Reply-To: <3ddfecf7-2cb3-472c-bfce-93356e57b898@app.fastmail.com>
References: <6aeb488d-b3c3-4393-80ca-0b89c1ebc446@netfence.it> <3ddfecf7-2cb3-472c-bfce-93356e57b898@app.fastmail.com>

On 4/4/25 20:59, Dave Cottlehuber wrote:

Thanks to all. I'll answer here collectively.

> I have had marginal power supplies, backplane issues or break out cables from the controller manifest errors like that. I would check the power supply first, backplane next, controller 3rd.

How would I go about this? How do I check these components?
Does IPMI provide something useful?

> If its memory, and your mainboard supports it, you'll see failures in dmesg,
> MCA ... some good examples:

No such things. Either the MB does not support it (is that possible? likely?) or it's not RAM.

> Look for SCSI or CAM errors in your logs too, disconnects.

No such thing either.

> - overclocking

No overclocking.

> - overheating on mainboard, or controller, or drives

I monitor temperature with Nagios and received no alarm.

> - actually really bad ECC memory

Any way to test?

> - drive cables that have worked loose over time

Server is quite new (not even a year old), but I can eventually check.

> External vibrations can cause problems.

This is possible, since the building is being expanded and construction of a new block is underway.
However, there are four servers which still have hard disks and only this one showed the problem.

> A slow process of upgrading firmware

I checked on the Toshiba website and found no download; I'll eventually check with the supplier.
Is there a way I can check the controller firmware version via software? I mean in FreeBSD, without rebooting? dmesg.boot doesn't say.

> does ipmitool sel list show anything btw ? (kldload ipmi and pkg install ipmitools if you dont have it already)

> # ipmitool sel list
> 1 | 05/06/24 | 18:16:23 CEST | Temperature #0xcc | Upper Non-critical going high | Asserted
> 2 | 05/06/24 | 21:25:42 CEST | Temperature #0xcc | Upper Critical going high | Asserted
> 3 | 05/07/24 | 15:49:00 CEST | Temperature #0xcc | Upper Critical going high | Deasserted
> 4 | 05/07/24 | 16:00:43 CEST | Temperature #0xcc | Upper Non-critical going high | Deasserted
> 5 | 06/13/24 | 11:54:52 CEST | Drive Slot / Bay #0x77 | Drive Present | Asserted
> 6 | 06/13/24 | 11:55:24 CEST | Drive Slot / Bay #0x73 | Drive Present | Asserted
> 7 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x73 | Drive Present | Deasserted
> 8 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x77 | Drive Present | Deasserted

The logs are from May/June 2024, but the problem I'm talking about appeared some days ago, so they are not related.

bye & Thanks
	av.
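
The "not related" conclusion above rests on comparing the SEL timestamps with the time the checksum errors appeared. A minimal sketch of that check follows, assuming the dates are in mm/dd/yy form (entry 5 has 13 in the second field, so it cannot be a month) and taking, purely as an illustration, a 30-day window before this message's Date header to stand in for "some days ago". The helper names `parse_sel` and `recent_entries` are hypothetical, not part of ipmitool.

```python
from datetime import datetime, timedelta

# SEL lines copied from the message above.
SEL_OUTPUT = """\
1 | 05/06/24 | 18:16:23 CEST | Temperature #0xcc | Upper Non-critical going high | Asserted
2 | 05/06/24 | 21:25:42 CEST | Temperature #0xcc | Upper Critical going high | Asserted
3 | 05/07/24 | 15:49:00 CEST | Temperature #0xcc | Upper Critical going high | Deasserted
4 | 05/07/24 | 16:00:43 CEST | Temperature #0xcc | Upper Non-critical going high | Deasserted
5 | 06/13/24 | 11:54:52 CEST | Drive Slot / Bay #0x77 | Drive Present | Asserted
6 | 06/13/24 | 11:55:24 CEST | Drive Slot / Bay #0x73 | Drive Present | Asserted
7 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x73 | Drive Present | Deasserted
8 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x77 | Drive Present | Deasserted
"""


def parse_sel(text):
    """Parse pipe-separated SEL lines into (timestamp, sensor, event, state)."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clock = fields[2].split()[0]  # drop the timezone suffix (CEST)
        # Assumption: dates are mm/dd/yy, times are local wall-clock time.
        when = datetime.strptime(f"{fields[1]} {clock}", "%m/%d/%y %H:%M:%S")
        entries.append((when, fields[3], fields[4], fields[5]))
    return entries


def recent_entries(entries, reference, window_days):
    """Return the entries logged within window_days before reference."""
    cutoff = reference - timedelta(days=window_days)
    return [e for e in entries if cutoff <= e[0] <= reference]


entries = parse_sel(SEL_OUTPUT)
reference = datetime(2025, 4, 5, 11, 1, 15)  # Date header of this message
recent = recent_entries(entries, reference, window_days=30)  # 30 is an assumed window
print(f"{len(entries)} SEL entries, {len(recent)} within 30 days of the report")
```

With the eight entries above, none fall inside the window, which matches the reading that the SEL holds nothing from the period when the errors started.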
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?032776db-a8a1-4134-a395-a59effbc4c30>
