From: Frank Leonhardt
Date: Mon, 17 Feb 2025 20:52:11 +0000
To: freebsd-questions@freebsd.org
Subject: Detecting failing drives - ZFS carries on regardless

I've been investigating what the current ZFS on 14.2 does with failing drives. It's a bit worrying.

ZFS doesn’t "fault" a drive until it's taken offline by the OS. So if you've got a flaky drive you have to wait for FreeBSD to disconnect it, and only then will ZFS notice. At least that's how I understand it.

I used to test ZFS by pulling drives, but now I have a collection of flaky drives (data centre discards), and it turns out that ZFS will wait a very long time for a SAS drive to complete an operation. If the operation still fails after retries, FreeBSD logs a CAM error but ZFS still doesn't fail the drive. You can have a SAS drive rattling and groaning away, but FreeBSD patiently waits for it to complete, whether by relocating the block or by multiple retries, and ZFS is none the wiser. Or maybe ZFS is relocating the block after the CAM error. Either way, ZFS says the drive is "ONLINE" and carries on using it. Yikes!

************

So my question is this: Is there a way of telling FreeBSD to fail a drive at the first sign of trouble? Or better yet, if it's had more than one operation take more than ten seconds in the last hour?

************

If anyone else is interested in sharing research please get in touch.

Incidentally, smartmontools doesn't show failing drives unless an operation actually fails. I've found nothing using camcontrol. If you use a stethoscope on the drive (one of my favourite tricks) it's obvious it's not happy, but FreeBSD won't offline it until it catches fire. In fact I suspect it would need to explode before it noticed.

Thanks, Frank.
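
P.S. To pin down the rule I have in mind ("more than one operation taking more than ten seconds in the last hour"), here is a minimal sketch in Python. It is only the bookkeeping, not a working monitor: the class name SlowOpTracker and its parameters are made up for illustration, it doesn't hook into CAM or ZFS, and something else would have to feed it per-operation timings. It also assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlowOpTracker:
    """Hypothetical helper: remembers slow I/O completions for one drive."""

    def __init__(self, slow_secs=10.0, window_secs=3600.0, max_slow=1):
        self.slow_secs = slow_secs      # an operation slower than this counts as "slow"
        self.window_secs = window_secs  # how far back to look (one hour)
        self.max_slow = max_slow        # tolerate this many slow operations in the window
        self._slow_times = deque()      # timestamps of slow operations, oldest first

    def record(self, duration_secs, now):
        """Note one completed operation; only the slow ones are kept."""
        if duration_secs > self.slow_secs:
            self._slow_times.append(now)

    def should_fail(self, now):
        """True if more than max_slow slow operations fall inside the window."""
        cutoff = now - self.window_secs
        # Drop slow operations that have aged out of the window.
        while self._slow_times and self._slow_times[0] <= cutoff:
            self._slow_times.popleft()
        return len(self._slow_times) > self.max_slow
```

What to do when should_fail() returns True (offline the drive, raise an alert, and so on) is deliberately left out; that is the part I'm asking about.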