From nobody Sun Feb 26 02:56:54 2023
Date: Sat, 25 Feb 2023 18:56:54 -0800
From: bob prohaska <fbsd@www.zefox.net>
To: Mark Millard <marklmi@yahoo.com>
Cc: freebsd-arm@freebsd.org
Subject: Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT
 analyzing the same file system that resulted from the 13-STABLE crash)
Message-ID: <20230226025654.GA12702@www.zefox.net>
References: <202302192054.31JKsq7w079295@chez.mckusick.com>
 <3DD8EEC2-6135-42A0-A80C-F195CAAC025E@yahoo.com>
 <20230219222328.GA55941@www.zefox.net>
 <2F5B20E9-AFF8-42F6-9E1F-50BBDF4E1B79@yahoo.com>
 <20230220044544.GB57936@www.zefox.net>
 <9CEF4E7A-2F13-454F-A04A-A6C5A80FD4B7@yahoo.com>
List-Id: Porting FreeBSD to ARM processors <freebsd-arm.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-arm
List-Help: <mailto:freebsd-arm+help@freebsd.org>
List-Post: <mailto:freebsd-arm@freebsd.org>
List-Subscribe: <mailto:freebsd-arm+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-arm+unsubscribe@freebsd.org>
Sender: owner-freebsd-arm@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <9CEF4E7A-2F13-454F-A04A-A6C5A80FD4B7@yahoo.com>

On Sun, Feb 19, 2023 at 09:50:45PM -0800, Mark Millard wrote:
> On Feb 19, 2023, at 20:45, bob prohaska <fbsd@www.zefox.net> wrote:
> 
> > 
> > To a casual glance, it looks like a hardware error.
> > But, the machine seems to work fine until it's running
> > buildworld, and then crashes during a relatively easy
> > part of buildworld. The initial error message is:
> > 
> > bob@pelorus:/usr/src % (da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00 
> > (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
> > (da0:umass-sim0:0:0:0): SCSI status: Check Condition
> > (da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> > (da0:umass-sim0:0:0:0): Error 5, Unretryable error
> 
> Seagate's description of "Medium Error" is:
> 
> Medium Error - Indicates the command terminated with a nonrecovered error condition, probably caused by a flaw in the medium or an error in the recorded data.
> 
> To compare/contrast with other alternatives, see:
> 
> https://www.seagate.com/support/kb/scsi-sense-key-chart-196259en/
> 
> A more extensive list with asc/ascq involved as well is at:
> 
> https://en.wikipedia.org/wiki/Key_Code_Qualifier/
> 
> That allows more comparison/contrast with other classifications.
> 
> It indicates:
> 
> 3 11 00 Medium Error - unrecovered read error
> 
> (matching the reported text).
> 
> > SCSI errors are not unknown, but they usually succeed on retry.
> > It's not obvious why this is treated as un-retryable. 
> 
> Because that is what the "3 11 00" combination involved
> means. The drive is reporting that. It is not a FreeBSD
> driver choice of handling.
> 
> (I'm no expert on drive internals, so I take it at face
> value.)
> 
> > Are there any simple tests that might help decide what's wrong?
> > It's likely that re-running buildworld will reproduce the crash.
> 
> See the https://en.wikipedia.org/wiki/Key_Code_Qualifier/
> description material for some background information?
> 
> > I've placed the results of smartctl -a at the end of the notes. 
> > The interpretation isn't self-evident; hopefully someone else
> > can lend an eye. I'll try smartctl -t after a good night's sleep. 
> 
> man smartctl reports:
> 
>                  UNC:   UNCorrectable Error in Data
> 
> The 3 examples of:
> 
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
> 
> indicate UNC. All 3 list the same LBA value.
> 
> Error 4 occurred at disk power-on lifetime: 11121 hours (463 days + 9 hours)
> Error 3 occurred at disk power-on lifetime: 11098 hours (462 days + 10 hours)
> Error 2 occurred at disk power-on lifetime: 11096 hours (462 days + 8 hours)
> 
> So the errors are spread over a little more than a day (25 hours)
> overall, with errors 2 and 3 only two hours apart.
> 
> It suggests to me that the drive is no longer usable.
> But I'm no expert.

You were correct. After a few re-installations the
disk failed in an obvious way, reporting some 395 errors. All the
while, SMART kept claiming the disk "passed" its self-tests.

I was baffled, since the experiments with dd failed to replicate
the error. Evidently there was more to the failure than met the eye.
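For anyone wanting to aim a read test at the exact region the drive
complained about, the READ(10) CDB quoted above can be decoded by hand.
The following is only a minimal sketch: the function name decode_read10
is made up for illustration, and it assumes the standard READ(10) layout
(opcode 0x28, a big-endian 32-bit LBA in bytes 2-5, a big-endian 16-bit
transfer length in bytes 7-8).

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Hypothetical helper, not part of any FreeBSD tool.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2..5: starting logical block address, big-endian.
    lba = int.from_bytes(cdb[2:6], "big")
    # Bytes 7..8: number of blocks to transfer, big-endian.
    blocks = int.from_bytes(cdb[7:9], "big")
    return {"lba": lba, "blocks": blocks}


# CDB copied from the console message quoted earlier in this thread.
info = decode_read10("28 00 43 29 d6 40 00 00 40 00")
print(info)
```

If the drive presents 512-byte sectors (an assumption, not something
checked here), the decoded lba and blocks values could be passed to dd
as skip= and count= with bs=512 to re-read just that range.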

Thanks for writing!

bob prohaska