Date: Sat, 25 Feb 2023 18:56:54 -0800
From: bob prohaska
To: Mark Millard
Cc: freebsd-arm@freebsd.org
Subject: Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)
Message-ID: <20230226025654.GA12702@www.zefox.net>
In-Reply-To: <9CEF4E7A-2F13-454F-A04A-A6C5A80FD4B7@yahoo.com>
List-Id: Porting FreeBSD to ARM processors
List-Archive: https://lists.freebsd.org/archives/freebsd-arm

On Sun, Feb 19, 2023 at 09:50:45PM -0800, Mark Millard wrote:
> On Feb 19, 2023, at 20:45, bob prohaska wrote:
>
> >
> > To a casual glance, it looks like a hardware error.
> > But, the machine seems to work fine until it's running
> > buildworld, and then crashes during a relatively easy
> > part of buildworld. The initial error message is:
> >
> > bob@pelorus:/usr/src % (da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00
> > (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
> > (da0:umass-sim0:0:0:0): SCSI status: Check Condition
> > (da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> > (da0:umass-sim0:0:0:0): Error 5, Unretryable error
>
> A description of "Media Error" from seagate is:
>
> Medium Error - Indicates the command terminated with a nonrecovered
> error condition, probably caused by a flaw in the medium or an error
> in the recorded data.
>
> To compare/contrast with other alternatives, see:
>
> https://www.seagate.com/support/kb/scsi-sense-key-chart-196259en/
>
> A more extensive list with asc/ascq involved as well is at:
>
> https://en.wikipedia.org/wiki/Key_Code_Qualifier/
>
> Allowing more comparison/contrast with other classifications.
>
> It indicates:
>
> 3 11 00 Medium Error - unrecovered read error
>
> (matching the reported text).
>
> > SCSI errors are not unknown, but they usually succeed on retry.
> > It's not obvious why this is treated as un-retryable.
>
> Because that is what the "3 11 00" combination involved
> means. The drive is reporting that. It is not a FreeBSD
> driver choice of handling.
>
> (I'm not expert at drive internals, so I take it at face
> value.)
>
> > Are there any simple tests that might help decide what's wrong?
> > It's likely that re-running buildworld will reproduce the crash.
>
> See the https://en.wikipedia.org/wiki/Key_Code_Qualifier/
> description material for some background information?
>
> > I've placed the results of smartctl -a at the end of the notes.
> > The interpretation isn't self evident, hopefully someone else
> > can lend an eye. I'll try smartctl -t after a good night's sleep.
>
> man smartctl reports:
>
> UNC: UNCorrectable Error in Data
>
> The 3 examples of:
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
>
> indicate UNC. All 3 list the same LBA value.
>
> Error 4 occurred at disk power-on lifetime: 11121 hours (463 days + 9 hours)
> Error 3 occurred at disk power-on lifetime: 11098 hours (462 days + 10 hours)
> Error 2 occurred at disk power-on lifetime: 11096 hours (462 days + 8 hours)
>
> So spread over a little over a day overall, with 2 and 3
> spread over a couple of hours.
>
> It suggests to me that the drive is no longer usable.
> But I'm no expert.

You were correct. After a few re-installations the disk failed
in an obvious way, reporting 395-odd errors. All the while, SMART
seemed to claim the disk "passed" its self-tests. I was baffled,
since the experiments with dd failed to replicate the error.
Evidently there was more to the failure than met the eye.

Thanks for writing!

bob prohaska
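
For anyone wanting to check the figures quoted in this thread, the following is a rough Python sketch, not a diagnostic tool. It assumes the standard READ(10) CDB layout (opcode 0x28 in byte 0, a big-endian logical block address in bytes 2-5, a big-endian transfer length in bytes 7-8). The function names decode_read10 and split_hours are made up for illustration and do not come from FreeBSD or smartmontools.

```python
def decode_read10(cdb_hex):
    """Decode the LBA and block count from a READ(10) CDB given as hex text.

    Assumed layout: byte 0 opcode (0x28), bytes 2-5 LBA (big-endian),
    bytes 7-8 transfer length in blocks (big-endian).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


def split_hours(hours):
    """Split a power-on hour count into (days, remaining hours)."""
    return divmod(hours, 24)


if __name__ == "__main__":
    # CDB copied from the console message quoted above.
    print(decode_read10("28 00 43 29 d6 40 00 00 40 00"))

    # Power-on lifetimes of SMART errors 2, 3 and 4 from the quoted log.
    for hours in (11096, 11098, 11121):
        print(hours, split_hours(hours))

    # The LBA reported in all three SMART error entries.
    print(hex(268435455), 268435455 == 2**28 - 1)
```

Under those assumptions the failed read started at LBA 0x4329d640 (1126815296) and covered 64 blocks, which is a different address from the 0x0fffffff shown in the SMART error log. That value equals 2^28 - 1, the largest number that fits in 28 bits. The day and hour splits in the log agree with the arithmetic, and errors 2 and 4 are 25 hours apart.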