Date: Sun, 19 Feb 2023 21:50:45 -0800
From: Mark Millard <marklmi@yahoo.com>
To: bob prohaska <fbsd@www.zefox.net>
Cc: freebsd-arm@freebsd.org
Subject: Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)
Message-ID: <9CEF4E7A-2F13-454F-A04A-A6C5A80FD4B7@yahoo.com>
In-Reply-To: <20230220044544.GB57936@www.zefox.net>
References: <202302192054.31JKsq7w079295@chez.mckusick.com> <3DD8EEC2-6135-42A0-A80C-F195CAAC025E@yahoo.com> <20230219222328.GA55941@www.zefox.net> <2F5B20E9-AFF8-42F6-9E1F-50BBDF4E1B79@yahoo.com> <20230220044544.GB57936@www.zefox.net>
On Feb 19, 2023, at 20:45, bob prohaska <fbsd@www.zefox.net> wrote:

> On Sun, Feb 19, 2023 at 02:35:15PM -0800, Mark Millard wrote:
>>
>> Kirk likely monitors the freebsd-fs list.
>
> I didn't notice there was such a list 8-\
>
>> Kirk likely does not monitor the freebsd-arm list.
>> None of us thought to switch to freebsd-fs at the
>> time. The only part of your context that ended up
>> to be arm specific was original buildworld crash.
>> You definitely started in an appropriate place
>> (freebsd-arm). After the crash, the rest was more
>> general relative to platforms and more specific
>> relative to file system handling (UFS support).
>>
>> I do not see any reason for any of this exchange
>> to go to any lists, given the current status.
>
> Alas, the story's not over yet 8-(
>
> After getting the disk fsck'd and booting once more,
> an attempt to buildworld using a fresh /usr/src
> and empty /usr/obj crashed again,

I'm confused. The original crash was reported to be
on a RPi2B using an armv7 kernel, or so I thought.
(The RPi3B was for later fsck_ffs activity for the
media's UFS.) This new material indicates a RPi3B
arm64 (aarch64) context for this buildworld failure.
Is it the same media as for the prior buildworld
failure?

> in I think the
> same way. This time some notes have been collected
> at
> http://www.zefox.net/~fbsd/rpi3/scsi_status_error/readme
>
> To a casual glance, it looks like a hardware error.
> But, the machine seems to work fine until it's running
> buildworld, and then crashes during a relatively easy
> part of buildworld. The initial error message is:
>
> bob@pelorus:/usr/src % (da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00
> (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
> (da0:umass-sim0:0:0:0): SCSI status: Check Condition
> (da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> (da0:umass-sim0:0:0:0): Error 5, Unretryable error

A description of "Medium Error" from Seagate is:

Medium Error - Indicates the command terminated with a nonrecovered
error condition, probably caused by a flaw in the medium or an error in
the recorded data.

To compare/contrast with other alternatives, see:

https://www.seagate.com/support/kb/scsi-sense-key-chart-196259en/

A more extensive list with asc/ascq involved as well is at:

https://en.wikipedia.org/wiki/Key_Code_Qualifier/

That allows more comparison/contrast with other classifications.
It indicates:

3 11 00 Medium Error - unrecovered read error

(matching the reported text).

> SCSI errors are not unknown, but they usually succeed on retry.
> It's not obvious why this is treated as un-retryable.

Because that is what the "3 11 00" combination involved
means. The drive is reporting that. It is not a FreeBSD
driver choice of handling. (I'm not an expert at drive
internals, so I take it at face value.)

> Are there any simple tests that might help decide what's wrong?
> It's likely that re-running buildworld will reproduce the crash.

See the https://en.wikipedia.org/wiki/Key_Code_Qualifier/
description material for some background information?

> I've placed the results of smartctl -a at the end of the notes.
> The interpretation isn't self evident, hopefully someone else
> can lend an eye. I'll try smartctl -t after a good night's sleep.

man smartctl reports:

UNC: UNCorrectable Error in Data

The 3 examples of:

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

indicate UNC. All 3 list the same LBA value.
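
As an aside on the READ(10) failure quoted earlier: the CDB bytes in
that message can be decoded by hand, which shows what the kernel asked
the drive for when the Medium Error came back. The following is only a
minimal sketch in Python, assuming the standard 10-byte READ(10)
layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte
big-endian transfer length in bytes 7-8). The names decode_read10 and
SENSE_TABLE are hypothetical, made up for this illustration, and the
table holds just the one key/asc/ascq combination discussed here.

```python
# Sketch: decode the READ(10) CDB from the quoted error message.
# Assumed layout: opcode, flags, 4-byte big-endian LBA, group number,
# 2-byte big-endian transfer length, control.

def decode_read10(cdb_hex):
    """Return (lba, block_count) for a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    count = int.from_bytes(cdb[7:9], "big")
    return lba, count

# Illustrative table: only the (sense key, asc, ascq) triple from this thread.
SENSE_TABLE = {(0x3, 0x11, 0x00): "Medium Error - unrecovered read error"}

lba, count = decode_read10("28 00 43 29 d6 40 00 00 40 00")
print(f"LBA {lba} (0x{lba:08x}), {count} blocks")
print(SENSE_TABLE[(0x3, 0x11, 0x00)])
```

Under those assumptions the failing request was a read of 64 blocks
starting at LBA 0x4329d640. The CDB alone does not say which block in
that range the drive could not read.
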
Error 4 occurred at disk power-on lifetime: 11121 hours (463 days + 9 hours)
Error 3 occurred at disk power-on lifetime: 11098 hours (462 days + 10 hours)
Error 2 occurred at disk power-on lifetime: 11096 hours (462 days + 8 hours)

So the errors are spread over a little over a day overall (25 hours),
with 2 and 3 only a couple of hours apart.

It suggests to me that the drive is no longer usable.
But I'm no expert.

===
Mark Millard
marklmi at yahoo.com
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?9CEF4E7A-2F13-454F-A04A-A6C5A80FD4B7>