Date:      Mon, 20 Feb 2023 11:47:30 +0000
From:      John F Carr <jfc@mit.edu>
To:        Mark Millard <marklmi@yahoo.com>, bob prohaska <fbsd@www.zefox.net>
Cc:        "freebsd-arm@freebsd.org" <freebsd-arm@freebsd.org>
Subject:   Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)
Message-ID:  <1DB17CD4-63B5-4FA2-ADC6-6ED817A09CCB@mit.edu>
In-Reply-To: <268392B4-58FE-49EE-9B1D-6DA632757DFA@yahoo.com>
References:  <202302192054.31JKsq7w079295@chez.mckusick.com> <3DD8EEC2-6135-42A0-A80C-F195CAAC025E@yahoo.com> <20230219222328.GA55941@www.zefox.net> <2F5B20E9-AFF8-42F6-9E1F-50BBDF4E1B79@yahoo.com> <20230220044544.GB57936@www.zefox.net> <9CEF4E7A-2F13-454F-A04A-A6C5A80FD4B7@yahoo.com> <268392B4-58FE-49EE-9B1D-6DA632757DFA@yahoo.com>



> On Feb 20, 2023, at 01:00, Mark Millard <marklmi@yahoo.com> wrote:
>
> On Feb 19, 2023, at 21:50, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On Feb 19, 2023, at 20:45, bob prohaska <fbsd@www.zefox.net> wrote:
>>
>>> On Sun, Feb 19, 2023 at 02:35:15PM -0800, Mark Millard wrote:
>>>>
>>>> Kirk likely monitors the freebsd-fs list.
>>>
>>> I didn't notice there was such a list 8-\
>>>
>>>> Kirk likely does not monitor the freebsd-arm list.
>>>> None of us thought to switch to freebsd-fs at the
>>>> time. The only part of your context that ended up
>>>> to be arm specific was original buildworld crash.
>>>> You definitely started in an appropriate place
>>>> (freebsd-arm). After the crash, the rest was more
>>>> general relative to platforms and more specific
>>>> relative to file system handling (UFS support).
>>>>
>>>> I do not see any reason for any of this exchange
>>>> to go to any lists, given the current status.
>>>
>>> Alas, the story's not over yet 8-(
>>>
>>> After getting the disk fsck'd and booting once more,
>>> an attempt to buildworld using a fresh /usr/src
>>> and empty /usr/obj crashed again,
>>
>> I'm confused. The original crash was reported to be
>> on a RPi2B using a armv7 kernel, or so I thought.
>> (The RPi3B was for later fsck_ffs activity for the
>> media's UFS.)
>>
>> This new material indicates a RPi3B arm64 (aarch64)
>> context for this buildworld failure. Is it the same
>> media as for the prior buildworld failure?
>>
>>> in I think the
>>> same way. This time some notes have been collected
>>> at
>>> http://www.zefox.net/~fbsd/rpi3/scsi_status_error/readme
>>>
>>> To a casual glance, it looks like a hardware error.
>>> But, the machine seems to work fine until it's running
>>> buildworld, and then crashes during a relatively easy
>>> part of buildworld. The initial error message is:
>>>
>>> bob@pelorus:/usr/src % (da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00
>>> (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
>>> (da0:umass-sim0:0:0:0): SCSI status: Check Condition
>>> (da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
>>> (da0:umass-sim0:0:0:0): Error 5, Unretryable error
>>
>> A description of "Media Error" from seagate is:
>>
>> Medium Error - Indicates the command terminated with a nonrecovered
>> error condition, probably caused by a flaw in the medium or an error
>> in the recorded data.
>>
>> To compare/contrast with other alternatives, see:
>>
>> https://www.seagate.com/support/kb/scsi-sense-key-chart-196259en/
>>
>> A more extensive list with asc/ascq involved as well is at:
>>
>> https://en.wikipedia.org/wiki/Key_Code_Qualifier/
>>
>> Allowing more comparison/contrast with other classifications.
>>
>> It indicates:
>>
>> 3 11 00 Medium Error - unrecovered read error
>>
>> (matching the reported text).
>>
>>> SCSI errors are not unknown, but they usually succeed on retry.
>>> It's not obvious why this is treated as un-retryable.
>>
>> Because that is what the "3 11 00" combination involved
>> means. The drive is reporting that. It is not a FreeBSD
>> driver choice of handling.
>>
>> (I'm not expert at drive internals, so I take it at face
>> value.)
>>
>>> Are there any simple tests that might help decide what's wrong?
>>> It's likely that re-running buildworld will reproduce the crash.
>>
>> See the https://en.wikipedia.org/wiki/Key_Code_Qualifier/
>> description material for some background information?
>>
>>> I've placed the results of smartctl -a at the end of the notes.
>>> The interpretation isn't self evident, hopefully someone else
>>> can lend an eye. I'll try smartctl -t after a good night's sleep.
>>
>> man smartctl reports:
>>
>>                UNC:   UNCorrectable Error in Data
>>
>> The 3 examples of:
>>
>> After command completion occurred, registers were:
>> ER ST SC SN CL CH DH
>> -- -- -- -- -- -- --
>> 40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
>>
>> indicate UNC. All 3 list the same LBA value.
>
> Turns out that the LBA value is likely garbage, given the
> size of your drive (> 128 GiBytes):

But we have an address from the SCSI command:
READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00

Decoded, that is a read starting at block 0x4329d640 with a length of
0x40 (64) blocks. If the block size is 512 bytes, that is about half a
terabyte into the disk.

This shell command should replicate the read:

# dd if=/dev/da0 of=/dev/null bs=32768 count=1 skip=17606489

The device name (if=) comes from the error message
"da0:umass-sim0:0:0:0". The block size (bs=) matches the read request in
the failed SCSI command: 64 blocks of 512 bytes, or 32768 bytes. The skip
count is 0x4329d640 (disk block) / 64 (number of disk blocks per dd
block) = 17606489.
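
For anyone who wants to check the arithmetic, here is a small sh sketch that pulls the same numbers out of the CDB bytes. It assumes 512-byte logical blocks (as above) and the standard READ(10) layout: opcode in byte 0, LBA in bytes 2-5, transfer length in bytes 7-8. The variable names are only for illustration, and the script prints the dd command rather than running it.

```sh
#!/bin/sh
# Sketch: derive the dd arguments from the failed READ(10) CDB.
# Assumption: the drive uses 512-byte logical blocks.

cdb="28 00 43 29 d6 40 00 00 40 00"
set -- $cdb                 # CDB byte N (counting from 0) is now $((N+1))

lba=$(( 0x$3$4$5$6 ))       # bytes 2-5: starting block, big-endian
len=$(( 0x$8$9 ))           # bytes 7-8: number of blocks to read
sector=512                  # assumed logical block size in bytes

bs=$(( len * sector ))      # one dd block covers the whole request
skip=$(( lba / len ))       # exact, because lba is a multiple of len
gib=$(( lba / 2097152 ))    # 2097152 blocks of 512 bytes per GiB

echo "start block $lba, $len blocks, roughly $gib GiB into the disk"
echo "dd if=/dev/da0 of=/dev/null bs=$bs count=1 skip=$skip"
```

The division for skip only works out evenly because this starting block happens to be a multiple of 64. For a request that was not aligned that way, bs=512 with skip set to the LBA and count set to the length would be the safer form.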

If you reproduce the error with dd, you can try a binary search over the
64-block range until you find the block that failed.
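
A rough sketch of that search in sh follows. It assumes 512-byte blocks, so with bs=512 the dd skip value is the disk block number itself, and it assumes the whole range has already been seen to fail. read_ok and find_bad are made-up helper names, and DEV defaults to the da0 device from the error message.

```sh
#!/bin/sh
# Sketch: narrow a failing multi-block read down to the first bad block.
# Assumptions: 512-byte blocks; the range [start, start+count) is known
# to fail as a whole; read_ok and find_bad are hypothetical helpers.

DEV=${DEV:-/dev/da0}

# Succeeds if $2 blocks starting at disk block $1 read without error.
read_ok() {
    dd if="$DEV" of=/dev/null bs=512 skip="$1" count="$2" 2>/dev/null
}

# Prints the first unreadable block in the range start=$1, count=$2.
find_bad() {
    lo=$1
    n=$2
    while [ "$n" -gt 1 ]; do
        half=$(( n / 2 ))
        if read_ok "$lo" "$half"; then
            # Lower half is clean, so the bad block is in the upper half.
            lo=$(( lo + half ))
            n=$(( n - half ))
        else
            # Lower half fails, so keep searching inside it.
            n=$half
        fi
    done
    echo "$lo"
}

# Example call for the failed request above (run as root):
#   find_bad 1126815296 64
```

This only reports the first bad block in the range. If the drive has several unreadable blocks in those 64, the rest would need a second pass starting just past the one found.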



