Date:      Mon, 22 Jul 2024 20:49:08 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Michal Meloun <meloun.michal@gmail.com>
Cc:        mmel@freebsd.org, FreeBSD Current <freebsd-current@freebsd.org>, "freebsd-arm@freebsd.org" <freebsd-arm@freebsd.org>, "kib@freebsd.org >> Konstantin Belousov" <kib@freebsd.org>
Subject:   Re: armv7-on-aarch64 stuck at urdlck
Message-ID:  <0DD19771-3AAB-469E-981B-1203F1C28233@yahoo.com>
In-Reply-To: <33251aa3-681f-4d17-afe9-953490afeaf0@gmail.com>
References:  <724db42b-5550-4381-8277-2971e6b3e8f1@freebsd.org> <B5E2275D-21F0-43C8-AF06-A45DB7448D66@yahoo.com> <86185657-e521-466b-89e2-f291aaac10a6@freebsd.org> <0EF18174-8735-46A4-BD71-FFA3472B319F@yahoo.com> <a1b978fe-ff54-4112-860c-b09500d89d0b@freebsd.org> <C0B42CBB-8F12-4597-A04B-26F2107E176E@yahoo.com> <33251aa3-681f-4d17-afe9-953490afeaf0@gmail.com>

On Jul 22, 2024, at 12:36, Michal Meloun <meloun.michal@gmail.com> wrote:
> On 22. 7. 2024 19:27, Mark Millard wrote:
>> On Jul 22, 2024, at 09:41, meloun.michal@gmail.com wrote:
>>
>>
>>> On 22.07.2024 18:26, Mark Millard wrote:
>>>
>>>> On Jul 22, 2024, at 06:40, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>
>>>>> On 22.07.2024 13:46, Mark Millard wrote:
>>>>>
>>>>>> On Jul 21, 2024, at 22:59, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>>>
>>>>>>> I don't want to hijack the original thread, so I'm replying in a new one.
>>>>>>>
>>>>>>> My Tegra tracks current and has been running 24/7, building kernel/world and kde5 in a loop, for a few years now. But I have never encountered the aforementioned lockup in native armv7.
>>>>>>>
>>>>>>> I have seen a usermode mutex lockup in an arm32 jail on aarch64, but only very rarely (once a month or so), and all my attempts to reproduce it in a more deterministic way have failed. Also, I don't think I've ever seen this with the debug version of libc.
>>>>>>>
>>>>>>> Unfortunately I also failed to reproduce the given lockup using dlopen_test.c, on either native armv7 or an arm32 jail.
>>>>>>>
>>>>>>> Michal Meloun
>>>>>>>
>>>>>> What is the output of:
>>>>>> # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>>>> in your armv7 context(s)? Does it include the likes of:
>>>>>> QUOTE
>>>>>> Symbol table '.symtab' contains 911 entries:
>>>>>> 903: 000000000001b9ac 16 FUNC GLOBAL DEFAULT 11 _rtld_get_stack_prot
>>>>>> END QUOTE
>>>>>> vs. not?
>>>>>> Note that the "debug version of libc" being involved likely means that
>>>>>> DEBUG_FLAGS was defined. That in turn likely means that strip is not
>>>>>> being used. In such a case, I expect that the .symtab entry for
>>>>>> _rtld_get_stack_prot (and more) exists for such a context.
>>>>>>
>>>>> At this time, I have the standard (thus stripped, non-debug) version of the runtime linker library installed. Thus it has only a dynamic relocation record for _rtld_get_stack_prot:
>>>>>
>>>>> root@tegra124:~/dlopen_test # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>>> ELF Header:
>>>>> Elf file type is DYN (Shared object file)
>>>>> Entry point 0x1449c
>>>>> There are 10 program headers, starting at offset 52
>>>>> Program Headers:
>>>>> There are 23 section headers, starting at offset 0x1a448:
>>>>> Section Headers:
>>>>> Key to Flags:
>>>>> Dynamic section at offset 0x19fa4 contains 15 entries:
>>>>> Relocation section (.rel.dyn):
>>>>> r_offset r_info r_type st_value st_name
>>>>> Symbol table '.dynsym' contains 27 entries:
>>>>> 5: 000000000001ba0c 16 FUNC GLOBAL DEFAULT 12 _rtld_get_stack_prot@@FBSDprivate_1.0 (11)
>>>>> Notes at offset 0x00000174 with length 0x00000018:
>>>>> Histogram for bucket list length (total of 6 buckets):
>>>>> Histogram for bucket list length (total of 27 buckets):
>>>>> Version symbol section (.gnu.version):
>>>>> Version definition section (.gnu.version_d):
>>>>> Attribute Section: aeabi
>>>>>
>>>>> ------
>>>>>
>>>>> root@tegra124:~/dlopen_test # ./dlopen_test
>>>>> root@tegra124:~/dlopen_test #
>>>>>
>>>> Just to be sure . . .
>>>> Did you at some point "pkg install cairo" (or analogous) so that
>>>> the following (or some vintage) were in place?
>>>> # ls -lodT /usr/local/lib/libcairo.so*
>>>> lrwxr-xr-x 1 root wheel - 21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so -> libcairo.so.2.11704.0
>>>> lrwxr-xr-x 1 root wheel - 21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2 -> libcairo.so.2.11704.0
>>>> -rwxr-xr-x 1 root wheel - 1118272 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2.11704.0
>>>> # file /usr/local/lib/libcairo.so.2.11704.0
>>>> /usr/local/lib/libcairo.so.2.11704.0: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (FreeBSD), dynamically linked, for FreeBSD 15.0 (1500018), stripped
>>>> (Installing cairo would also install other things it needs.)
>>>> For the failing contexts, the a.out from dlopen_test.c will only
>>>> hang if the library (and what it requires) is actually there to
>>>> load.
>>>>
>>> Yep, I have cairo installed (but compiled from sources, not installed by pkg). And I have verified that dlopen() returns success.
>>> In the meantime I tried all combinations (debug/stripped) of ld_elf and libthr. All combinations work without problems on the native system and in the arm32 jail.
>>>
>> Thanks for the information. My personal builds, which are the
>> ones that work in my testing, are built on aarch64 as armv7
>> instead of on amd64. The known failing ones are built on amd64.
>> But I've no more specific information suggesting a tie to the
>> type of build host for the world used.
>>
>>
>>> Btw, gdb has long had problems with stepping inside ld_elf. It's better to run the test program without it and connect to the test program to get the "correct" stack trace.
>>>
>>>
>> In part I was deliberately exploring what sequence leads to the
>> hangups vs. lack of hangups and the like: more context than a
>> backtrace of the stuck state can provide.
>>
>> But doing "./a.out &" and then "gdb -p..." to attach to it:
>>
>> _umtx_op () at _umtx_op.S:4
>>
>> warning: 4 _umtx_op.S: No such file or directory
>> (gdb) bt
>> #0 _umtx_op () at _umtx_op.S:4
>> #1 0x2036845c in _umtx_op_err (obj=0x4, op=12, val=0, uaddr=0x0, uaddr2=0x0) at /home/pkgbuild/worktrees/main/lib/libsys/_umtx_op_err.c:36
>> #2 0x20115da8 in __thr_rwlock_rdlock (rwlock=0x4, rwlock@entry=0x20137c40, flags=3, tsp=<optimized out>) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.c:294
>> #3 0x2010ebf4 in _thr_rwlock_rdlock (rwlock=0x20137c40, flags=0, tsp=0x0) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.h:229
>> #4 _thr_rtld_rlock_acquire (lock=0x20137c40) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_rtld.c:121
>> #5 0x20060788 in rlock_acquire (lock=0x2008af10 <rtld_locks>, lockstate=lockstate@entry=0xffffd114) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld_lock.c:259
>> #6 0x20059098 in _rtld_bind (obj=0x2008f404, reloff=496) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld.c:1035
>> #7 0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
>> #8 0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
>> #9 0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
>> . . .
>>
>> It does not seem significantly different than I'd reported
>> for the hungup state.
>>
>> An issue here is that the pkgbase world possibly is -O2 based
>> despite having debug information (but is stripped). This can
>> make details less reliable. So, for example, the rwlock=0x4
>> vs. rwlock@entry=0x20137c40 for __thr_rwlock_rdlock could well
>> be suspect.
>>
>>
>
> IMHO, -O2 shouldn't be able to modify function arguments for public functions, so <guessing> this memory corruption fits perfectly with the observed behavior</guessing>.

It is not a memory corruption. r0 is "argument 1/scratch register/result" and
the code in question in my example is (__thr_rwlock_rdlock via disass /s use):

280 {
   0x20115d50 <+0>: push {r11, lr}
   0x20115d54 <+4>: mov r11, sp
   0x20115d58 <+8>: sub sp, sp, #32
   0x20115d5c <+12>: mov r12, r1
. . .
291 tm_p = &timeout;
292 tm_size = sizeof(timeout);
293 }
294 return (_umtx_op_err(rwlock, UMTX_OP_RW_RDLOCK, flags,
   0x20115d98 <+72>: str r1, [sp]
   0x20115d9c <+76>: mov r1, #12
   0x20115da0 <+80>: mov r2, r12
   0x20115da4 <+84>: bl 0x201167a0
=> 0x20115da8 <+88>: mov sp, r11
   0x20115dac <+92>: pop {r11, pc}

After the "bl 0x201167a0" the value of r0 is the return
value from 0x201167a0, not the first argument value
for 0x20115d50. Better reporting would indicate that
rwlock was <optimized out> at that point: locally
the value has not been preserved because there is no
further use of it.

But such is the kind of thing I expect to run into for
the likes of -O2 use with debug information.

Anyway, _umtx_op_err returned the 0x4 value that is shown
for rwlock.
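
For a C-level picture of the same point, here is a minimal sketch. It is
not the libthr source: sketch_umtx_op_err and sketch_rwlock_rdlock are
hypothetical stand-ins, and the returned 4 is an arbitrary value picked
only to mirror the 0x4 in the backtrace. The first parameter is last used
as an argument to the call, so an optimizing compiler has no reason to
keep a copy of it once the callee's result comes back in the same
register.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for _umtx_op_err(): returns an errno-style
 * value. The 4 is arbitrary, chosen to mirror the backtrace.
 */
int
sketch_umtx_op_err(void *obj, int op, unsigned long val)
{
	(void)obj;
	(void)op;
	(void)val;
	return (4);
}

/*
 * Hypothetical analog of __thr_rwlock_rdlock(). 'rwlock' is dead after
 * it has been passed to the callee: on 32-bit ARM the first argument
 * and the return value both live in r0, so after the call r0 holds the
 * callee's result and the original pointer need not be saved anywhere.
 */
int
sketch_rwlock_rdlock(void *rwlock, int flags)
{
	assert(rwlock != NULL);
	/* 12 mirrors the "mov r1, #12" (UMTX_OP_RW_RDLOCK) shown above. */
	return (sketch_umtx_op_err(rwlock, 12, (unsigned long)flags));
}
```

Calling sketch_rwlock_rdlock(&some_object, 3) just hands back whatever
the callee returned. A debugger that reads r0 after the call and labels
it "rwlock" is then showing the return value, not the argument.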

> But, out of curiosity, a quick look at _thr_rwlock_tryrdlock() in thr_umtx.h:208 makes me wonder: How is the "state" variable inside the loop guaranteed to be updated? IMHO nothing inside the loop emits a global memory modification attribute, so the compiler is free to move the assignment to a "state" variable outside the loop.
> Kib, please, do you have any comment on this?
> Michal Meloun
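
I have not checked how thr_umtx.h actually reloads the state, so the
following is only a sketch of the general technique, written with C11
atomics rather than FreeBSD's own atomic primitives. The struct, the
flag values and the sketch_* names are all hypothetical. The point it
illustrates: when every reload of the state goes through an atomic
access, the compiler may not hoist it out of the loop. Here the
compare-exchange itself writes the freshly observed value back into
'state' on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag layout; not the real struct urwlock encoding. */
#define SKETCH_WRITE_OWNER	0x80000000u
#define SKETCH_MAX_READERS	0x1fffffffu
#define SKETCH_READER_COUNT(s)	((s) & SKETCH_MAX_READERS)

struct sketch_rwlock {
	_Atomic uint32_t state;
};

/* Try to take a read lock: 0 on success, EBUSY or EAGAIN otherwise. */
int
sketch_tryrdlock(struct sketch_rwlock *lock)
{
	uint32_t state;

	assert(lock != NULL);
	state = atomic_load_explicit(&lock->state, memory_order_relaxed);
	while ((state & SKETCH_WRITE_OWNER) == 0) {
		if (SKETCH_READER_COUNT(state) == SKETCH_MAX_READERS)
			return (EAGAIN);
		/*
		 * On failure the current value of lock->state is stored
		 * into 'state', so each iteration tests fresh data.
		 */
		if (atomic_compare_exchange_weak_explicit(&lock->state,
		    &state, state + 1, memory_order_acquire,
		    memory_order_relaxed))
			return (0);
	}
	return (EBUSY);
}
```

With a zero-initialized lock the first call returns 0 and leaves the
reader count at 1. With the hypothetical write-owner bit set it returns
EBUSY without modifying the state.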



===
Mark Millard
marklmi at yahoo.com



