Date:      Mon, 22 Jul 2024 21:36:00 +0200
From:      Michal Meloun <meloun.michal@gmail.com>
To:        Mark Millard <marklmi@yahoo.com>, mmel@freebsd.org
Cc:        FreeBSD Current <freebsd-current@freebsd.org>, "freebsd-arm@freebsd.org" <freebsd-arm@freebsd.org>, "kib@freebsd.org >> Konstantin Belousov" <kib@freebsd.org>
Subject:   Re: armv7-on-aarch64 stuck at urdlck
Message-ID:  <33251aa3-681f-4d17-afe9-953490afeaf0@gmail.com>
In-Reply-To: <C0B42CBB-8F12-4597-A04B-26F2107E176E@yahoo.com>
References:  <724db42b-5550-4381-8277-2971e6b3e8f1@freebsd.org> <B5E2275D-21F0-43C8-AF06-A45DB7448D66@yahoo.com> <86185657-e521-466b-89e2-f291aaac10a6@freebsd.org> <0EF18174-8735-46A4-BD71-FFA3472B319F@yahoo.com> <a1b978fe-ff54-4112-860c-b09500d89d0b@freebsd.org> <C0B42CBB-8F12-4597-A04B-26F2107E176E@yahoo.com>

This is a multi-part message in MIME format.
--------------RASKq3j20OKMmUvOOePU2MTK
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit


On 22. 7. 2024 19:27, Mark Millard wrote:
> On Jul 22, 2024, at 09:41, meloun.michal@gmail.com wrote:
>
>> On 22.07.2024 18:26, Mark Millard wrote:
>>> On Jul 22, 2024, at 06:40, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>> On 22.07.2024 13:46, Mark Millard wrote:
>>>>> On Jul 21, 2024, at 22:59, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>>> I don't want to hijack the original thread, so I'm replying in a new one.
>>>>>>
>>>>>> My Tegra tracks -CURRENT and has been running 24/7 for a few years now, building kernel/world and kde5 in a loop. I have never encountered the aforementioned lockup on native armv7.
>>>>>>
>>>>>> I have seen a usermode mutex lockup in an arm32 jail on aarch64, but only very rarely (once a month or so), and all my attempts to reproduce it more deterministically have failed. Also, I don't think I've ever seen it with the debug version of libc.
>>>>>>
>>>>>> Unfortunately, I also failed to reproduce the reported lockup using dlopen_test.c, either on native armv7 or in an arm32 jail.
>>>>>>
>>>>>> Michal Meloun
>>>>> What is the output of:
>>>>> # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>>> in your armv7 context(s)? Does it include the likes of:
>>>>> QUOTE
>>>>> Symbol table '.symtab' contains 911 entries:
>>>>>   903: 000000000001b9ac    16 FUNC    GLOBAL DEFAULT   11 _rtld_get_stack_prot
>>>>> END QUOTE
>>>>> vs. not?
>>>>> Note that the "debug version of libc" being involved likely means that
>>>>> DEBUG_FLAGS was defined. That in turn likely means that strip is not
>>>>> being used. In such a case, I expect that the .symtab entry for
>>>>> _rtld_get_stack_prot (and more) exists for such a context.
>>>> At this time, I have the standard (thus stripped, non-debug) version of the runtime linker installed. Thus it has only a dynamic symbol table (.dynsym) entry for _rtld_get_stack_prot:
>>>>
>>>> root@tegra124:~/dlopen_test # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>> ELF Header:
>>>> Elf file type is DYN (Shared object file)
>>>> Entry point 0x1449c
>>>> There are 10 program headers, starting at offset 52
>>>> Program Headers:
>>>> There are 23 section headers, starting at offset 0x1a448:
>>>> Section Headers:
>>>> Key to Flags:
>>>> Dynamic section at offset 0x19fa4 contains 15 entries:
>>>> Relocation section (.rel.dyn):
>>>> r_offset r_info   r_type              st_value st_name
>>>> Symbol table '.dynsym' contains 27 entries:
>>>>      5: 000000000001ba0c    16 FUNC    GLOBAL DEFAULT   12 _rtld_get_stack_prot@@FBSDprivate_1.0 (11)
>>>> Notes at offset 0x00000174 with length 0x00000018:
>>>> Histogram for bucket list length (total of 6 buckets):
>>>> Histogram for bucket list length (total of 27 buckets):
>>>> Version symbol section (.gnu.version):
>>>> Version definition section (.gnu.version_d):
>>>> Attribute Section: aeabi
>>>>
>>>> ------
>>>>
>>>> root@tegra124:~/dlopen_test # ./dlopen_test
>>>> root@tegra124:~/dlopen_test #
>>> Just to be sure . . .
>>> Did you at some point "pkg install cairo" (or analogous) so that
>>> the following (or some vintage) were in place?
>>> # ls -lodT /usr/local/lib/libcairo.so*
>>> lrwxr-xr-x  1 root wheel -      21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so -> libcairo.so.2.11704.0
>>> lrwxr-xr-x  1 root wheel -      21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2 -> libcairo.so.2.11704.0
>>> -rwxr-xr-x  1 root wheel - 1118272 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2.11704.0
>>> # file /usr/local/lib/libcairo.so.2.11704.0
>>> /usr/local/lib/libcairo.so.2.11704.0: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (FreeBSD), dynamically linked, for FreeBSD 15.0 (1500018), stripped
>>> (Installing cairo would also install other things it needs.)
>>> For the failing contexts, the a.out from dlopen_test.c will only
>>> hang if the library (and what it requires) is actually there to
>>> load.
>> Yep, I have cairo installed (but compiled from sources, not installed by pkg), and I have verified that dlopen() returns success.
>> In the meantime I tried all combinations (debug/stripped) of ld_elf and libthr. All combinations work without problems on the native system and in the arm32 jail.
> Thanks for the information. My personal builds, which are the
> ones that work in my testing, are built on aarch64 as armv7
> instead of on amd64. The known failing ones are built on amd64.
> But I've no more specific information suggesting a tie to the
> type of build host for the world used.
>
>> Btw, gdb has long had problems with stepping inside ld_elf. It's better to start the test program without gdb and attach to it afterwards to get the "correct" stack trace.
>>
> In part I was deliberately exploring what sequence leads to the
> hangups vs. lack of hangups and the like: more context than a
> backtrace of the stuck state can provide.
>
> But doing "./a.out &" and then "gdb -p..." to attach to it:
>
> _umtx_op () at _umtx_op.S:4
>
> warning: 4 _umtx_op.S: No such file or directory
> (gdb) bt
> #0  _umtx_op () at _umtx_op.S:4
> #1  0x2036845c in _umtx_op_err (obj=0x4, op=12, val=0, uaddr=0x0, uaddr2=0x0) at /home/pkgbuild/worktrees/main/lib/libsys/_umtx_op_err.c:36
> #2  0x20115da8 in __thr_rwlock_rdlock (rwlock=0x4, rwlock@entry=0x20137c40, flags=3, tsp=<optimized out>) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.c:294
> #3  0x2010ebf4 in _thr_rwlock_rdlock (rwlock=0x20137c40, flags=0, tsp=0x0) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.h:229
> #4  _thr_rtld_rlock_acquire (lock=0x20137c40) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_rtld.c:121
> #5  0x20060788 in rlock_acquire (lock=0x2008af10 <rtld_locks>, lockstate=lockstate@entry=0xffffd114) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld_lock.c:259
> #6  0x20059098 in _rtld_bind (obj=0x2008f404, reloff=496) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld.c:1035
> #7  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> #8  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> #9  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> . . .
>
> It does not seem significantly different than I'd reported
> for the hungup state.
>
> An issue here is that the pkgbase world possibly is -O2 based
> despite having debug information (but is stripped). This can
> make details less reliable. So, for example, the rwlock=0x4
> vs. rwlock@entry=0x20137c40 for __thr_rwlock_rdlock could well
> be suspect.
>

IMHO, -O2 shouldn't be able to modify function arguments for public 
functions, so <guessing> this memory corruption fits perfectly with the 
observed behavior</guessing>.

But, out of curiosity, a quick look at _thr_rwlock_tryrdlock() in
thr_umtx.h:208 makes me wonder: how is the "state" variable inside the
loop guaranteed to be updated? IMHO nothing inside the loop tells the
compiler that global memory may have been modified, so the compiler is
free to hoist the load into the "state" variable out of the loop.
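
To make the question concrete, here is a minimal sketch of the loop shape
I mean, written with C11 atomics. It is NOT the libthr code: the
demo_* names, the DEMO_* constants and the bit layout are made up for
illustration. In this form the reload of "state" cannot be hoisted,
because the compare-exchange itself writes the freshly observed value
back into "state" whenever it fails. Whether the real loop is equally
safe would depend on whether its compare-and-set primitive acts as a
compiler barrier, which is exactly what I am unsure about.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this illustration. */
#define DEMO_WRITE_OWNER	0x80000000u
#define DEMO_MAX_READERS	0x1fffffffu
#define DEMO_READER_COUNT(s)	((s) & DEMO_MAX_READERS)

struct demo_rwlock {
	_Atomic uint32_t state;	/* writer flag plus reader count */
};

/* Try to take a read lock without blocking; returns 0, EAGAIN or EBUSY. */
static int
demo_tryrdlock(struct demo_rwlock *l)
{
	/* Atomic load: the compiler may not assume the value stays fixed. */
	uint32_t state = atomic_load_explicit(&l->state,
	    memory_order_relaxed);

	while ((state & DEMO_WRITE_OWNER) == 0) {
		if (DEMO_READER_COUNT(state) == DEMO_MAX_READERS)
			return (EAGAIN);
		/*
		 * On failure the compare-exchange stores the value it
		 * actually found into "state", so every iteration works
		 * with fresh data and nothing can be moved out of the loop.
		 */
		if (atomic_compare_exchange_weak_explicit(&l->state, &state,
		    state + 1, memory_order_acquire, memory_order_relaxed))
			return (0);
	}
	return (EBUSY);
}
```

With a zero-initialized lock the first call returns 0 and leaves the
reader count at 1; once DEMO_WRITE_OWNER is set in the state word, the
call returns EBUSY.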

Kib, please, do you have any comment on this?

Michal Meloun

--------------RASKq3j20OKMmUvOOePU2MTK--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?33251aa3-681f-4d17-afe9-953490afeaf0>