From: Michal Meloun <meloun.michal@gmail.com>
Date: Mon, 22 Jul 2024 21:36:00 +0200
Subject: Re: armv7-on-aarch64 stuck at urdlck
To: Mark Millard, mmel@freebsd.org
Cc: FreeBSD Current, freebsd-arm@freebsd.org, Konstantin Belousov <kib@freebsd.org>
List-Id: Porting FreeBSD to ARM processors


On 22. 7. 2024 19:27, Mark Millard wrote:
> On Jul 22, 2024, at 09:41, meloun.michal@gmail.com wrote:
>
>> On 22.07.2024 18:26, Mark Millard wrote:
>>> On Jul 22, 2024, at 06:40, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>> On 22.07.2024 13:46, Mark Millard wrote:
>>>>> On Jul 21, 2024, at 22:59, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>>> I don't want to hijack the original thread, so I'm replying in a new one.
>>>>>>
>>>>>> My Tegra board, which tracks -CURRENT, has been running 24/7 for a few years now, building kernel/world and KDE5 in a loop. But I have never encountered the aforementioned lockup on native armv7.
>>>>>>
>>>>>> I have seen a user-mode mutex lockup in an arm32 jail on aarch64, but only very rarely (once a month or so), and all my attempts to reproduce it more deterministically have failed. Also, I don't think I've ever seen it with the debug version of libc.
>>>>>>
>>>>>> Unfortunately, I also failed to reproduce the lockup using dlopen_test.c, either on native armv7 or in an arm32 jail.
>>>>>>
>>>>>> Michal Meloun
>>>>> What is the output of:
>>>>> # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>>> in your armv7 context(s)? Does it include the likes of:
>>>>> QUOTE
>>>>> Symbol table '.symtab' contains 911 entries:
>>>>>  903: 000000000001b9ac    16 FUNC    GLOBAL DEFAULT   11 _rtld_get_stack_prot
>>>>> END QUOTE
>>>>> vs. not?
>>>>> Note that the "debug version of libc" being involved likely means that
>>>>> DEBUG_FLAGS was defined. That in turn likely means that strip is not
>>>>> being used. In such a case, I expect that the .symtab entry for
>>>>> _rtld_get_stack_prot (and more) exists for such a context.
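Mark's question (does the installed runtime linker still have a full .symtab, or only .dynsym?) can also be answered without readelf. What follows is a minimal sketch, not a FreeBSD tool. has_section() and read_shdr() are hypothetical helpers that scan an ELF file's section headers for a given section name. strip removes .symtab but keeps .dynsym, so has_section("/libexec/ld-elf.so.1", ".symtab") should return 1 for an unstripped runtime linker and 0 for a stripped one. The sketch makes three assumptions:

- The host <elf.h> provides the Elf32_*/Elf64_* types. glibc ships one, and FreeBSD's should as well.
- The file's byte order matches the host's.
- The file does not use extended section numbering, which the sketch ignores.

```c
#include <elf.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* The fields we need from one section header, independent of ELF class. */
struct shinfo {
    unsigned long name;    /* offset of the section name in .shstrtab */
    unsigned long offset;  /* file offset of the section contents */
    unsigned long size;    /* section size in bytes */
};

/* Hypothetical helper: read section header `idx` (32- or 64-bit layout). */
static int read_shdr(FILE *f, int is64, unsigned long shoff,
                     unsigned long entsize, unsigned long idx,
                     struct shinfo *out)
{
    if (fseek(f, (long)(shoff + idx * entsize), SEEK_SET) != 0)
        return -1;
    if (is64) {
        Elf64_Shdr sh;
        if (fread(&sh, sizeof(sh), 1, f) != 1)
            return -1;
        out->name = sh.sh_name; out->offset = sh.sh_offset; out->size = sh.sh_size;
    } else {
        Elf32_Shdr sh;
        if (fread(&sh, sizeof(sh), 1, f) != 1)
            return -1;
        out->name = sh.sh_name; out->offset = sh.sh_offset; out->size = sh.sh_size;
    }
    return 0;
}

/*
 * Hypothetical helper: returns 1 if the ELF file at `path` has a section
 * called `name`, 0 if it does not, and -1 on I/O or format errors.
 */
static int has_section(const char *path, const char *name)
{
    unsigned char ident[EI_NIDENT];
    unsigned long shoff, entsize, shnum, shstrndx, i;
    struct shinfo strtab, sh;
    char *names = NULL;
    int is64, result = -1;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;
    if (fread(ident, 1, EI_NIDENT, f) != EI_NIDENT ||
        memcmp(ident, ELFMAG, SELFMAG) != 0 ||
        (ident[EI_CLASS] != ELFCLASS32 && ident[EI_CLASS] != ELFCLASS64))
        goto out;
    is64 = (ident[EI_CLASS] == ELFCLASS64);
    rewind(f);
    if (is64) {
        Elf64_Ehdr eh;
        if (fread(&eh, sizeof(eh), 1, f) != 1)
            goto out;
        shoff = eh.e_shoff; entsize = eh.e_shentsize;
        shnum = eh.e_shnum; shstrndx = eh.e_shstrndx;
    } else {
        Elf32_Ehdr eh;
        if (fread(&eh, sizeof(eh), 1, f) != 1)
            goto out;
        shoff = eh.e_shoff; entsize = eh.e_shentsize;
        shnum = eh.e_shnum; shstrndx = eh.e_shstrndx;
    }
    /* Load the section-name string table (.shstrtab). */
    if (shstrndx >= shnum ||
        read_shdr(f, is64, shoff, entsize, shstrndx, &strtab) != 0)
        goto out;
    names = malloc(strtab.size + 1);
    if (names == NULL || fseek(f, (long)strtab.offset, SEEK_SET) != 0 ||
        fread(names, 1, strtab.size, f) != strtab.size)
        goto out;
    names[strtab.size] = '\0';
    /* Compare every section's name against the one we are looking for. */
    result = 0;
    for (i = 0; i < shnum; i++) {
        if (read_shdr(f, is64, shoff, entsize, i, &sh) != 0) {
            result = -1;
            break;
        }
        if (sh.name < strtab.size && strcmp(names + sh.name, name) == 0) {
            result = 1;
            break;
        }
    }
out:
    free(names);
    fclose(f);
    return result;
}
```

A result of 0 for ".symtab" together with 1 for ".dynsym" corresponds to the stripped case. The check only shows whether the file was stripped; it cannot tell whether DEBUG_FLAGS was used for the build.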
>>>> At this time, I have the standard (thus stripped, non-debug) version of the runtime linker installed, so it has only a dynamic symbol table (.dynsym) entry for _rtld_get_stack_prot:
>>>>
>>>> root@tegra124:~/dlopen_test # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>> ELF Header:
>>>> Elf file type is DYN (Shared object file)
>>>> Entry point 0x1449c
>>>> There are 10 program headers, starting at offset 52
>>>> Program Headers:
>>>> There are 23 section headers, starting at offset 0x1a448:
>>>> Section Headers:
>>>> Key to Flags:
>>>> Dynamic section at offset 0x19fa4 contains 15 entries:
>>>> Relocation section (.rel.dyn):
>>>> r_offset r_info   r_type              st_value st_name
>>>> Symbol table '.dynsym' contains 27 entries:
>>>>     5: 000000000001ba0c    16 FUNC    GLOBAL DEFAULT   12 _rtld_get_stack_prot@@FBSDprivate_1.0 (11)
>>>> Notes at offset 0x00000174 with length 0x00000018:
>>>> Histogram for bucket list length (total of 6 buckets):
>>>> Histogram for bucket list length (total of 27 buckets):
>>>> Version symbol section (.gnu.version):
>>>> Version definition section (.gnu.version_d):
>>>> Attribute Section: aeabi
>>>>
>>>> ------
>>>>
>>>> root@tegra124:~/dlopen_test # ./dlopen_test
>>>> root@tegra124:~/dlopen_test #
>>> Just to be sure . . .
>>> Did you at some point "pkg install cairo" (or analogous) so that
>>> the following (or some vintage) were in place?
>>> # ls -lodT /usr/local/lib/libcairo.so*
>>> lrwxr-xr-x  1 root wheel -      21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so -> libcairo.so.2.11704.0
>>> lrwxr-xr-x  1 root wheel -      21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2 -> libcairo.so.2.11704.0
>>> -rwxr-xr-x  1 root wheel - 1118272 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2.11704.0
>>> # file /usr/local/lib/libcairo.so.2.11704.0
>>> /usr/local/lib/libcairo.so.2.11704.0: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (FreeBSD), dynamically linked, for FreeBSD 15.0 (1500018), stripped
>>> (Installing cairo would also install other things it needs.)
>>> For the failing contexts, the a.out from dlopen_test.c will only
>>> hang if the library (and what it requires) is actually there to
>>> load.
>> Yep, I have cairo installed (compiled from source, not installed by pkg), and I have verified that dlopen() returns success.
>> In the meantime I tried all combinations (debug/stripped) of ld_elf and libthr. All combinations work without problems on the native system and in the arm32 jail.
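The precondition Mark mentions (the library must actually load before any hang can occur) can be confirmed with a few lines of C. This is a minimal sketch and not dlopen_test.c, whose source is not shown in this thread. try_dlopen() is a hypothetical helper. It passes RTLD_LAZY, which defers resolution of function symbols until they are first called. On older glibc the program may need to be linked with -ldl; FreeBSD's libc should provide dlopen() directly.

```c
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper: try to load `path` and report the outcome.
 * Returns 0 on success and -1 on failure (after printing dlerror()).
 * A NULL path asks for a handle to the main program itself.
 */
static int try_dlopen(const char *path)
{
    /* RTLD_LAZY: function symbols are resolved on first call. */
    void *handle = dlopen(path, RTLD_LAZY);

    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                path != NULL ? path : "NULL", dlerror());
        return -1;
    }
    dlclose(handle);
    return 0;
}
```

For example, try_dlopen("libcairo.so.2") should return 0 only when cairo and everything it requires are installed and loadable.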
> Thanks for the information. My personal builds, which are the
> ones that work in my testing, are built on aarch64 as armv7
> instead of on amd64. The known failing ones are built on amd64.
> But I've no more specific information suggesting a tie to the
> type of build host for the world used.
>
>> Btw, gdb has long had problems with stepping inside ld_elf. It's better to run the test program without gdb and then attach gdb to the running process to get the "correct" stack trace.
>>
> In part I was deliberately exploring what sequence leads to the
> hangups vs. lack of hangups and the like: more context than a
> backtrace of the stuck state can provide.
>
> But doing "./a.out &" and then "gdb -p..." to attach to it:
>
> _umtx_op () at _umtx_op.S:4
>
> warning: 4 _umtx_op.S: No such file or directory
> (gdb) bt
> #0  _umtx_op () at _umtx_op.S:4
> #1  0x2036845c in _umtx_op_err (obj=0x4, op=12, val=0, uaddr=0x0, uaddr2=0x0) at /home/pkgbuild/worktrees/main/lib/libsys/_umtx_op_err.c:36
> #2  0x20115da8 in __thr_rwlock_rdlock (rwlock=0x4, rwlock@entry=0x20137c40, flags=3, tsp=<optimized out>) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.c:294
> #3  0x2010ebf4 in _thr_rwlock_rdlock (rwlock=0x20137c40, flags=0, tsp=0x0) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.h:229
> #4  _thr_rtld_rlock_acquire (lock=0x20137c40) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_rtld.c:121
> #5  0x20060788 in rlock_acquire (lock=0x2008af10 <rtld_locks>, lockstate=lockstate@entry=0xffffd114) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld_lock.c:259
> #6  0x20059098 in _rtld_bind (obj=0x2008f404, reloff=496) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld.c:1035
> #7  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> #8  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> #9  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> . . .
>
> It does not seem significantly different than I'd reported
> for the hungup state.
>
> An issue here is that the pkgbase world possibly is -O2 based
> despite having debug information (but is stripped). This can
> make details less reliable. So, for example, the rwlock=0x4
> vs. rwlock@entry=0x20137c40 for __thr_rwlock_rdlock could well
> be suspect.


IMHO, -O2 shouldn't be able to modify the arguments of public functions. So (and this is only a guess) the rwlock=0x4 vs. rwlock@entry=0x20137c40 discrepancy looks like memory corruption, which would fit the observed behavior perfectly.

But, out of curiosity, a quick look at _thr_rwlock_tryrdlock() in thr_umtx.h:208 makes me wonder: how is the "state" variable inside the loop guaranteed to be updated? IMHO nothing inside the loop emits a global memory modification attribute (e.g. a "memory" clobber), so the compiler is free to move the assignment to the "state" variable out of the loop.

Kib, could you please comment on this?

Michal Meloun
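The concern about the "state" variable can be illustrated in isolation. Below is a minimal sketch, not the libthr code. It models a non-blocking reader try-lock on a 32-bit state word using C11 <stdatomic.h>. The names and flag values (WRITE_OWNER, MAX_READERS, READER_COUNT, try_rdlock, rdunlock) are hypothetical. The relevant property is that a failed atomic_compare_exchange_weak_explicit() stores the value it actually observed back into `state`. Each retry therefore works with a freshly read value rather than a copy cached before the loop.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: top bit = write owner, low 31 bits = reader count. */
#define WRITE_OWNER      0x80000000u
#define MAX_READERS      0x7fffffffu
#define READER_COUNT(s)  ((s) & MAX_READERS)

/*
 * Try to take a read lock without blocking.
 * Returns 0 on success, EBUSY if a writer owns the lock,
 * or EAGAIN if the reader count is already at its maximum.
 */
static int try_rdlock(_Atomic uint32_t *rw_state)
{
    uint32_t state = atomic_load_explicit(rw_state, memory_order_relaxed);

    while ((state & WRITE_OWNER) == 0) {
        if (READER_COUNT(state) == MAX_READERS)
            return EAGAIN;
        /*
         * On failure (including a spurious one), the CAS writes the value
         * currently in *rw_state into `state`, so the loop condition is
         * re-evaluated against fresh data on the next iteration.
         */
        if (atomic_compare_exchange_weak_explicit(rw_state, &state,
                state + 1, memory_order_acquire, memory_order_relaxed))
            return 0;
    }
    return EBUSY;
}

/* Release one read lock taken by try_rdlock(). */
static void rdunlock(_Atomic uint32_t *rw_state)
{
    atomic_fetch_sub_explicit(rw_state, 1, memory_order_release);
}
```

The sketch does not settle the question for thr_umtx.h. That depends on how the state field and the atomic primitives used there are declared in that tree, for example whether the field is volatile or is only accessed through atomic operations.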
