Date:      Mon, 27 Apr 2026 12:54:22 +0000
From:      Jessica Clarke <jrtc27@FreeBSD.org>
To:        src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
Subject:   git: 551d47c5677a - main - arm64: Ditch arm64-specific unsound PCPU optimisation
Message-ID:  <69ef5c7e.26b9d.7a4ba55a@gitrepo.freebsd.org>


The branch main has been updated by jrtc27:

URL: https://cgit.FreeBSD.org/src/commit/?id=551d47c5677a5eaf0a1ed2ea3b2b1406b192764d

commit 551d47c5677a5eaf0a1ed2ea3b2b1406b192764d
Author:     Jessica Clarke <jrtc27@FreeBSD.org>
AuthorDate: 2026-04-27 12:53:29 +0000
Commit:     Jessica Clarke <jrtc27@FreeBSD.org>
CommitDate: 2026-04-27 12:53:29 +0000

    arm64: Ditch arm64-specific unsound PCPU optimisation
    
    The current arm64 PCPU implementation uses a global register asm
    variable to use x18, which we reserve with -ffixed-x18, from C. Inside a
    critical_enter() or sched_pin(), it is vital that any PCPU reads use the
    right PCPU pointer, as often the whole point of the critical_enter() or
    sched_pin() is to ensure consistent PCPU use (e.g. for SMR it relies on
    zpcpu giving the same SMR state). critical_enter() and sched_pin() both
    include atomic_interrupt_fence(), i.e. asm volatile("" ::: "memory"),
    barriers to ensure that memory accesses don't get moved by the compiler
    outside the critical section, which on most architectures will also
    order the read of the PCPU pointer itself (whether due to the read being
    another asm volatile statement, or due to using a segment-relative
    memory access as on x86). However, this approach on arm64 is in no sense
    a memory access, and therefore the register access is not ordered with
    respect to the critical_enter() or sched_pin(), or more specifically
    the curthread->td_critnest++ / curthread->td_pinned++ within.
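    For reference, the pattern in question boils down to something like
    the following (a simplified sketch, not the exact in-tree code):
    
        register struct pcpu *pcpup __asm ("x18");
    
        static __inline void
        critical_enter(void)
        {
                curthread->td_critnest++;
                /* Compiler barrier: asm volatile("" ::: "memory") */
                atomic_interrupt_fence();
        }
    
    The fence constrains memory accesses only; reading the register
    variable pcpup is not a memory access, so the compiler is free to
    move that read across it.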
    
    In practice upstream today this works out ok because the read of x18 is
    inlined into the actual PCPU_GET/ADD/SET memory accesses (i.e. you will
    get something like ldr xN, [x18, #imm-or-xM] for PCPU_GET, etc.), and
    since *that* instruction is ordered properly due to being a memory
    access, the x18 ends up being read in the right place. However, that is
    not in any way guaranteed, it just relies on the hope that compiler
    optimisations will be perfect at inlining the use. Moreover, PCPU_PTR is
    definitely not a memory access in this world, it's just pointer
    arithmetic on x18, and so that has nothing ordering it. This can be
    observed with the following test function compiled into the kernel:
    
        void
        pcpu_test(void)
        {
                extern void __weak_symbol use_pcpu_ptr(void *);
                critical_enter();
                use_pcpu_ptr(PCPU_PTR(curthread));
                critical_exit();
        }
    
    Obviously, this is a bit contrived as you could just read curthread
    directly via its atomic definition that bypasses any worries about PCPU
    atomicity, but it illustrates the point. With the in-tree LLVM*, this
    ends up being compiled for me to:
    
        paciasp
        stp     x29, x30, [sp, #-0x10]!
        mov     x29, sp
        ldr     x8, [x18]
        ldr     w9, [x8, #0x4fc]
        mov     x0, x18
        add     w9, w9, #0x1
        str     w9, [x8, #0x4fc]
        bl      use_pcpu_ptr
        ...
    
    Note that, although the PCPU_PTR was within the critical section in the
    C source, the read of x18 into x0, the argument register passed to
    use_pcpu_ptr, has been hoisted to before the str, which is storing the
    new, incremented, value of td_critnest to curthread, and so there is a
    window within which we have to hope the thread is not preempted and
    migrated to a different CPU, otherwise it will pass a pointer to the
    wrong CPU's pc_curthread PCPU member.
    
    Initially it would seem as though the solution to this would be to add
    an additional barrier to critical_enter() / sched_pin() to ensure the
    register reads could not be hoisted like this. However, I have not been
    able to find a sequence that works reliably across both GCC and Clang,
    independent of optimisation level. Using inline asm with x18 marked as a
    clobber, using "=r"(pcpup), and using "+r"(pcpup) all run into various
    issues; some combinations don't actually seem to be a barrier, and for
    Clang at -O0 some combinations will actually generate writes to x18**,
    at which point you then have to hope that the kernel is compiled with
    optimisations, and that the redundant writes are optimised away such
    that x18 is just passed through. But that just gets us back to hoping
    optimisation works, which isn't a solution to the problem, it just
    trades one point of fragility for another.
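    For concreteness, the sort of additional barrier tried was along
    these lines (hypothetical; as described above, none of the variants
    proved reliable across compilers and optimisation levels):
    
        /* Attempt: force pcpup to be (re)read here, not hoisted. */
        static __inline void
        pcpu_fence(void)
        {
                __asm __volatile("" : "+r"(pcpup) : : "memory");
        }
    
    with the "=r" output and x18-clobber forms being the obvious
    variations.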
    
    In talking to GCC developers, who seemed rather horrified by the
    implications of trying to do this (which is effectively "register
    volatile", a combination that's explicitly forbidden), we could not find
    a solution to this, and so I have concluded that the only reliable
    way to
    have a sound PCPU implementation is to ditch this optimisation and
    follow other non-x86 architectures in using inline asm in one form or
    another; specifically, this adopts riscv's approach of just calling
    get_pcpu(), which, curiously, was already implemented in inline asm here
    on arm64, rather than reading pcpup.
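    Concretely, arm64's existing get_pcpu() is implemented with inline
    asm along the following lines (paraphrased from the header):
    
        static inline struct pcpu *
        get_pcpu(void)
        {
                struct pcpu *pcpu;
    
                __asm __volatile("mov	%0, x18" : "=&r"(pcpu));
                return (pcpu);
        }
    
    Since the read of x18 is itself an asm volatile statement, it is
    ordered with respect to the compiler barriers in critical_enter()
    and sched_pin().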
    
    Anyone who feels strongly enough about PCPU performance is welcome to
    try to find a working approach, but such proposals should be heavily
    scrutinised to be certain that they won't come back to bite us in
    future. In particular, this caused a lot of problems downstream in
    CheriBSD's experimental compartmentalised kernel, which is trialling
    interposing on PCPU accesses in order to restrict access within
    compartments. As a result, even PCPU_GET/SET/ADD can look like PCPU_PTR,
    as they pass an opaque PCPU reference to wrapper functions, and so this
    case gets hit all over the kernel, giving highly-confusing panics with
    locks that aren't owned by the current thread or SMR use allegedly not
    within an smr_enter().
    
    The ia64 port encountered the same issue and reached the same conclusion
    in e31ece45b7a4 ("Fix the PCPU access macros."), though went to the
    trouble of trying to fold the offset into the inline assembly (assuming
    it fit, with no fallback if not, since it's using the add pseudo-op that
    will be expanded to either adds with a 14-bit immediate or, if somehow
    that doesn't fit, addl with a 22-bit immediate). Curiously though it
    left pcpup around as a footgun. sparc64 had similar code but was never
    fixed. It also defined a curpcb in the same manner which was presumably
    similarly broken, but looks to have been entirely unreferenced from C,
    only referenced in actual assembly files. Alpha also had the same
    design, but it was removed whilst critical_enter() was extern rather
    than static inline so uses of the pointer could not have been hoisted,
    and whilst sched_pin() didn't have any form of atomic_interrupt_fence()
    to even try to make PCPU well-ordered.
    
     * At time of writing, when that was LLVM 19, not verified at time of
       commit with LLVM 21.
    
    ** For "+r"(pcpup), Clang's initial code generation is to do:
    
           mov xTmp1, x18
           mov x18, xTmp1
           /* asm (empty) */
           mov xTmp2, x18
           mov x18, xTmp2
    
       since its interpretation of what that means is "read the value of
       pcpup, and make sure that value is in x18 for the duration of the
       assembly due to the asm("x18") on pcpup", and similarly for the output
       side.
    
    Reviewed by:    andrew, jhb
    MFC after:      1 week
    Differential Revision:  https://reviews.freebsd.org/D56601
---
 sys/arm64/include/pcpu.h | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
index 73399d2c3f8c..286a40e7de3d 100644
--- a/sys/arm64/include/pcpu.h
+++ b/sys/arm64/include/pcpu.h
@@ -58,8 +58,6 @@ struct debug_monitor_state;
 struct pcb;
 struct pcpu;
 
-register struct pcpu *pcpup __asm ("x18");
-
 static inline struct pcpu *
 get_pcpu(void)
 {
@@ -80,10 +78,10 @@ get_curthread(void)
 
 #define	curthread get_curthread()
 
-#define	PCPU_GET(member)	(pcpup->pc_ ## member)
-#define	PCPU_ADD(member, value)	(pcpup->pc_ ## member += (value))
-#define	PCPU_PTR(member)	(&pcpup->pc_ ## member)
-#define	PCPU_SET(member,value)	(pcpup->pc_ ## member = (value))
+#define	PCPU_GET(member)	(get_pcpu()->pc_ ## member)
+#define	PCPU_ADD(member, value)	(get_pcpu()->pc_ ## member += (value))
+#define	PCPU_PTR(member)	(&get_pcpu()->pc_ ## member)
+#define	PCPU_SET(member,value)	(get_pcpu()->pc_ ## member = (value))
 
 #define	PCPU_GET_MPIDR(pc)	((pc)->pc_mpidr)
 


