Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 15 Mar 2014 23:59:34 +0100
From:      Jilles Tjoelker <jilles@stack.nl>
To:        "Meyer, Conrad" <conrad.meyer@isilon.com>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, Bryan Drewery <bdrewery@FreeBSD.org>
Subject:   Re: [PATCH] amd64/pcpu.h: Use Clang builtins for clarity when referencing thread's pcpu
Message-ID:  <20140315225934.GA22964@stack.nl>
In-Reply-To: <1394821826-19412-1-git-send-email-conrad.meyer@isilon.com>
References:  <1394821826-19412-1-git-send-email-conrad.meyer@isilon.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Mar 14, 2014 at 06:31:08PM +0000, Meyer, Conrad wrote:
> We can efficiently reference thread-local pcpu members via the %gs
> register with Clang-annotated C code, in place of inline GNU assembly.

> Motivations:
>   - Use C in leiu of inline assembly for clarity
>   - Clang's static analyser may be better able to understand PCPU_*
>     macros using the C constructs rather than inline assembly
>     (unverified)

> Sponsored by: EMC/Isilon storage division
> Signed-off-by: Conrad Meyer <conrad.meyer@isilon.com>
> Reviewed-by: Max Laier <mlaier@FreeBSD.org>
> ---
> This is more of a "what do you think?" than a pull request. It seems
> like using annotated C instead of asm is nice (in particular, Clang
> detects casts from pointers typed with one segment to another, or
> unsegmented type). On the other hand, this is code that doesn't change
> frequently, and we may still need to support GCC for some time. So
> adding a second, parallel implementation just doubles room for bugs.

> Open questions:
>   - How long is GCC intended to be supported as a compiler?
>   - How atomic does PCPU_INC() need to be? It looks like it updates
>     cpu-local counters; as long as it's a single asm instruction, should
>     it be fine w.r.t. interrupts? The existing implementation does NOT
>     use the 'lock; ' prefix.

> [snip]
> +/*
> + * Adds the value to the per-cpu counter name.  The implementation
> + * must be atomic with respect to interrupts.
> + */
> +#define	__PCPU_ADD(name, val) do {					\
> +	__pcpu_type(name) __val;					\
> +	volatile __pcpu_type(name) __GS_RELATIVE *__ptr;		\
> +									\
> +	__val = (val);							\
> +	__ptr = __PCPU_PTR(name);					\
> +	*__ptr += __val;						\
> +} while (0)
> +
> +/*
> + * Increments the value of the per-cpu counter name.  The implementation
> + * must be atomic with respect to interrupts.
> + */
> +#define	__PCPU_INC(name) __PCPU_ADD(name, 1)

This code does not force the compiler to generate the required
read-modify-write instruction. The compiler may generate a load, add and
store, which may corrupt the counter if the thread is preempted between
the load and the store.

With the %gs: prefix, a plain read-modify-write instruction is indeed
sufficient. It is not possible to preempt the instruction between
determining the linear address of the counter and incrementing the
counter, or between reading and writing the value; therefore, each core
only touches its own counter and a lock prefix is not required. (This is
basically the same reason why the lock prefixes are #ifdef'ed out in a
uniprocessor kernel.)

If a compiler wanted to support this from C, I suggest extending the
memory_order type with one or more values for synchronization within a
core or within a thread which generate read-modify-write instructions
without lock prefix on x86.

-- 
Jilles Tjoelker



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140315225934.GA22964>