From owner-freebsd-amd64@FreeBSD.ORG  Thu Jul 19 03:10:40 2012
Return-Path: <owner-freebsd-amd64@FreeBSD.ORG>
Delivered-To: freebsd-amd64@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 70444106566B
	for <freebsd-amd64@freebsd.org>; Thu, 19 Jul 2012 03:10:40 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail34.syd.optusnet.com.au (mail34.syd.optusnet.com.au
	[211.29.133.218])
	by mx1.freebsd.org (Postfix) with ESMTP id E292B8FC15
	for <freebsd-amd64@freebsd.org>; Thu, 19 Jul 2012 03:10:39 +0000 (UTC)
Received: from c122-106-171-246.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-246.carlnfd1.nsw.optusnet.com.au [122.106.171.246])
	by mail34.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q6J3ATmA012239
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 19 Jul 2012 13:10:31 +1000
Date: Thu, 19 Jul 2012 13:10:29 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120718081003.GT2676@deviant.kiev.zoral.com.ua>
Message-ID: <20120719111846.G780@besplex.bde.org>
References: <201207172150.q6HLoFvB096742@freefall.freebsd.org>
	<20120718103210.E834@besplex.bde.org>
	<20120718035220.GO2676@deviant.kiev.zoral.com.ua>
	<20120718140523.X1862@besplex.bde.org>
	<20120718050958.GQ2676@deviant.kiev.zoral.com.ua>
	<20120718153554.O2195@besplex.bde.org>
	<20120718081003.GT2676@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Mark Linimon <linimon@lonesome.com>, freebsd-amd64@freebsd.org
Subject: Re: amd64/169927: siginfo, si_code for fpe errors when error occurs
 using the SSE math processor
X-BeenThere: freebsd-amd64@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Porting FreeBSD to the AMD64 platform <freebsd-amd64.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-amd64>,
	<mailto:freebsd-amd64-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-amd64>
List-Post: <mailto:freebsd-amd64@freebsd.org>
List-Help: <mailto:freebsd-amd64-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-amd64>,
	<mailto:freebsd-amd64-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 19 Jul 2012 03:10:40 -0000

On Wed, 18 Jul 2012, Konstantin Belousov wrote:

> On Wed, Jul 18, 2012 at 04:59:38PM +1000, Bruce Evans wrote:
>> On Wed, 18 Jul 2012, Konstantin Belousov wrote:
>> Putting the definiton in machine/pcpu.h should avoid changing the
>> prerequistes for machine/pcb.h.
>>
>>> #ifndef _KERNEL
>>> /* stuff that *used* to be included by user.h, or is now needed */
>>>
>>> Please note the location in pcb.h an not in machine/pcpu.h, where it
>>> cannot work for technical reasons (struct pcpu is not defined yet).
>>
>> Not applicable -- see above.
> No, this cannot work. machine/pcpu.h defines PCPU_MD_FIELDS which is used
> to provide md part of the struct pcpu.

I see.  Oops.

But this problem is easy to avoid.  There are at least the following
alternatives:
(1) hard-code the offset, as for curthread.  This is ugly, but not too
     hard to maintain.  The offset has been 4 * sizeof(struct thread *),
     for more than 10 years, and you can't change this without breaking
     the ABI.
(2) #define the offset, as in raw asm code.  The offset is #define'd in
     assym.s.  This would give too much namespace pollution and bloat
     (unless a short assym.non-s is used).
(3) use a macro instead of an inline function.  Inline functions tend to
     require much more namespace pollution than macros, and this is an
     example of that.  This is essentially the method that we already
     use.  We define define the macro PCPU_GET().  We could define curpcb
     as PCPU_GET(curpcb).  Instead, we write PCPU_GET(curpcb)
     everywhere that we want the value.

     I didn't like the roto-tilling of the source code for this and
     other pcpu fields, and we didn't do it for curproc (we roto-tilled
     curproc to curthread instead).  Both curthread and curpcb are not
     as volatile as the other fields, so it is technically slightly
     incorrect to use the general volatile PCPU_GET() for them.  We
     already use a de-volatalized PCPU_GET() for curthread.  This was
     originally just an optimization and a de-obfuscation.  PCPU_GET()
     expands to really horrible, hard-to debug code.  It was so horrible
     and bloated that it broke compiling, so peter changed curthread
     to use an inline function in 2003.  This is only done for some
     arches (just amd64 and i386?), and sys/pcpu.h still defines
     curthread as PCPU_GET(curthread) if it is not already defined.
     The inline function was originally equivalent to PCPU_GET().  It
     used a volatile asm and wasn't pure2.  But since it was independent
     of PCPU_GET(), it was easier to optimize.

     To do the same thing for curpcb, provide a PCPU_NONVOLATILE_GET()
     which is the same as PCPU_GET() without the volatile in the asm,
     and #define curpcb as PCPU_NONVOLATILE_GET(curpcb).  Or maybe
     start with the inline function used for curthread, and macroize it.

     This could be used for curthread too, but I think we want to keep it
     as an inline function since the macro expansions are still horrible.

BTW, can you explain the locking for the pcpu access functions?  It seems
to be quite broken, with the volatile in the asms neither necessary nor
sufficient.  PCPU_PTR() seems to be almost unusable, but it is used.
Here is the i386 version of __PCPU_GET():

% #define	__PCPU_GET(name) __extension__ ({				\
% 	__pcpu_type(name) __res;					\
% 	struct __s {							\
% 		u_char	__b[MIN(sizeof(__res), 4)];			\
% 	} __s;								\
% 									\
% 	if (sizeof(__res) == 1 || sizeof(__res) == 2 ||			\
% 	    sizeof(__res) == 4) {					\
% 		__asm __volatile("mov %%fs:%1,%0"			\
% 		    : "=r" (__s)					\
% 		    : "m" (*(struct __s *)(__pcpu_offset(name))));	\
% 		*(struct __s *)(void *)&__res = __s;			\

This part is more or less OK.  It is subtle.  The pcpu data starting at
%fs:0 is volatile (or at least variable), and so is its address.  Its
address is not a C address, but is the constant 0 in %fs-space.  %fs is
equivalent to a volatile C pointer.  When %fs changes, it is normally
the responsibility of the caller to lock things so that this is not a
problem or so that %fs can't change.  (The %fs descriptor doesn't
actually change, but the %fs data does). There are some important special
cases where no locking is required:
- for pc_curthread.  This cannot change (except in context-switching
   code).  The CPU running the code can change unless the caller locks
   things.  Then %fs changes, but pc_curthread is switched so that it
   is the same for the new CPU as for the old CPU.
- for pc_curpcb.  This is switched like pc_curthread.
- I don't know of any other pcpu variables that are as easy to use or
   as obviously correctly locked as the above 2.
- counter variables.  Now you want the %fs pointer to change to track
   the CPU.

% 	} else {							\
% 		__res = *__PCPU_PTR(name);				\

__PCPU_PTR() seems to be almost unusable, and this is not one of the
few cases where it is usable.  Here it is used for wide or unusually-sized
fields where atomic access is especially problematic, yet it does even
less that the above to give atomicity.  First consider using it in all
cases to replace the above:
- using it for fields like pc_curthead would be very broken.  The pointer
   doesn't change, so if there is a context switch between reading the
   pointer and following it, the pointer reads the data for the wrong CPU.
   For pc_cuthread, the new pointer will give the old thread which is
   what curthread should return, but the old pointer will give some unrelated
   thread (whatever the old CPU is running now).
- for counter variables, there is no problem.  We don't care whether the
   pointer is for the new CPU or the old one, provided the access is atomic,
   and here I am only considering the case where accesses are natuarally
   atomic.  (We actually use different access functions for counters so
   that things like incrementing counters is atomic.)
Now consider it for wide data.  The pointer doesn't change so the multiple
accesses needed for the wide data are at least for the same CPU.  This gives
the same atomicity problems as for C pointers.  The caller needs to lock.
But PCPU_GET() is supposed to be a general interface.  Its atomicity
shouldn't depend on the size of the data.

% 	}								\
% 	__res;								\
% })

What happens for arches where PCPU_GET() uses a C pointer?  I don't see
how this can work for fields that are switched like pc_curthread.  Well,
here is what happens in some cases:
- on ia64, PCPU_GET() uses the C pointer pcpu which is a fixed register.
   curthread works because it doesn't use PCPU_GET() (it uses a __pure2
   non-volatile asm similarly to i386).  curpcb works beause it is not
   used.
- sparc64 is similar to ia64 except it gives even more examples of the
   special switched fields: pcpu is C pointer in a fixed register;
   curpcb is a C pointer in another fixed register; curthread uses a
   __pure2 non-volatile asm similarly to ia64.  Why is asm used for
   curthread on ia64 and sparc64?  The asm seems to have no magic, but
   just follows the C pointer.  Ah, this seems to be just bogus.  Old
   versions did have magic -- they used a volatile asm and didn't use
   __pure2, but they were optimized like the x86 versions 2 years ago.

So the problem seems to be well understood for the switched fields, and
to not exist for counters.  Other cases are hopefully handled by locking.
Here are some that aren't:
- PCPU_GET/SET(switchtime) in kern_resource.c.  The accesses are only
   proc locked AFAIK.  switchtime is uint64_t, so accesses to it are
   non-atomic on 32-bit arches.
- PCPU_GET/SET(switchtime) in kern_synch.c and sched_*.c.  Now the
   accesses are thread locked or sched_locked, which had better be enough
   (it is all that locks switching pc_curthread too?).
- in clock code, there is a lot of thread locking, but this is localized
   and not all of the pcpu accesses are covered by it.  Apparently, clock
   code depends on being called in fast interrupt handle context to give
   locking for pcpu.

PCPU accesses are surprisingly rare except for counters, so there don't
seem to be many problems.

Most uses of PCP_PTR() are reasonable, since they are in places like
subr_witness.c that should know to lock.  I would remove the one in
PCPU_GET().  This would break mainly the switchtime accesses on i386
and force callers to use PCPU_GET() and know that they need locking.

Bruce