From owner-svn-src-head@FreeBSD.ORG  Mon May 28 19:48:20 2012
Return-Path: <owner-svn-src-head@FreeBSD.ORG>
Delivered-To: svn-src-head@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 93F2910657C2;
	Mon, 28 May 2012 19:48:20 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from fallbackmx09.syd.optusnet.com.au
	(fallbackmx09.syd.optusnet.com.au [211.29.132.242])
	by mx1.freebsd.org (Postfix) with ESMTP id 0CC208FC12;
	Mon, 28 May 2012 19:48:19 +0000 (UTC)
Received: from mail34.syd.optusnet.com.au (mail34.syd.optusnet.com.au
	[211.29.133.218])
	by fallbackmx09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q4SJmDPX000519; Tue, 29 May 2012 05:48:13 +1000
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail34.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q4SJm3tU025917
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 29 May 2012 05:48:04 +1000
Date: Tue, 29 May 2012 05:48:03 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: "Robert N. M. Watson" <rwatson@freebsd.org>
In-Reply-To: <71304742-3635-49C6-BE36-60E4F4A6FC20@freebsd.org>
Message-ID: <20120529033708.U2726@besplex.bde.org>
References: <201205252150.q4PLomFk035064@svn.freebsd.org>
	<20120526173233.A885@besplex.bde.org>
	<20120526164927.GU2358@deviant.kiev.zoral.com.ua>
	<20120527043827.W3357@besplex.bde.org>
	<20120528133633.GB2358@deviant.kiev.zoral.com.ua>
	<71304742-3635-49C6-BE36-60E4F4A6FC20@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org,
	Ed Schouten <ed@freebsd.org>, svn-src-head@freebsd.org,
	Bruce Evans <brde@optusnet.com.au>,
	Konstantin Belousov <kostikbel@gmail.com>, jonathan@freebsd.org
Subject: Re: svn commit: r236026 - in head/sys: amd64/linux32
 compat/freebsd32 kern
X-BeenThere: svn-src-head@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SVN commit messages for the src tree for head/-current
	<svn-src-head.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-head>,
	<mailto:svn-src-head-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-head>
List-Post: <mailto:svn-src-head@freebsd.org>
List-Help: <mailto:svn-src-head-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-head>,
	<mailto:svn-src-head-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 May 2012 19:48:20 -0000

On Mon, 28 May 2012, Robert N. M. Watson wrote:

> On 28 May 2012, at 09:36, Konstantin Belousov wrote:
>
>> On Sun, May 27, 2012 at 07:49:36AM +1000, Bruce Evans wrote:
>>> On Sat, 26 May 2012, Konstantin Belousov wrote:
>>>
>>>> On Sat, May 26, 2012 at 10:21:25PM +1000, Bruce Evans wrote:
>>>> The 'low level' AKA magic happens in several *_fetch_syscall_args()
>>>> functions. For both linux32 and freebsd32, the magic code automatically
>>>> zero-extends the arguments into 64bit entities. Linux passes args in
>>>> registers, while FreeBSD uses words on stack.
>>>
>>> Actually, the amd64 linux_fetch32_fetch_syscall_args() just copies from
>>> 64-bit registers frame->tf_r* to 64-bit sa->args[*].  I can't see how
>>> this gives anything except garbage in the top bits.  Is there magic in
>>> the switch to 64-bit mode that sets the top bits?  Anyway, sign extension
>>> would give garbage for unsigned args, and zero-extension would give
>>> garbage for negative signed args.
>> Hardware zero-extends any register touched in the 32bit mode.
>>
>> In fact, please see r217991 for related bug.
>
> This may well be true on Intel, but is not true of MIPS -- which we probably don't care about currently for the purposes of Linux emulation, but maybe someday we will. On MIPS, 32-bit values are sign-extended rather than zero-extended.

Please use the unix newline character in mail.

The translation is MD so "this" (not the newline) isn't a problem.

> I see a somewhat complex thread here, but am not sure I quite understand the import for Capsicum. Is the 64-bit rights mask as part of system call arguments not working properly in compat32 scenarios? Or are there issues outside of the compat environment? Right now compat32 is not well-supported with Capsicum, but fixing that is quite important to productionising Capsicum.

64-bit args need special translation, and the rights mask doesn't have any.
At best, the low 32 bits of it might work accidentally.  64-bit args in
ia32 start as 2 32-bit ones.  These get turned into 2 64-bit ones
(zero extended).  These must be combined (in non-automatically generated
code) into a 32-bit one.

The automatically-generated args struct promotes the 32-bit args to 64
bits so that it can be passed directly to native syscalls.  This doesn't
work if there are any specially specially-translated args.  Then the
whole args struct must be translated (by simple copying for 32-bit args)
lseek() shows how this should be done:

% #ifdef COMPAT_43
% int
% ofreebsd32_lseek(struct thread *td, struct ofreebsd32_lseek_args *uap)
% {
% 	struct lseek_args nuap;
% 
% 	nuap.fd = uap->fd;
% 	nuap.offset = uap->offset;
% 	nuap.whence = uap->whence;
% 	return (sys_lseek(td, &nuap));
% }
% #endif

This one seems to be to work around olseek() being broken in
kern/syscalls.master.  Its offset arg is only 32 bits, so it should
be possible to call olseek() directly, but the bug makes the args
structs incompatible, so the above does a simple translation by
copying args and calls the full native lseek.  olseek() uses the
same copying code, but never worked since it starts with a wrong
args struct.  Apparently no one uses olseek(), but this compat lseek()
is used a bit so the bug was noticed.

The args stucts are:

% struct ofreebsd32_lseek_args {
% 	char fd_l_[PADL_(int)]; int fd; char fd_r_[PADR_(int)];
% 	char offset_l_[PADL_(int)]; int offset; char offset_r_[PADR_(int)];
% 	char whence_l_[PADL_(int)]; int whence; char whence_r_[PADR_(int)];
% };

The padding is to 64 bits.  All the types are correct, although the
type of `offset' is confusing.  Its type is that of the pre-4.4BSD off_t,
which was long, which was 32 bits on ia32 back then.  This is now spelled
as int, so that we can't easily see either of the 4 layers of translation
in it (off_t = long -> int32_t back then (type pun) -> int32_t now (type
pun) = int now.

% struct olseek_args {
% 	char fd_l_[PADL_(int)]; int fd; char fd_r_[PADR_(int)];
% 	char offset_l_[PADL_(long)]; long offset; char offset_r_[PADR_(long)];
% 	char whence_l_[PADL_(int)]; int whence; char whence_r_[PADR_(int)];
% };

The type of `offset' is broken.  It needs to be 32 bits since that is what
it was on ia32 before 4.4BSD, as above.

amd64 shouldn't support pre-4.4BSD, and making an olseek() call is close
to impossible on it.  However, on amd64 with ia32 support, you have to
enable COMPAT_43 for the ofreebsd32_lseek() to be compiled, and this
gives olseek() too, so you might as well use it after fixing it.

If a native olseek() call is somehow made on amd64, then it would start
with a 32-bit offset and this might be extended correctly in userland,
and then the wrong `long offset' in the above would work.  But there is
no need to risk this.  By declaring the offset as signed 32 bits, it
will be extended correctly when it is loaded from the args struct.

Note that the above translation `nuap.offset = uap->offset;' does a
critical non-null currect extension and this requires the olseek()
being declared correctly in the ia32 case, and on standard magic for
the args struct.  `offset' starts as 32 bits, but has been zero-extended
the args struct.  But its type in *uap is 32 bits, so the zero bits are
ignored, and its type in *uap is signed so it is sign-extended before
storing it into nuap.  The corresponding assignment in olseek() just
copies it.

% 
% int
% freebsd32_lseek(struct thread *td, struct freebsd32_lseek_args *uap)
% {
% 	int error;
% 	struct lseek_args ap;
% 	off_t pos;
% 
% 	ap.fd = uap->fd;
% 	ap.offset = PAIR32TO64(off_t,uap->offset);

This does the repacking of the offset.  This looks nice at first, but
it is missing a space after the comma and the macro is disgusting
internally.  uap->offset looks like a single term, but it is expanded
into an expression involving uap->offset1 and uap->offset2 using
delicate or invalid token pasting.  The macro is only defined in this
file.  All other places that I looked at do the repacking using explicit
expressions.  The macro is more needed here since the endianness varies.

% 	ap.whence = uap->whence;
% 	error = sys_lseek(td, &ap);
% 	/* Expand the quad return into two parts for eax and edx */
% 	pos = *(off_t *)(td->td_retval);
% 	td->td_retval[RETVAL_LO] = pos & 0xffffffff;	/* %eax */
% 	td->td_retval[RETVAL_HI] = pos >> 32;		/* %edx */
% 	return error;
% }

This function has lots more style bugs:
- unsorted declarations
- bogus comments about i386 registers.  The RETVAL* macros remove the need
   for register magic, and 3 of 4 arches that this code can be configured
   for (if not work), namely ia64, mips and powerpc, don't have any i386
   registers.
- *(off_t *)(td->td_retval) is hackish.  We use this hack a lot, er, a little
   for 64-bit return values.  It is to possibly combine td_retval[0]
   with td_retval[1] to create an off_t, without ifdef or if tangles
   for endianness, etc.  It is probably unnecessary here, since this
   code is only useful on 64 bit arches and then everything is in
   td_retval[0] (until we repack for returning).  This hack assumes
   either 64-bit td_retval[0] or little-endian to work anyway.
- `quad' in the comment is bogus too
- missing spaces around return value.  Perhaps required to match non-KNF
   stype elsewhere in compat code.

Fixing some style bugs gives:

@ int
@ freebsd32_lseek(struct thread *td, struct freebsd32_lseek_args *uap)
@ {
@ 	struct lseek_args ap;
@ 	off_t pos;
@ 	int error;
@ 
@ 	ap.fd = uap->fd;
@ 	ap.offset = PAIR32TO64(off_t, uap->offset);
@ 	ap.whence = uap->whence;
@ 	error = sys_lseek(td, &ap);
@ 	/* Expand the off_t return into two parts for 2 32-bit registers. */
@	/* Assume that it initially fits in 1 register_t (XXX?) */
@ 	pos = td->td_retval[0];
@ 	td->td_retval[RETVAL_LO] = pos & 0xffffffff;
@ 	td->td_retval[RETVAL_HI] = pos >> 32;
@ 	return error;
@ }

Another bug is now clearer: we assume that either right shifting does the
right thing with the sign bit, or that extra top bits for RETVAL_HI don't
matter.  I think it actually does a wrong thing on amd64, and this is
responsible for some of the non-zero-extensions that I observed:
    Suppose pos is -1; this is all bits 1 (0xffffffffffffffff; it becomes
    0xffffffff in RETVAL_LO, but remains 0xffffffffffffffff in RETVAL_HI;
    the kernel then returns extra bits to 32-bit mode.  They have no
    effect in 32-bit mode, but if the register is not touched in userland
    then the come back to the kernel on the next syscall and mess up the
    indended zero-extension.  Same sign bug as sigreturn() used to have.

    But as I said before, even looking at the top bits is a bug -- just
    declare all the syscalls correctly so that the top bits are ignored
    and correct sign extension is done.

I looked for other td_retval hacks and could only find these 2 in kern:

% int
% ogethostid(td, uap)
% 	struct thread *td;
% 	struct ogethostid_args *uap;
% {
% 	size_t len = sizeof(long);
% 	int name[2];
% 
% 	name[0] = CTL_KERN;
% 	name[1] = KERN_HOSTID;
% 	return (kernel_sysctl(td, name, 2, (long *)td->td_retval, &len,
% 	    NULL, 0, NULL, 0));
% }

I'm not sure what kernel_sysctl() does with this, but the cast has at
least 2 layers of bogusness.  td_retval starts as 2 register_t's and
decays to register_t *.  kernel_sysctl() doesn't even take a long *
arg.  It takes a void *.  Not casting would convert the register_t *
directly to a void * and undefined things would only happen later when
kernel_sysctl() doesn't convert it back to a register_t *.  Now they
can happen for conversion to long *.  In practice, long is similar to
register_t on all supported arches, instead of correct and twice as
long, so none of this makes any difference.

% int
% sys_lseek(td, uap)
% ...
% 	fp->f_offset = offset;
% 	VFS_KNOTE_UNLOCKED(vp, 0);
% 	*(off_t *)(td->td_retval) = fp->f_offset;

Seems to work as intended.

I talked kib out of using similar large ifdefs or similar hacks for
ssize_t.  If ssize_t is 64 bits, then register_t should be 64 bits too,
so td_retval[1] is not used and a hack like the above wouldn't touch
it.  If the size of the foo_t being stored using the *(foo_t *) hack
is not precisely 1 or 2 register_t's, then the store will overrun or
underrun.  But that case needs complicated conditionals to handle,
and its easier not to fix it before it happens.  The above hack hasn't
needed fixing in almost 20 years.  Hopefully any overrun or underrun
would be more obvious than bugs caused by above shift and sign
extension bugs.

Bruce