Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 27 May 2012 07:49:36 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        src-committers@FreeBSD.org, svn-src-all@FreeBSD.org, Ed Schouten <ed@FreeBSD.org>, rwatson@FreeBSD.org, Bruce Evans <brde@optusnet.com.au>, svn-src-head@FreeBSD.org, jonathan@FreeBSD.org
Subject:   Re: svn commit: r236026 - in head/sys: amd64/linux32 compat/freebsd32 kern
Message-ID:  <20120527043827.W3357@besplex.bde.org>
In-Reply-To: <20120526164927.GU2358@deviant.kiev.zoral.com.ua>
References:  <201205252150.q4PLomFk035064@svn.freebsd.org> <20120526173233.A885@besplex.bde.org> <20120526164927.GU2358@deviant.kiev.zoral.com.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 26 May 2012, Konstantin Belousov wrote:

> On Sat, May 26, 2012 at 10:21:25PM +1000, Bruce Evans wrote:

Please don't quote the whole thing.

>> On Fri, 25 May 2012, Ed Schouten wrote:
>>
>>> Log:
>>> Remove use of non-ISO-C integer types from system call tables.
>>>
>>> These files already use ISO-C-style integer types, so make them less
>>> inconsistent by preferring the standard types.
>>
>> These should actually be Linux types l_foo_t.  ISO-C-style integer types
>> [... but only in the low-level linux headers where this is possible]

>>> Modified: head/sys/amd64/linux32/syscalls.master
>>> ==============================================================================
>>> --- head/sys/amd64/linux32/syscalls.master	Fri May 25 21:12:24 2012
>>> (r236025)
>>> +++ head/sys/amd64/linux32/syscalls.master	Fri May 25 21:50:48 2012
>>> (r236026)
>>> @@ -54,8 +54,8 @@
>>> 				    l_int mode); }
>>> 9	AUE_LINK	STD	{ int linux_link(char *path, char *to); }
>>> 10	AUE_UNLINK	STD	{ int linux_unlink(char *path); }
>>> -11	AUE_EXECVE	STD	{ int linux_execve(char *path, u_int32_t
>>> *argp, \
>>> -				    u_int32_t *envp); }
>>> +11	AUE_EXECVE	STD	{ int linux_execve(char *path, uint32_t
>>> *argp, \
>>> +				    uint32_t *envp); }
>>
>> argp and envp aren't uintany_t * in either Linux or FreeBSD.  They start as
>> "char * const *".  There is no Linux type for an indirect "char *", and one
>> was hacked up here by pretending that "char *" is u_int32_t (it is actually
>> just 32 bits).  Using l_caddr_t seems to be the best hack available (since
>> by abusing l_caddr_t, we know that it is actually char *).
>>
>> The `const' in the type for argp and envp is further from being handled
>> correctly.  Most or all syscall.master's just type-pun it away.  Similarly
>> for "const char *path".
>>
>> All the non-indirect "char *"s for pathnames and other things seem to be
>> completely wrong on amd64 too.  These pointers start as 32 bits, and it
>> takes more than a bad type pun to turn then into kernel 64-bit pointers.
>> The magic for this seems to be:
>> - all args are converted to 64 bits (by zero-extension?) at a low level
>>...
> The 'low level' AKA magic happens in several *_fetch_syscall_args()
> functions. For both linux32 and freebsd32, the magic code automatically
> zero-extends the arguments into 64bit entities. Linux passes args in
> registers, while FreeBSD uses words on stack.

Actually, the amd64 linux_fetch32_fetch_syscall_args() just copies from
64-bit registers frame->tf_r* to 64-bit sa->args[*].  I can't see how
this gives anything except garbage in the top bits.  Is there magic in
the switch to 64-bit mode that sets the top bits?  Anyway, sign extension
would give garbage for unsigned args, and zero-extension would give
garbage for negative signed args.

The amd64 ia32_fetch_syscall_args() is quite different.  Now the args
stack as 32 bits on the stack, so normal C accesses naturally extend
them when assigning them to 64-bit memory sa->args[*].  The stack is
in user space, so normal C accesses are unavailable at first.  sa->code
is read using fuword32(), which gives zero-extension.  Then the stack
is copied in and normal C accesses become available.  Finally, all
args are copied from 32-bit args[i] to 64-bit sa->args[i].  args[i]
is u_int32_t, so this indeed gives zero-extension.  But args[i] is
signed (int64_t), since it is register_t which is bogusly signed.
So for negative args, overflow occurs when int32_t accessed as u_int32_t
via the type pun in the declaration of args[].  There is no further
overflow when the result is assigned to sa->args[i], but the result
is wrong (no longer negative).  It takes further magic to undo this.

BTW, struct syscall_args has bad names and formatting.  It is missing
indentation and sa_prefixes for all its members.  These bugs are missing
for the 64-bit register side of the copy (struct trapframe).

> The types in the syscalls.master prototype should be in fact selected
> to fit the in-kernel prototypes for the functions implementing the syscalls,

No, they have to match the types actually passed so as to ignore any
garbage bits created by previous magic steps.  The lower levels cannot
do a correct conversion since they don't have access to the type info
generated from syscalls.master, and they shouldn't do a correct conversion
since this is easier to do later -- only the functions implementing the
syscalls can do it easily.  The type info is mostly encoded in syscall
args structs for the individual functions, in a way that the conversions
only need to access fields in the structs to do the conversions (this
only works for simple conversions but handles ignoring any leading and
trailing bits up to reasonably boundaries, since this is needed for
endianness handling).

> esp. for NOPROTO cases, and not to the low-level layout of the syscall
> entry data.

Of course the layout must match the one actually used by the function
implementing the syscall.  For NOPROTO cases, this is given by
sys/sysproto.h, not the syscall's C source code.  This cannot use
l_foo of course.  But this is fragile, and there are only 41 NOPROTO
cases, mostly for the simplest syscalls.  It would be easy to wrap
all of these so as never to use NOPROTO.  It is also easy to check 41
cases.  I checked them all and only found a few harmless errors, where
only differences with kern/syscalls.master are counted as errors.  These
are mainly u_ints for ffs and perhaps a couple of ints for pid_t and
vice versa.  Anyway, it is correct to not use l_foo in the NOPROTO cases.

compat/freebsd32/syscalls.master has 253 NOPROTO cases.  Now too many
to wrap or check easily.  There are lots of pointer args which depend
on magic.  mknod() has a dev arg that is only int, but there has been
discussion recently of bloating dev_t to 64 bits.  One of the new
CAP syscalls is more surely broken than the other, since it is NOPROTO
but has a uint64_t arg.  fcntl()'s 3rd arg is bogusly declared as long
to match kern/sysalls.master.  According to fcntl(3), the 3rd arg is
actually int.  Zero-extension of the int followed by not convering the
64-bit sa->argsp[i] back to int32_t via the fcntl_args struct might
cause problems, but apparently this arg is rarely negative.  Lots of
NOPROTO entries are bug for bug compatible with kern/syscalls.master.
It's weird that setreuid() is declared as taking int ids in the native
and ia32 versions but as taking uid_t ids in the linux32 version.  linux32
had to be more careful about uids because it also has 16-bit ones.

>> - the args struct for a pathname is
>>     [left padding]; char *; [right padding];
>>   Since the char * is misdeclared, the explicit padding is null, but the
>>   layout of the args struct is correct because the wrong arg type in it
>>   supplies equivalent padding (extra 32 bits on the right).
>>
> The arg struct layout is irrelevant, since fetch_syscall_args() functions
> perform the needed translation from process ABI to kernel ABI.

No, these functions cannot and do not do more than a simple translation,
since they don't know the arg's actual size or signedness.  They do
basically the endianness part of the args struct (add right padding
for little-endian).  They messes up extra bits mainly for the top 32.
The arg should have been promoted to 32 bits in userland earlier, but
it is unclear of 32-bit ABIs require the extra bits to be non-garbage.
When an arg is say int16_t, gcc does sign-extension in callers, but
then doesn't trust this to have occured in callees, and does extra
work to demote to the original type and then often to promote back to
the 32-bit type.  The args struct arranges for the corresponding
demotion and repromotion when the callee is the kernel, provided the
arg types are actually declared correctly.  Now the demotion is almost
free, by the type pun of ignoring unwanted bits in the args struct.

> I think that the padding could be completely eliminated for translated
> ABI, but since it is easier to reuse makesyscalls.sh instead of creating
> ABI-specific script, and since there is quite non-trivial count of NOPROTO
> declarations that just match the native-ABI syscall handlers, it is better
> not to start that.

It would just be slower without reorganizing all syscall layers.  You
would need a large switch statement with all syscalls that need special
translation, or an extra layer of syscall functions.  There are already
too many layers of syscall functions.  FreeBSD started with 1 or 2
more layers than Linux.  Now it has 4 or 5 more layers for many important
syscalls :-(.  The second lowest layer was syscall(), and it much slower
than Linux, since it has to fetch the args from the stack where Linux
has most args in registers.  With Linux-style layering, you can go
straight from the lowest level to the individual syscall level (and
shouldn't have sys_foo(), and a couple of kern_foo() layers between that
and the actual work).  Args are passed in registers and for the i386 ABI
only to be pushed onto the stack to convert say lseek(2) into lseek(9).
But given that FreeBSD lseek(9) wants to use an args struct, you shouldn't
do extra work to pack the args struct.  Leaving it sparse doesn't save
cost much time unpacking it in read(9).

>> - the "char *" in the args struct is not actually a char *, and is unusable
>>   directly in the kernel.  However, it is only used in copyin() and
>>   copyout(), where it becomes a user address and works correctly.  (An
>>   older bug in this that the user address for copy*() is declared as
>>   "void *".  "void *" means a kernel pointer.  The type of a user
>>   address should be more like vm_offset_t, but even that needs logical
>>   translation for linux32).
>>
> It is char *, but in different address space. Linux-style type qualifiers
> like __usermode would be appropriate there (remember far/near ?), but
> we do not have static checkers that do understand the difference.

Of course it works for linux32, but this depends on the static checkers
being stupid and on the user address space not being larger than the
kernel address space (or at least on kernel pointers being large
enough).  Native mode depends on a flat virtual address space with
ordinary pointers working, except in vm where it doesn't depend on the
pointers part doesn't depend on the pointers part.  The type puns from
this are just more obvious in emulators.

>>> 14	AUE_MKNOD	STD	{ int linux_mknod(char *path, l_int mode, \
>>
>> Broken except in the K&R case, but amd64 is too new to pretend to support
>> K&R.  Bug for bug compatible with mknod() and some other syscalls in
>
> C std version is irrelevant there. Args fetch code promotes all arguments
> to (u)int(32).

No, the args fetch code starts with all args already 32 bits (32-bit
register_t for FreeBSD native i386).  Promotion from mode_t to int occurs
in the userland ABI.  I just checked that neither gcc -O3 nor clang -O3
depends on this promotion being a strict C one.  They both load an arg
that is declared as int32_t using "movswl" from the stack, but if the
ABI guaranteed strict C promotion, then optimal code would depend on
this and use "movl" from the stack.  -march=i386 makes no difference to
this (movswl is almost as fast as movl on modern x86, but on original
i386 it was much slower).

I try to avoid using args shorter than int because of this
pessimization.  It was supposed to be an optimization in C90 to allow
functions to pass short args without promotion, but ABI and FUD makes
it mostly a pessimization on i386.  First, the ABI requires passing
32 bits on the stack.  Before C90, and still if no prototype in scope,
the caller is required to promote an int16_t arg to int according to C
promotion rules.  Callees could assume this.  Now, callees can't assume
this, since they can't know if the caller did more than pass garbage
padding bits, since the callee might have a prototype in scope and the
ABI may allow passing garbage.  This pessimization can now cost a whole
nanosecond per arg per call :-).

Anyway, the kernel should fear the ABI just as much as C compilers,
since it easier to give correct declarations that do the demotions
automatically via the args struct than it is to give incorrect ones
that work (just copy arg types from userland prototypes, except
for variadic functions, and don't allow variadic functions like open()
and fcntl() for syscalls).

>>> Modified: head/sys/compat/freebsd32/syscalls.master
>>> ==============================================================================
>>> --- head/sys/compat/freebsd32/syscalls.master	Fri May 25 21:12:24 2012
>>> (r236025)
>>> +++ head/sys/compat/freebsd32/syscalls.master	Fri May 25 21:50:48 2012
>>> (r236026)
>>> @@ -104,9 +104,9 @@
>>> 				    int flags); }
>>> 28	AUE_SENDMSG	STD	{ int freebsd32_sendmsg(int s, struct
>>> msghdr32 *msg, \
>>> 				    int flags); }
>>> -29	AUE_RECVFROM	STD	{ int freebsd32_recvfrom(int s, u_int32_t
>>> buf, \
>>> -				    u_int32_t len, int flags, u_int32_t
>>> from, \
>>> -				    u_int32_t fromlenaddr); }
>>> +29	AUE_RECVFROM	STD	{ int freebsd32_recvfrom(int s, uint32_t
>>> buf, \
>>> +				    uint32_t len, int flags, uint32_t from, \
>>> +				    uint32_t fromlenaddr); }
>>
>> Oops, I didn't looke at this file when I said that "ISO-C-style integer
>> types seem to have only been used for a couple of uintptr_t's".  This
>> file is independent of Linux, so it can't use l_foo.  It hard-codes 32
>> instead, starting with the directory name.  Still, all of the types in
>> the above are fairly bogus and hard to understand:
>> - "int" for `s' and `flags' depends on ints being 32 bits
>>
> I agree that this is the biggest type-pun in the whole compat32/linux32
> components. It would be nice to use int32_t/uint32_t instead of int there,
> since we in fact operate on ABI-defined type.

Something like l_int (e_int? for emulated int?) would be better because
it is shorter and more specific.  Similarly for l_size_t.  You can almost
translate back and forth automatically by adding and removing the prefix,
while there is no way to decode from a generic int32_t back to a foo_t.

> But I do not think it is much worry to expect an arch to appear which have
> sizeof(int) != 4. It is already engraved in stone.

Like long was in 1982.  And the result of fixing that was the long long,
LFS, Linux loff_t and lots of foo64() syscalls in Linux just to support
64-bit off_t.  FreeBSD missed most of this since BSD had a clean "break"
and converted early to 64-bit off_t and made long 64 bits on alpha.

>>> 481	AUE_KILL	NOPROTO	{ int thr_kill2(pid_t pid, long id, int
>>> sig); }
>>
>> Untranslated pid_t depends on pid_t being no larger than int.
>>
>> Untranslated long completely breaks this.

Actually no breakage, especially with NOPROTO, and without NOPROTO there
is only a minor problem with zero-extension.  It takes a 64-bit type to
break things.

>>> -514	AUE_CAP_NEW	NOPROTO	{ int cap_new(int fd, u_int64_t rights); }
>>> +514	AUE_CAP_NEW	NOPROTO	{ int cap_new(int fd, uint64_t rights); }
>>
>> Broken, assuming that all the other spittings into 2 uint32_t's are needed.
>> ..
> Yes, this is broken.
> I added Robert and Jonathan to Cc:.

We only found this 64-bit type for freebsd32.

>> These bugs are easiest to fix in the Linux syscall.master's since the
>> bugs are mostly already avoided using l_foo_t.  In amd64/linux32/
>> syscalls.master, I only noticed the following ones:
>> ...
>> % 193	AUE_TRUNCATE	STD	{ int linux_truncate64(char *path, \
>> % 				    l_loff_t length); }
>> % 194	AUE_FTRUNCATE	STD	{ int linux_ftruncate64(l_uint fd, \
>> % 				    l_loff_t length); }
>>
>> How can this work?  I was going to say that the splitting up of the 64-bit
>> args in compat/freebsd32/syscalls.master was unnecessary because it is
>> apparently not needed here.  Then I misread this as doing the splitting.
>> But it doesn't.  Similarly later in this file.

I'm now surer that loff_t cannot work.  It gets passed as 2 32-bit registers.
These get interpreted as 64-bit registers and copied to 2 64-bit args.
But the above only describes 1 64-bit args.

However, x86 is little-endian, so many values will work by ignoring the
second register.  All values below 2G will work provided the top bits are
either sign-extended or zero-extended.  All values between 2G and 4G
will work provided the top bits are zero-extended.

Negative loff_t's are common for linux_llseek() but not for linux_ltruncate().
And linux_llseek() doesn't have this bug since it is inmplemented almost
correctly bt splitting the arg into "l_ulong ohigh, l_ulong olow" (is this
big-endian order really correct?!).  linux_llseek() claims to retun int
instead of the 64-bit offset, and that is correct since the offset is
returned indirectly.  But linux_lseek() also claims to return int instead
of l_off_t.  That is bogus but works.

Here is the complete list of syscalls with buggy loff_t's from linux32:

% 180	AUE_PREAD	STD	{ int linux_pread(l_uint fd, char *buf, \
% 				    l_size_t nbyte, l_loff_t offset); }
% 181	AUE_PWRITE	STD	{ int linux_pwrite(l_uint fd, char *buf, \
% 				    l_size_t nbyte, l_loff_t offset); }
% 193	AUE_TRUNCATE	STD	{ int linux_truncate64(char *path, \
% 				    l_loff_t length); }
% 194	AUE_FTRUNCATE	STD	{ int linux_ftruncate64(l_uint fd, \
% 				    l_loff_t length); }

WHen there is only 1 loff_t arg and it is the last one, it won't mess up
the other args.  But what about loff_t apparently being in big-endian
order for linux_llseek().  I guess the order is actually little-endian,
but the order is reversed by allocation of the 2 halves in registers.

% 250	AUE_NULL	STD	{ int linux_fadvise64(int fd, l_loff_t offset, \
% 					l_size_t len, int advice); }
% 272	AUE_NULL	STD	{ int linux_fadvise64_64(int fd, \
% 					l_loff_t offset, l_loff_t len, \
% 					int advice); }

Now the loff_t's are in the middle of the args, and/or there are 2 of them,
so later args are messed up.

I didn't quote the whole thing, but its still a 17K reply, sorry.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120527043827.W3357>