Date: Tue, 29 May 2012 05:48:03 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: "Robert N. M. Watson" <rwatson@freebsd.org> Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, Ed Schouten <ed@freebsd.org>, svn-src-head@freebsd.org, Bruce Evans <brde@optusnet.com.au>, Konstantin Belousov <kostikbel@gmail.com>, jonathan@freebsd.org Subject: Re: svn commit: r236026 - in head/sys: amd64/linux32 compat/freebsd32 kern Message-ID: <20120529033708.U2726@besplex.bde.org> In-Reply-To: <71304742-3635-49C6-BE36-60E4F4A6FC20@freebsd.org> References: <201205252150.q4PLomFk035064@svn.freebsd.org> <20120526173233.A885@besplex.bde.org> <20120526164927.GU2358@deviant.kiev.zoral.com.ua> <20120527043827.W3357@besplex.bde.org> <20120528133633.GB2358@deviant.kiev.zoral.com.ua> <71304742-3635-49C6-BE36-60E4F4A6FC20@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 28 May 2012, Robert N. M. Watson wrote: > On 28 May 2012, at 09:36, Konstantin Belousov wrote: > >> On Sun, May 27, 2012 at 07:49:36AM +1000, Bruce Evans wrote: >>> On Sat, 26 May 2012, Konstantin Belousov wrote: >>> >>>> On Sat, May 26, 2012 at 10:21:25PM +1000, Bruce Evans wrote: >>>> The 'low level' AKA magic happens in several *_fetch_syscall_args() >>>> functions. For both linux32 and freebsd32, the magic code automatically >>>> zero-extends the arguments into 64bit entities. Linux passes args in >>>> registers, while FreeBSD uses words on stack. >>> >>> Actually, the amd64 linux_fetch32_fetch_syscall_args() just copies from >>> 64-bit registers frame->tf_r* to 64-bit sa->args[*]. I can't see how >>> this gives anything except garbage in the top bits. Is there magic in >>> the switch to 64-bit mode that sets the top bits? Anyway, sign extension >>> would give garbage for unsigned args, and zero-extension would give >>> garbage for negative signed args. >> Hardware zero-extends any register touched in the 32bit mode. >> >> In fact, please see r217991 for related bug. > > This may well be true on Intel, but is not true of MIPS -- which we probably don't care about currently for the purposes of Linux emulation, but maybe someday we will. On MIPS, 32-bit values are sign-extended rather than zero-extended. Please use the unix newline character in mail. The translation is MD so "this" (not the newline) isn't a problem. > I see a somewhat complex thread here, but am not sure I quite understand the import for Capsicum. Is the 64-bit rights mask as part of system call arguments not working properly in compat32 scenarios? Or are there issues outside of the compat environment? Right now compat32 is not well-supported with Capsicum, but fixing that is quite important to productionising Capsicum. 64-bit args need special translation, and the rights mask doesn't have any. At best, the low 32 bits of it might work accidentally. 64-bit args in ia32 start as 2 32-bit ones. These get turned into 2 64-bit ones (zero extended). These must be combined (in non-automatically generated code) into a 32-bit one. The automatically-generated args struct promotes the 32-bit args to 64 bits so that it can be passed directly to native syscalls. This doesn't work if there are any specially specially-translated args. Then the whole args struct must be translated (by simple copying for 32-bit args) lseek() shows how this should be done: % #ifdef COMPAT_43 % int % ofreebsd32_lseek(struct thread *td, struct ofreebsd32_lseek_args *uap) % { % struct lseek_args nuap; % % nuap.fd = uap->fd; % nuap.offset = uap->offset; % nuap.whence = uap->whence; % return (sys_lseek(td, &nuap)); % } % #endif This one seems to be to work around olseek() being broken in kern/syscalls.master. Its offset arg is only 32 bits, so it should be possible to call olseek() directly, but the bug makes the args structs incompatible, so the above does a simple translation by copying args and calls the full native lseek. olseek() uses the same copying code, but never worked since it starts with a wrong args struct. Apparently no one uses olseek(), but this compat lseek() is used a bit so the bug was noticed. The args stucts are: % struct ofreebsd32_lseek_args { % char fd_l_[PADL_(int)]; int fd; char fd_r_[PADR_(int)]; % char offset_l_[PADL_(int)]; int offset; char offset_r_[PADR_(int)]; % char whence_l_[PADL_(int)]; int whence; char whence_r_[PADR_(int)]; % }; The padding is to 64 bits. All the types are correct, although the type of `offset' is confusing. Its type is that of the pre-4.4BSD off_t, which was long, which was 32 bits on ia32 back then. This is now spelled as int, so that we can't easily see either of the 4 layers of translation in it (off_t = long -> int32_t back then (type pun) -> int32_t now (type pun) = int now. % struct olseek_args { % char fd_l_[PADL_(int)]; int fd; char fd_r_[PADR_(int)]; % char offset_l_[PADL_(long)]; long offset; char offset_r_[PADR_(long)]; % char whence_l_[PADL_(int)]; int whence; char whence_r_[PADR_(int)]; % }; The type of `offset' is broken. It needs to be 32 bits since that is what it was on ia32 before 4.4BSD, as above. amd64 shouldn't support pre-4.4BSD, and making an olseek() call is close to impossible on it. However, on amd64 with ia32 support, you have to enable COMPAT_43 for the ofreebsd32_lseek() to be compiled, and this gives olseek() too, so you might as well use it after fixing it. If a native olseek() call is somehow made on amd64, then it would start with a 32-bit offset and this might be extended correctly in userland, and then the wrong `long offset' in the above would work. But there is no need to risk this. By declaring the offset as signed 32 bits, it will be extended correctly when it is loaded from the args struct. Note that the above translation `nuap.offset = uap->offset;' does a critical non-null currect extension and this requires the olseek() being declared correctly in the ia32 case, and on standard magic for the args struct. `offset' starts as 32 bits, but has been zero-extended the args struct. But its type in *uap is 32 bits, so the zero bits are ignored, and its type in *uap is signed so it is sign-extended before storing it into nuap. The corresponding assignment in olseek() just copies it. % % int % freebsd32_lseek(struct thread *td, struct freebsd32_lseek_args *uap) % { % int error; % struct lseek_args ap; % off_t pos; % % ap.fd = uap->fd; % ap.offset = PAIR32TO64(off_t,uap->offset); This does the repacking of the offset. This looks nice at first, but it is missing a space after the comma and the macro is disgusting internally. uap->offset looks like a single term, but it is expanded into an expression involving uap->offset1 and uap->offset2 using delicate or invalid token pasting. The macro is only defined in this file. All other places that I looked at do the repacking using explicit expressions. The macro is more needed here since the endianness varies. % ap.whence = uap->whence; % error = sys_lseek(td, &ap); % /* Expand the quad return into two parts for eax and edx */ % pos = *(off_t *)(td->td_retval); % td->td_retval[RETVAL_LO] = pos & 0xffffffff; /* %eax */ % td->td_retval[RETVAL_HI] = pos >> 32; /* %edx */ % return error; % } This function has lots more style bugs: - unsorted declarations - bogus comments about i386 registers. The RETVAL* macros remove the need for register magic, and 3 of 4 arches that this code can be configured for (if not work), namely ia64, mips and powerpc, don't have any i386 registers. - *(off_t *)(td->td_retval) is hackish. We use this hack a lot, er, a little for 64-bit return values. It is to possibly combine td_retval[0] with td_retval[1] to create an off_t, without ifdef or if tangles for endianness, etc. It is probably unnecessary here, since this code is only useful on 64 bit arches and then everything is in td_retval[0] (until we repack for returning). This hack assumes either 64-bit td_retval[0] or little-endian to work anyway. - `quad' in the comment is bogus too - missing spaces around return value. Perhaps required to match non-KNF stype elsewhere in compat code. Fixing some style bugs gives: @ int @ freebsd32_lseek(struct thread *td, struct freebsd32_lseek_args *uap) @ { @ struct lseek_args ap; @ off_t pos; @ int error; @ @ ap.fd = uap->fd; @ ap.offset = PAIR32TO64(off_t, uap->offset); @ ap.whence = uap->whence; @ error = sys_lseek(td, &ap); @ /* Expand the off_t return into two parts for 2 32-bit registers. */ @ /* Assume that it initially fits in 1 register_t (XXX?) */ @ pos = td->td_retval[0]; @ td->td_retval[RETVAL_LO] = pos & 0xffffffff; @ td->td_retval[RETVAL_HI] = pos >> 32; @ return error; @ } Another bug is now clearer: we assume that either right shifting does the right thing with the sign bit, or that extra top bits for RETVAL_HI don't matter. I think it actually does a wrong thing on amd64, and this is responsible for some of the non-zero-extensions that I observed: Suppose pos is -1; this is all bits 1 (0xffffffffffffffff; it becomes 0xffffffff in RETVAL_LO, but remains 0xffffffffffffffff in RETVAL_HI; the kernel then returns extra bits to 32-bit mode. They have no effect in 32-bit mode, but if the register is not touched in userland then the come back to the kernel on the next syscall and mess up the indended zero-extension. Same sign bug as sigreturn() used to have. But as I said before, even looking at the top bits is a bug -- just declare all the syscalls correctly so that the top bits are ignored and correct sign extension is done. I looked for other td_retval hacks and could only find these 2 in kern: % int % ogethostid(td, uap) % struct thread *td; % struct ogethostid_args *uap; % { % size_t len = sizeof(long); % int name[2]; % % name[0] = CTL_KERN; % name[1] = KERN_HOSTID; % return (kernel_sysctl(td, name, 2, (long *)td->td_retval, &len, % NULL, 0, NULL, 0)); % } I'm not sure what kernel_sysctl() does with this, but the cast has at least 2 layers of bogusness. td_retval starts as 2 register_t's and decays to register_t *. kernel_sysctl() doesn't even take a long * arg. It takes a void *. Not casting would convert the register_t * directly to a void * and undefined things would only happen later when kernel_sysctl() doesn't convert it back to a register_t *. Now they can happen for conversion to long *. In practice, long is similar to register_t on all supported arches, instead of correct and twice as long, so none of this makes any difference. % int % sys_lseek(td, uap) % ... % fp->f_offset = offset; % VFS_KNOTE_UNLOCKED(vp, 0); % *(off_t *)(td->td_retval) = fp->f_offset; Seems to work as intended. I talked kib out of using similar large ifdefs or similar hacks for ssize_t. If ssize_t is 64 bits, then register_t should be 64 bits too, so td_retval[1] is not used and a hack like the above wouldn't touch it. If the size of the foo_t being stored using the *(foo_t *) hack is not precisely 1 or 2 register_t's, then the store will overrun or underrun. But that case needs complicated conditionals to handle, and its easier not to fix it before it happens. The above hack hasn't needed fixing in almost 20 years. Hopefully any overrun or underrun would be more obvious than bugs caused by above shift and sign extension bugs. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120529033708.U2726>