Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 14 Apr 2008 10:19:18 +0200
From:      Pawel Jakub Dawidek <pjd@FreeBSD.org>
To:        Kostik Belousov <kostikbel@gmail.com>
Cc:        Roman Divacky <rdivacky@freebsd.org>, rwatson@FreeBSD.garage.freebsd.pl, freebsd-arch@freebsd.org
Subject:   Re: final decision about *at syscalls
Message-ID:  <20080414081918.GA10478@garage.freebsd.pl>
In-Reply-To: <20080412131604.GX21209@deviant.kiev.zoral.com.ua>
References:  <20071218092222.GA9695@freebsd.org> <200712201138.56423.jhb@freebsd.org> <20080412112019.GI45299@garage.freebsd.pl> <20080412131604.GX21209@deviant.kiev.zoral.com.ua>

next in thread | previous in thread | raw e-mail | index | archive | help

--82I3+IH0IqGh5yIs
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Apr 12, 2008 at 04:16:04PM +0300, Kostik Belousov wrote:
> On Sat, Apr 12, 2008 at 01:20:19PM +0200, Pawel Jakub Dawidek wrote:
> > On Thu, Dec 20, 2007 at 11:38:55AM -0500, John Baldwin wrote:
> > > On Tuesday 18 December 2007 04:22:22 am Roman Divacky wrote:
> > > > Dear arch@
> > > >=20
> > > > Over this summer I was working (among other things) on *at family o=
f syscalls
> > > > kindly sponsored by Google (in their Summer of Code). The resulting=
 patch is=20
> > > > almost finished but I need to decide one design question. If you ar=
e not interested=20
> > > > in *at/namei feel free to skip this mail.
> > > >=20
> > > > The *at syscalls are a threads-oriented extension to basic file sys=
calls (think
> > > > of open(), fstat(), etc.) adding the possibility to specify from wh=
ere the search
> > > > for relative path should start.
> > > >=20
> > > > image that we have /tmp/foo/bar
> > > >=20
> > > > and CWD is set to "/tmp/", and the process has opened "foo" as dirf=
d. with ordinary
> > > > open() syscall you have to either
> > > >=20
> > > > 	chdir("/tmp/foo");open("./bar");
> > > >=20
> > > > or
> > > >=20
> > > > 	open("/tmp/foo/bar");
> > > >=20
> > > > The first approach is problematic because it changes CWD for all th=
reads in the process,
> > > > the second is prone to race-conditions as some of the components of=
 the path can
> > > > change in parallel with the "open".
> > > >=20
> > > > So POSIX introduced a new API, called "Extended API set part 2, ISB=
N: 1-931624-67-4" (at
> > > > least this was the latest when I looked last time), which solves th=
at by introducing "*at"
> > > > syscalls that supply an fd of previously opened directory which is =
used instead of CWD
> > > > for searching relative path, ie. the previous example becomes
> > > >=20
> > > >    dirfd =3D open("/tmp/foo"); openat("foo", dirfd);
> > > >=20
> > > > I implemented the whole API as native FreeBSD syscalls + in linuxul=
ator emulation layer.
> > > > Here's the problem:
> > > >=20
> > > > There are two approaches to the name translation from "filedescript=
or" to the "vnode".
> > > >=20
> > > > 1) we can do it in the kern_fooat() syscall and pass namei() the re=
sulting vnode
> > > > 2) we can pass namei() the filedescriptor and do the translation th=
ere
> > > >=20
> > > > PROs of #1:
> > > >=20
> > > > 	o	namei() does not need to know about the curthread, you can use t=
his *at
> > > > 		ability for different purposes, it's cleaner (imho)
> > > >=20
> > > > PROs of #2
> > > >=20
> > > > 	o	raceless implementation
> > > > 	o	no code duplication
> > > >=20
> > > > CONs of #1
> > > >=20
> > > > 	o	some very small code duplication (the translation is done in eve=
ry=20
> > > > 		kern_fooat() function)
> > > > 	o	there is a race between the name translation and the actual use =
of the result
> > > > 		of the translation that needs to be handled, the "path_to_file" s=
tring is copied
> > > > 		to the kernel space twice hence a race
> > > >=20
> > > > CONs of #2
> > > >=20
> > > > 	o	namei is made thread dependant	=09
> > > >=20
> > > > Please tell me what approach you like more. I personally favour #1 =
because I don't like namei()
> > > > being thread dependant, Kostik Belousov prefers #2.
> > >=20
> > > Considering Robert's paper on security race problems in things like s=
ystrace
> > > stemming from when you copy parameters out of userland and into the k=
ernel
> > > multiple times, I think #2 is definitely the better choice.  Also, na=
mei() is
> > > already thread aware AFAICT since 'struct componentname' already cont=
ains a
> > > 'cnp_thread' member (was 'cnp_proc' in 4.x).
> >=20
> > It looks like I'm a bit too late, but anyway...
> >=20
> > From what you write John, #1 is a better choice than #2. If you want to
> > avoid races, you can pass already locked vnode. In case of file
> > descriptors, if p_fd is not locked another thread can close and open
> > different directory under the same descriptor number.
> This is the application imposed race, not the externally imposed one.
> Moreover, I would argue that this is application error.

Right, this will be an application bug and the race can occur even
before entering the kernel. What I'm saying is that this is more racy
than vnode approach (at least is not less racy).

> > I also need such functionality for recent ZFS and #2 makes it impossible
> > to use it. NDINIT_AT() is kernel (VFS) API so it should operate on
> > vnodes, not file descriptor numbers, IMHO.
> Following the same arguments, NDINIT() shall not operate on the pathes
> too.

We both know that we have to convert path to vnode somehow, but you
can't disagree that VFS API doesn't operate on file descriptors in
general, but on vnodes.

> > For completness can you Kostik and Robert provide your arguments against
> > #1?
>=20
> The #2 was already committed.
> The #1 caused a code duplication that was quite error-prone.
>=20
> What are your problems with the #2 ?

Take a look at perforce change @139873. You can find there how I was
using #1.

--=20
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--82I3+IH0IqGh5yIs
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)

iD8DBQFIAxOFForvXbEpPzQRAg2zAKCYG0/6D5q5C5SFuJiYfw5w+fUkfwCglTaY
0aYlD8psV6shTkfW1BabXCE=
=jVbV
-----END PGP SIGNATURE-----

--82I3+IH0IqGh5yIs--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080414081918.GA10478>