Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 18 Jul 2014 16:40:12 +0200
From:      Mateusz Guzik <mjguzik@gmail.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: current fd allocation idiom
Message-ID:  <20140718144012.GA7179@dft-labs.eu>
In-Reply-To: <20140718130629.GJ93733@kib.kiev.ua>
References:  <20140717235538.GA15714@dft-labs.eu> <20140718130629.GJ93733@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Jul 18, 2014 at 04:06:29PM +0300, Konstantin Belousov wrote:
> On Fri, Jul 18, 2014 at 01:55:38AM +0200, Mateusz Guzik wrote:
> > The kernel has to deal with concurrent attempts to open, close, dup2 and
> > use the same file descriptor.
> > 
> > I start with stating a rather esoteric bug affecting this area, then I
> > follow with a short overview of what is happening in general and a
> > proposal how to change to get rid of the bug and get some additional
> > enchancements. Interestingly enough turns out Linux is doing pretty much
> > the same thing.
> > 
> > ============================
> > THE BUG:
> > /*
> >  * Initialize the file pointer with the specified properties.
> >  *
> >  * The ops are set with release semantics to be certain that the flags, type,
> >  * and data are visible when ops is.  This is to prevent ops methods from being
> >  * called with bad data.
> >  */
> > void
> > finit(struct file *fp, u_int flag, short type, void *data, struct fileops *ops)
> > {
> >         fp->f_data = data;
> >         fp->f_flag = flag;
> >         fp->f_type = type;
> >         atomic_store_rel_ptr((volatile uintptr_t *)&fp->f_ops, (uintptr_t)ops);
> > }
> > 
> > This itself is fine, but it assumes all code obtaining fp from fdtable places
> > a read memory barrier after reading f_ops and before reading anything else.
> > As you could guess no code does that and I don't believe placing rmb's
> > in several new places is the way to go.
> I think your analysis is correct for all cases except kern_openat().
> 
> For kern_openat(), we install a file into fdtable only after the struct
> file is fully initialized, see kern_openat(). The file is allocated
> with falloc_noinstall(), then file is initialized, then finstall() does
> FILEDESC_LOCK/UNLOCK, which ensures the full barrier. Other file
> accessors do fget_unlocked(), which does acquire (of fp->f_count, but
> this does not matter).
> 
> The only critical constraint is that other accessors must not see
> f_ops != badfileops while struct file is not fully visible.  IMO,
> the full barrier in finstall() and acquire in fget*() guarantee
> that we see badfileops until writes to other members are visible.
> 

I agree, but I doubt everything can be converted to this model (see below).

> For falloc(), indeed the write to f_ops could become visible too early
> (but not on x86).
> 
> > 
> > ============================
> > GENERAL OVERVIEW OF CURRENT STATE:
> > 
> > What readers need to do:
> > - rmb() after reading fp_ops
> > - check fp_ops for badfileops
> How can readers see badfileops ?
> 

Not sure what you mean. fp is installed with badfileops, anything
accessing fdtable before finit finishes will see this.

> > 
> > ============================
> > PROPOSAL:
> > 
> > struct file *fp;
> > int fd;
> > 
> > if (error = falloc(&fp, &fd))
> Use falloc_noinstall() there.
> 
> > 	return (error);
> > if (error = something(....)) {
> > 	fdclose(fp, fd);
> > 	return (error);
> > }
> > 
> > finit(fp, ....);
> > factivate(fd, fp);
> This function is spelled finstall().
> 
> It seems that all what is needed is conversion of places using
> falloc() to falloc_noinstall()/finstall().
> 

This postpones fd allocation to after interested function did all work
it wanted to do, which means we would need reliable ways of reverting
all the work in case allocation failed. I'm not so confident we can do
that for all current consumers and both current and my proposed approach
don't impose such requirement.

Of course postponing fd allocation where possible is definitely worth
doing.

For cases where it is not possible/feasible, the only problem we have is
making sure we update fd entry in proper table in factivate.

The easiest solution is to FILEDESC_XLOCK, but this may have measurable
impact. We can get away with FILEDESC_SLOCK just fine, but this still
writes and may ping-pong with other cpus.

This is another place where we could just plop sequence counters.

fdgrowtable would +/-:
seq_write_begin(&fdp->fd_tbl_seq);
memcpy(....);
assign the new pointer
seq_end_begin(&fdp->fd_tbl_seq);

Then factivate would be +/-:

do {
	fdtable = __READ_ONCE(..., fdp->fd_tbl);
	rmb();
	seq = seq_read(fdp->fd_tbl_seq);
	fdtable[fd] = fp;
while (!seq_consistent(fdp->fd_tbl_seq, seq));

This in worst case updates old fdtable (which we never free so it is
harmless) and retries. seq_read never returns seq in 'modify' state,
and we check that seq is the same before and after the operation. If the
condition is met there was no concurrent fdtable copy.

This also means that table readers can 'suddenly' get fps, but this
should be fine. NULL fp is used to denote unused fd. If fp is non NULL,
it is safe for use. If it is NULL, fd is ignored which should not matter.

Functions like fdcopy just have to make sure they read fp once.

-- 
Mateusz Guzik <mjguzik gmail.com>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140718144012.GA7179>