Date: Fri, 18 Jul 2014 18:59:59 +0300 From: Konstantin Belousov <kostikbel@gmail.com> To: Mateusz Guzik <mjguzik@gmail.com> Cc: freebsd-arch@freebsd.org Subject: Re: current fd allocation idiom Message-ID: <20140718155959.GN93733@kib.kiev.ua> In-Reply-To: <20140718144012.GA7179@dft-labs.eu> References: <20140717235538.GA15714@dft-labs.eu> <20140718130629.GJ93733@kib.kiev.ua> <20140718144012.GA7179@dft-labs.eu>
next in thread | previous in thread | raw e-mail | index | archive | help
--whB3aSVTmcGVRwAq Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Jul 18, 2014 at 04:40:12PM +0200, Mateusz Guzik wrote: > On Fri, Jul 18, 2014 at 04:06:29PM +0300, Konstantin Belousov wrote: > > On Fri, Jul 18, 2014 at 01:55:38AM +0200, Mateusz Guzik wrote: > > > The kernel has to deal with concurrent attempts to open, close, dup2 = and > > > use the same file descriptor. > > >=20 > > > I start with stating a rather esoteric bug affecting this area, then I > > > follow with a short overview of what is happening in general and a > > > proposal how to change to get rid of the bug and get some additional > > > enchancements. Interestingly enough turns out Linux is doing pretty m= uch > > > the same thing. > > >=20 > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D > > > THE BUG: > > > /* > > > * Initialize the file pointer with the specified properties. > > > * > > > * The ops are set with release semantics to be certain that the flag= s, type, > > > * and data are visible when ops is. This is to prevent ops methods = =66rom being > > > * called with bad data. > > > */ > > > void > > > finit(struct file *fp, u_int flag, short type, void *data, struct fil= eops *ops) > > > { > > > fp->f_data =3D data; > > > fp->f_flag =3D flag; > > > fp->f_type =3D type; > > > atomic_store_rel_ptr((volatile uintptr_t *)&fp->f_ops, (uintp= tr_t)ops); > > > } > > >=20 > > > This itself is fine, but it assumes all code obtaining fp from fdtabl= e places > > > a read memory barrier after reading f_ops and before reading anything= else. > > > As you could guess no code does that and I don't believe placing rmb's > > > in several new places is the way to go. > > I think your analysis is correct for all cases except kern_openat(). > >=20 > > For kern_openat(), we install a file into fdtable only after the struct > > file is fully initialized, see kern_openat(). The file is allocated > > with falloc_noinstall(), then file is initialized, then finstall() does > > FILEDESC_LOCK/UNLOCK, which ensures the full barrier. Other file > > accessors do fget_unlocked(), which does acquire (of fp->f_count, but > > this does not matter). > >=20 > > The only critical constraint is that other accessors must not see > > f_ops !=3D badfileops while struct file is not fully visible. IMO, > > the full barrier in finstall() and acquire in fget*() guarantee > > that we see badfileops until writes to other members are visible. > >=20 >=20 > I agree, but I doubt everything can be converted to this model (see below= ). >=20 > > For falloc(), indeed the write to f_ops could become visible too early > > (but not on x86). > >=20 > > >=20 > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D > > > GENERAL OVERVIEW OF CURRENT STATE: > > >=20 > > > What readers need to do: > > > - rmb() after reading fp_ops > > > - check fp_ops for badfileops > > How can readers see badfileops ? > >=20 >=20 > Not sure what you mean. fp is installed with badfileops, anything > accessing fdtable before finit finishes will see this. I referenced falloc_noinstall(). >=20 > > >=20 > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D > > > PROPOSAL: > > >=20 > > > struct file *fp; > > > int fd; > > >=20 > > > if (error =3D falloc(&fp, &fd)) > > Use falloc_noinstall() there. > >=20 > > > return (error); > > > if (error =3D something(....)) { > > > fdclose(fp, fd); > > > return (error); > > > } > > >=20 > > > finit(fp, ....); > > > factivate(fd, fp); > > This function is spelled finstall(). > >=20 > > It seems that all what is needed is conversion of places using > > falloc() to falloc_noinstall()/finstall(). > >=20 >=20 > This postpones fd allocation to after interested function did all work > it wanted to do, which means we would need reliable ways of reverting > all the work in case allocation failed. I'm not so confident we can do > that for all current consumers and both current and my proposed approach > don't impose such requirement. Cleanup should be identical to the actions done on close(2). >=20 > Of course postponing fd allocation where possible is definitely worth > doing. Yes, and after that the rest of the cases should be evaluated. But my gut feeling is that everything would be converted. >=20 > For cases where it is not possible/feasible, the only problem we have is > making sure we update fd entry in proper table in factivate. >=20 > The easiest solution is to FILEDESC_XLOCK, but this may have measurable > impact. We can get away with FILEDESC_SLOCK just fine, but this still > writes and may ping-pong with other cpus. >=20 > This is another place where we could just plop sequence counters. >=20 > fdgrowtable would +/-: > seq_write_begin(&fdp->fd_tbl_seq); > memcpy(....); > assign the new pointer > seq_end_begin(&fdp->fd_tbl_seq); >=20 > Then factivate would be +/-: >=20 > do { > fdtable =3D __READ_ONCE(..., fdp->fd_tbl); > rmb(); > seq =3D seq_read(fdp->fd_tbl_seq); > fdtable[fd] =3D fp; > while (!seq_consistent(fdp->fd_tbl_seq, seq)); >=20 > This in worst case updates old fdtable (which we never free so it is > harmless) and retries. seq_read never returns seq in 'modify' state, > and we check that seq is the same before and after the operation. If the > condition is met there was no concurrent fdtable copy. >=20 > This also means that table readers can 'suddenly' get fps, but this > should be fine. NULL fp is used to denote unused fd. If fp is non NULL, > it is safe for use. If it is NULL, fd is ignored which should not matter. >=20 > Functions like fdcopy just have to make sure they read fp once. >=20 > --=20 > Mateusz Guzik <mjguzik gmail.com> --whB3aSVTmcGVRwAq Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBAgAGBQJTyUR+AAoJEJDCuSvBvK1BQcYP/iGyymAyxnX4nZ55SIS0sz1n KS9bJzNmOsnQrqQNcTP1AArkmCPqa2MGTWEXxzvS54dhXCBRxA14WmSQykfUV7jg 2B2AsnnQ5lNOxwOjjHwvvkhrWtqoxvuVhg7a6ADFIu4AS5TCtVxi1bohO5X+T3ED qoPxtDai6rNNiRJKu3aNfKYT0idNVt1HROHRu3pRelBkfFqwdxO8D4GsI8e2rQGY 74ZIi3kkVq+N3GA1KbSnMzrEhzE5jrUNz5v9YCszoYmbsyKY1myToTQoqX/E70l2 KnhIotBbdQFEYBb2EffCHfrzn/Tl8y6uE//7Aqag6DVj3iJ5Khv5FLe1jiJXx573 sxfOj5N+eaiBY9Y8T7pNkYkrHAz2910UR+yW04atAV7+/yKHH1uyVewZej+JAFwN IFYKcfwhBOlbCBc+mfV1WZN0axszJX3eX+i7Fi5Mo2GdaLeEFyqZSWbBlNHTPpw2 OwrzDFZAcm9EsLQGf6RP4Mawq3TeuKNvdqjuNcegeZHNHGyFU3XCrzBLKVz2OOWY Dh+sO5b79K7MZctNvZNhjOMawy5BALaFapsjnJc5QvR9Gknxml256Sc18eSfDRgt qQ/fQA/SoQRndzhRjRsHEGOCIzwqeNIgaMKv86NiAND/zB3Xi5jFKzpeSoig3TvU 0HhnD6SBejgndg23BgKI =UyEP -----END PGP SIGNATURE----- --whB3aSVTmcGVRwAq--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140718155959.GN93733>