From owner-freebsd-arch@FreeBSD.ORG Fri Jul 18 16:00:05 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2BD21E2F for ; Fri, 18 Jul 2014 16:00:05 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id A64BD22F5 for ; Fri, 18 Jul 2014 16:00:04 +0000 (UTC) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id s6IFxxW6090738 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 18 Jul 2014 18:59:59 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua s6IFxxW6090738 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id s6IFxxi3090737; Fri, 18 Jul 2014 18:59:59 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 18 Jul 2014 18:59:59 +0300 From: Konstantin Belousov To: Mateusz Guzik Subject: Re: current fd allocation idiom Message-ID: <20140718155959.GN93733@kib.kiev.ua> References: <20140717235538.GA15714@dft-labs.eu> <20140718130629.GJ93733@kib.kiev.ua> <20140718144012.GA7179@dft-labs.eu> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="whB3aSVTmcGVRwAq" Content-Disposition: inline In-Reply-To: <20140718144012.GA7179@dft-labs.eu> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home Cc: freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 18 Jul 2014 16:00:05 -0000 --whB3aSVTmcGVRwAq Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Jul 18, 2014 at 04:40:12PM +0200, Mateusz Guzik wrote: > On Fri, Jul 18, 2014 at 04:06:29PM +0300, Konstantin Belousov wrote: > > On Fri, Jul 18, 2014 at 01:55:38AM +0200, Mateusz Guzik wrote: > > > The kernel has to deal with concurrent attempts to open, close, dup2 = and > > > use the same file descriptor. > > >=20 > > > I start with stating a rather esoteric bug affecting this area, then I > > > follow with a short overview of what is happening in general and a > > > proposal how to change to get rid of the bug and get some additional > > > enchancements. Interestingly enough turns out Linux is doing pretty m= uch > > > the same thing. > > >=20 > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D > > > THE BUG: > > > /* > > > * Initialize the file pointer with the specified properties. > > > * > > > * The ops are set with release semantics to be certain that the flag= s, type, > > > * and data are visible when ops is. This is to prevent ops methods = =66rom being > > > * called with bad data. > > > */ > > > void > > > finit(struct file *fp, u_int flag, short type, void *data, struct fil= eops *ops) > > > { > > > fp->f_data =3D data; > > > fp->f_flag =3D flag; > > > fp->f_type =3D type; > > > atomic_store_rel_ptr((volatile uintptr_t *)&fp->f_ops, (uintp= tr_t)ops); > > > } > > >=20 > > > This itself is fine, but it assumes all code obtaining fp from fdtabl= e places > > > a read memory barrier after reading f_ops and before reading anything= else. > > > As you could guess no code does that and I don't believe placing rmb's > > > in several new places is the way to go. > > I think your analysis is correct for all cases except kern_openat(). > >=20 > > For kern_openat(), we install a file into fdtable only after the struct > > file is fully initialized, see kern_openat(). The file is allocated > > with falloc_noinstall(), then file is initialized, then finstall() does > > FILEDESC_LOCK/UNLOCK, which ensures the full barrier. Other file > > accessors do fget_unlocked(), which does acquire (of fp->f_count, but > > this does not matter). > >=20 > > The only critical constraint is that other accessors must not see > > f_ops !=3D badfileops while struct file is not fully visible. IMO, > > the full barrier in finstall() and acquire in fget*() guarantee > > that we see badfileops until writes to other members are visible. > >=20 >=20 > I agree, but I doubt everything can be converted to this model (see below= ). >=20 > > For falloc(), indeed the write to f_ops could become visible too early > > (but not on x86). > >=20 > > >=20 > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D > > > GENERAL OVERVIEW OF CURRENT STATE: > > >=20 > > > What readers need to do: > > > - rmb() after reading fp_ops > > > - check fp_ops for badfileops > > How can readers see badfileops ? > >=20 >=20 > Not sure what you mean. fp is installed with badfileops, anything > accessing fdtable before finit finishes will see this. I referenced falloc_noinstall(). >=20 > > >=20 > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D > > > PROPOSAL: > > >=20 > > > struct file *fp; > > > int fd; > > >=20 > > > if (error =3D falloc(&fp, &fd)) > > Use falloc_noinstall() there. > >=20 > > > return (error); > > > if (error =3D something(....)) { > > > fdclose(fp, fd); > > > return (error); > > > } > > >=20 > > > finit(fp, ....); > > > factivate(fd, fp); > > This function is spelled finstall(). > >=20 > > It seems that all what is needed is conversion of places using > > falloc() to falloc_noinstall()/finstall(). > >=20 >=20 > This postpones fd allocation to after interested function did all work > it wanted to do, which means we would need reliable ways of reverting > all the work in case allocation failed. I'm not so confident we can do > that for all current consumers and both current and my proposed approach > don't impose such requirement. Cleanup should be identical to the actions done on close(2). >=20 > Of course postponing fd allocation where possible is definitely worth > doing. Yes, and after that the rest of the cases should be evaluated. But my gut feeling is that everything would be converted. >=20 > For cases where it is not possible/feasible, the only problem we have is > making sure we update fd entry in proper table in factivate. >=20 > The easiest solution is to FILEDESC_XLOCK, but this may have measurable > impact. We can get away with FILEDESC_SLOCK just fine, but this still > writes and may ping-pong with other cpus. >=20 > This is another place where we could just plop sequence counters. >=20 > fdgrowtable would +/-: > seq_write_begin(&fdp->fd_tbl_seq); > memcpy(....); > assign the new pointer > seq_end_begin(&fdp->fd_tbl_seq); >=20 > Then factivate would be +/-: >=20 > do { > fdtable =3D __READ_ONCE(..., fdp->fd_tbl); > rmb(); > seq =3D seq_read(fdp->fd_tbl_seq); > fdtable[fd] =3D fp; > while (!seq_consistent(fdp->fd_tbl_seq, seq)); >=20 > This in worst case updates old fdtable (which we never free so it is > harmless) and retries. seq_read never returns seq in 'modify' state, > and we check that seq is the same before and after the operation. If the > condition is met there was no concurrent fdtable copy. >=20 > This also means that table readers can 'suddenly' get fps, but this > should be fine. NULL fp is used to denote unused fd. If fp is non NULL, > it is safe for use. If it is NULL, fd is ignored which should not matter. >=20 > Functions like fdcopy just have to make sure they read fp once. >=20 > --=20 > Mateusz Guzik --whB3aSVTmcGVRwAq Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBAgAGBQJTyUR+AAoJEJDCuSvBvK1BQcYP/iGyymAyxnX4nZ55SIS0sz1n KS9bJzNmOsnQrqQNcTP1AArkmCPqa2MGTWEXxzvS54dhXCBRxA14WmSQykfUV7jg 2B2AsnnQ5lNOxwOjjHwvvkhrWtqoxvuVhg7a6ADFIu4AS5TCtVxi1bohO5X+T3ED qoPxtDai6rNNiRJKu3aNfKYT0idNVt1HROHRu3pRelBkfFqwdxO8D4GsI8e2rQGY 74ZIi3kkVq+N3GA1KbSnMzrEhzE5jrUNz5v9YCszoYmbsyKY1myToTQoqX/E70l2 KnhIotBbdQFEYBb2EffCHfrzn/Tl8y6uE//7Aqag6DVj3iJ5Khv5FLe1jiJXx573 sxfOj5N+eaiBY9Y8T7pNkYkrHAz2910UR+yW04atAV7+/yKHH1uyVewZej+JAFwN IFYKcfwhBOlbCBc+mfV1WZN0axszJX3eX+i7Fi5Mo2GdaLeEFyqZSWbBlNHTPpw2 OwrzDFZAcm9EsLQGf6RP4Mawq3TeuKNvdqjuNcegeZHNHGyFU3XCrzBLKVz2OOWY Dh+sO5b79K7MZctNvZNhjOMawy5BALaFapsjnJc5QvR9Gknxml256Sc18eSfDRgt qQ/fQA/SoQRndzhRjRsHEGOCIzwqeNIgaMKv86NiAND/zB3Xi5jFKzpeSoig3TvU 0HhnD6SBejgndg23BgKI =UyEP -----END PGP SIGNATURE----- --whB3aSVTmcGVRwAq--