Date: Fri, 28 Jan 2022 22:32:02 -0700
From: Warner Losh <imp@bsdimp.com>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: peterj@freebsd.org, FreeBSD FS <freebsd-fs@freebsd.org>, "freebsd-geom@FreeBSD.org" <freebsd-geom@freebsd.org>
Subject: Re: bio re-ordering
Message-ID: <CANCZdfoqQ3Ze%2BcMTsk_ho9x8hsSM9=fTavSao%2BUtwc2nSAEJpQ@mail.gmail.com>
In-Reply-To: <YfTEj1KLhQhoR3xP@kib.kiev.ua>
References: <YfTCs7j3TPZFcFCD@server.rulingia.com> <YfTEj1KLhQhoR3xP@kib.kiev.ua>
On Fri, Jan 28, 2022 at 9:38 PM Konstantin Belousov <kostikbel@gmail.com> wrote:

> On Sat, Jan 29, 2022 at 03:29:39PM +1100, peterj@freebsd.org wrote:
> > I'm working on a GEOM Gate network client to better handle high-latency
> > connections and have some questions regarding bio ordering assumptions
> > (alternatively, how much should I be able to re-order bio requests without
> > breaking things).  Within geom_gate, an incoming bio request is retrieved
> > from the kernel using a G_GATE_CMD_START ioctl, processed in userland
> > (typically by forwarding it to a remote system) and then returned via a
> > G_GATE_CMD_DONE ioctl.  My GEOM Gate client can reorder requests quite
> > aggressively and I suspect it's breaking some kernel assumptions regarding
> > bio behaviour.  The following questions assume that BIO_READ, BIO_WRITE and
> > BIO_FLUSH are valid but BIO_DELETE isn't supported.
> >
> > a) In the absence of BIO_FLUSH operations, what (if any) are the limits on
> >    reordering operations?  Given a block that initially contains A, followed
> >    by a write B, read and write C, is there any constraint on which content
> >    the read returns?
> There are no limits.  Either other software entities, or hardware itself,
> can process requests in arbitrary order.  This is why things are typically
> done in the completion handler, and part of the reason why the complexity
> of UFS SU exists.
>
> >
> > b) Are individual BIO_READ and BIO_WRITE operations expected to be atomic
> >    with respect to other BIO_WRITE operations?  Given 2 adjacent blocks that
> >    initially contain AB, and successive write CD, read and write EF
> >    operations to those blocks, is it expected that the read would return CD
> >    (or maybe AD or EF, assuming that's valid from the previous question) or
> >    could the write operations partially complete in different orders,
> >    resulting in something like AD, CF, EB etc?
> No.  At the very least, underlying entities can split a request into several,
> each of which is ordered individually.  Typically, it is higher-level
> code that ensures that there are no concurrent modifications of the same
> block.  For instance, we exclusively lock vnodes and buffers around
> metadata updates.  Similarly, we lock buffers until the data is written
> to the device.
>
> >
> > c) I assume that a BIO_FLUSH should not return DONE until all preceding
> >    write operations have completed.  Is it required that write
> >    operations issued after the BIO_FLUSH must not complete before the
> >    BIO_FLUSH completes?
> UFS SU relies on BIO_FLUSH being the full barrier.
>

I think that ufs relies on two ordering primitives, both marked with
BIO_ORDERED today.  That's what most of the drivers key off of.  We always
set BIO_ORDERED on all the BIO_FLUSH events as far as I can tell.

Also, anything that sets a B_BARRIER at the upper layers also gets
BIO_ORDERED added to it.  b*barrierwrite() sets this, and that's used in
the ffs_alloc code.

Warner
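
The ordering rules discussed in this thread can be illustrated with a small userland model. This is only a sketch under stated assumptions, not kernel or ggate code: the Bio class, its "ordered" field and both helper functions are hypothetical names invented for the illustration. It assumes that BIO_FLUSH requests and anything carrying BIO_ORDERED act as full barriers, and that requests between two barriers may complete in any order. Whether the BIO_ORDERED flag is actually visible to a GEOM Gate userland client is not established in this thread.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Bio:
    """Hypothetical stand-in for a bio request (not the kernel struct)."""
    seq: int               # issue order as seen by the client
    cmd: str               # "READ", "WRITE" or "FLUSH"
    ordered: bool = False  # models BIO_ORDERED (assumption)


def is_barrier(bio):
    # Assumption: FLUSH and ordered requests are both full barriers.
    return bio.cmd == "FLUSH" or bio.ordered


def completion_order(bios, rng):
    """Return one legal completion order for the issued requests.

    Requests between barriers are shuffled freely; a barrier completes
    only after everything issued before it, and nothing issued after it
    completes before it.
    """
    done, window = [], []
    for bio in bios:
        if is_barrier(bio):
            rng.shuffle(window)   # arbitrary order inside the window
            done.extend(window)
            done.append(bio)      # the barrier closes the window
            window = []
        else:
            window.append(bio)
    rng.shuffle(window)           # trailing requests after the last barrier
    done.extend(window)
    return done


def respects_barriers(issued, completed):
    """Check that no request crossed a barrier in the completion order."""
    pos = {bio.seq: i for i, bio in enumerate(completed)}
    for i, bio in enumerate(issued):
        if not is_barrier(bio):
            continue
        if any(pos[e.seq] > pos[bio.seq] for e in issued[:i]):
            return False
        if any(pos[later.seq] < pos[bio.seq] for later in issued[i + 1:]):
            return False
    return True
```

A client that reorders aggressively could run a check like respects_barriers() over its own issue and completion logs while debugging. Note that the model says nothing about atomicity: as stated above, a single request may still be split further down the stack, so overlapping writes remain the responsibility of the higher-level code.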
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfoqQ3Ze%2BcMTsk_ho9x8hsSM9=fTavSao%2BUtwc2nSAEJpQ>