Date: Fri, 18 Feb 2022 20:47:09 +0300
From: Mehmet Erol Sanliturk <m.e.sanliturk@gmail.com>
To: Warner Losh <imp@bsdimp.com>
Cc: Peter Jeremy <peterj@freebsd.org>, FreeBSD FS <freebsd-fs@freebsd.org>, "freebsd-geom@FreeBSD.org" <freebsd-geom@freebsd.org>
Subject: Re: bio re-ordering
Message-ID: <CAOgwaMvXD_HOoBF414LSXhwqNw99nfHd10qa_yzdXRq9qK5-2Q@mail.gmail.com>
In-Reply-To: <CANCZdfp_6KNUpxNe9yp5QR9K-5qM9ez%2BLW=sGAPJ72yHYvH6tg@mail.gmail.com>
References: <YfTCs7j3TPZFcFCD@server.rulingia.com>
 <YfTEj1KLhQhoR3xP@kib.kiev.ua>
 <CANCZdfoqQ3Ze%2BcMTsk_ho9x8hsSM9=fTavSao%2BUtwc2nSAEJpQ@mail.gmail.com>
 <Yfo3i9Yy/uCUpss1@server.rulingia.com>
 <CANCZdfqBQOvzMCrJxWq9GzqCKyK_AubBE1CxAW5FULnE7D_jrg@mail.gmail.com>
 <b75872f4-521b-5eab-68d0-4b1c04a10add@FreeBSD.org>
 <CANCZdfp=0rbBkr4SoXhvn7hrQniPQzTeZra2HGBwXDGsJjN8XQ@mail.gmail.com>
 <9848cde6-5c12-cdd4-e722-42fe26fa0349@FreeBSD.org>
 <Yf5IUCWW/tgI/Cse@server.rulingia.com>
 <20220218014814.GJ97875@funkthat.com>
 <Yg9agkeypdDOwKWm@server.rulingia.com>
 <CANCZdfp_6KNUpxNe9yp5QR9K-5qM9ez%2BLW=sGAPJ72yHYvH6tg@mail.gmail.com>
On Fri, Feb 18, 2022 at 7:31 PM Warner Losh <imp@bsdimp.com> wrote:

> So I spent some time looking at what BIO_ORDERED means in today's kernel
> and flavored it with my indoctrination of the ordering guarantees with
> BIO requests from when I wrote the CAM I/O scheduler. It's kinda long,
> but spells out what BIO_ORDERED means, where it can come from and who
> depends on it for what.
>
> On Fri, Feb 18, 2022 at 1:36 AM Peter Jeremy <peterj@freebsd.org> wrote:
>
>> On 2022-Feb-17 17:48:14 -0800, John-Mark Gurney <jmg@funkthat.com> wrote:
>> >Peter Jeremy wrote this message on Sat, Feb 05, 2022 at 20:50 +1100:
>> >> I've raised https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261731
>> >> to make geom_gate support BIO_ORDERED.  Exposing the BIO_ORDERED flag
>> >> to userland is quite easy (once a decision is made as to how to do
>> >> that).  Enhancing the geom_gate clients to correctly implement
>> >> BIO_ORDERED is somewhat harder.
>> >
>> >The clients are single threaded wrt IOs, so I don't think updating them
>> >is required.
>>
>> ggatec(8) and ggated(8) will not reorder I/Os.  I'm not sure about hast.
>>
>> >I do have patches to improve things by making ggated multithreaded to
>> >improve IOPs, and so making this improvement would allow those patches
>> >to be useful.
>>
>> Likewise, I found ggatec and ggated to be too slow for my purposes and
>> so I've implemented my own variant (not network API compatible) that
>> can/does reorder requests.  That was when I noticed that BIO_ORDERED
>> wasn't implemented.
>>
>> >I do have a question though, what are the exact semantics of _ORDERED?
>>
>> I can't authoritatively answer this, sorry.
>
> This is underdocumented. Clients, in general, are expected to cope with
> I/O that completes in an arbitrary order. They are expected to not schedule
> new I/O that depends on old I/O completing for whatever reason (usually
> on-media consistency).
> BIO_ORDERED is used to create a full barrier in the stream of I/Os. The
> comments in the code say vaguely:
>
> /*
>  * This bio must be executed after all previous bios in the queue have been
>  * executed, and before any successive bios can be executed.
>  */
>
> Drivers implement this as a partitioning of requests. All requests before
> it are completed, then the BIO_ORDERED operation is done, then requests
> after it are scheduled with the device.
>
> BIO_FLUSH I think is the only remaining operation that's done as
> BIO_ORDERED directly. xen.../blkback.c, geom_io.c and ffs_softdep.c are
> the only ones that set it, and all on BIO_FLUSH operations. bio/buf clients
> depend on this to ensure metadata on the drive is in a consistent state
> after it's been updated.
>
> xen/.../blkback.c also sets it for all BLKIF_OP_WRITE_BARRIER operations
> (so write barriers).
>
> In the upper layers, we have struct buf instead of struct bio to describe
> future I/Os that the buffer cache may need to do. There's a flag B_BARRIER
> that gets turned into BIO_ORDERED in geom_vfs. B_BARRIER is set in only
> two places (and copied in one other) in vfs_bio.c: babarrierwrite and
> bbarrierwrite, for async vs sync writes respectively.
>
> CAM will set BIO_ORDERED for all BIO_ZONE commands for reasons that are
> at best unclear to me, but which won't matter for this discussion.
>
> ffs_alloc.c (so UFS again) is the only place that uses babarrierwrite. It
> is used to ensure that all inode initializations are completed before the
> cylinder group bitmap is written out. This is done with newfs, when new
> cylinder groups are created with growfs, and apparently in a few other
> cases where additional inodes are created in newly-created UFS2
> filesystems.
> This can be disabled with vfs.ffs.doasyncinodeinit=0 when barrier writes
> aren't working as advertised, but there's a big performance hit from doing
> so until all the inodes for the filesystem have been lazily populated.
>
> No place uses bbarrierwrite that I can find.
>
> Based on all of that, the CAM's dynamic I/O scheduler will reorder reads
> around a BIO_ORDERED operation, but not writes, trims or flushes. Since,
> in general, operations happen in an arbitrary order, scheduling both a read
> and a write at the same time for the same block will result in undefined
> results.
>
> Different drivers handle this differently. CAM will honor the BIO_ORDERED
> tag by scheduling the I/O with an ordering tag so that the SCSI hardware
> will properly order the result. The simpler ATA version will use a non-NCQ
> request to force the proper ordering (since to send a non-NCQ request, you
> have to drain the queue, do that one command, and then start up again).
> nvd will just throw the I/O at the device, until it encounters a
> BIO_ORDERED request. Then it will queue everything until all the current
> requests complete, then do the ordered request, then do the rest of the
> queued I/O as if it had just shown up.
>
> Most drivers use bioq_disksort(), which will queue the request to the end
> of the bioq and mark things so all I/Os after that are in their new
> 'elevator car' for its elevator sort algorithm. This means that CAM's
> normal ways of dequeuing the request will preserve ordering through the
> periph driver's start routine (where the dynamic scheduler will honor it
> for writes, but not reads, but the default scheduler will honor it for
> both).
>
>> >And right now, the ggate protocol (from what I remember) doesn't have
>> >a way to know when the remote kernel has received notification that an
>> >IO is complete.
>>
>> A G_GATE_CMD_START write request will be sent to the remote system and
>> issued as a pwrite(2) then an acknowledgement packet will be returned
>> and passed back to the local kernel via G_GATE_CMD_DONE.  There's no
>> support for BIO_FLUSH or BIO_ORDERED so there's no way for the local
>> kernel to know when the write has been written to non-volatile store.
>
> That's unfortunate. UFS can work around the BIO_ORDERED problem with
> a simple setting, but not the BIO_FLUSH problem.
>
>> >> I've done some experiments and OpenZFS doesn't generate BIO_ORDERED
>> >> operations so I've also raised
>> >> https://github.com/openzfs/zfs/issues/13065
>> >> I haven't looked into how difficult that would be to fix.
>>
>> Unrelated to the above but for completeness:  OpenZFS avoids the need
>> for BIO_ORDERED by not issuing additional I/Os until previous I/Os have
>> been retired when ordering is important.  (It does rely on BIO_FLUSH).
>
> To be clear: OpenZFS won't schedule new I/Os until the BIO_FLUSH it sends
> down w/o the BIO_ORDERED flag completes, right? The parenthetical confuses
> me on how to parse it: BIO_FLUSH is needed and ZFS depends on it completing
> with all blocks flushed to stable media, or ZFS depends on BIO_FLUSH being
> strongly ordered relative to other commands. I think you mean the former,
> but want to make sure.
>
> The root of this problem, I think, is the following:
>      % man 9 bio
>      No manual entry for bio

----------------------------------------------------------------------
> I think I'll have to massage this email into an appropriate man page.
> At the very least, I should turn some/all of the above into a blog post :)
>
> Warner
----------------------------------------------------------------------

The above sentence is WONDERFUL ...

In some of my messages, I have been suggesting the following:

- Make the Handbook parts and man pages a "blog" system,
- Attach the related messages to these parts,
- Relay comments and questions about these pages to the mailing lists,
- After a while, or at suitable times, move the "knowledge" (meaning
  "what to do") in these messages into the related parts.

My opinion is that my ideas are not very effective.

If the above sentence can "converge" to such a structure, it may be
really WONDERFUL ...

With my best wishes,

Mehmet Erol Sanliturk
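
The partitioning rule quoted above (everything submitted before a
BIO_ORDERED request completes first, then the ordered request runs alone,
then the later requests are scheduled) can be illustrated with a small
model. This is only a sketch under stated assumptions: the Bio class and the
functions partition_at_barriers and completion_order are made-up names for
illustration, not kernel structures or FreeBSD APIs, and the shuffle merely
imitates a device that completes requests in an arbitrary order. It does not
model the CAM dynamic scheduler's behaviour of letting reads pass a barrier.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Bio:
    """Toy stand-in for an I/O request; not the kernel's struct bio."""
    name: str
    ordered: bool = False  # models the BIO_ORDERED flag (assumption)


def partition_at_barriers(bios):
    """Split a submission stream into batches separated by ordered bios.

    Each ordered bio becomes a batch of its own, so it runs after
    everything submitted before it and before everything after it.
    """
    batches, current = [], []
    for bio in bios:
        if bio.ordered:
            if current:
                batches.append(current)
            batches.append([bio])
            current = []
        else:
            current.append(bio)
    if current:
        batches.append(current)
    return batches


def completion_order(bios, rng):
    """Return one possible completion order that respects the barriers.

    Within a batch the requests may complete in any order (imitated by
    the shuffle); the batches themselves complete strictly in turn.
    """
    done = []
    for batch in partition_at_barriers(bios):
        batch = list(batch)
        rng.shuffle(batch)
        done.extend(batch)
    return done


# Hypothetical stream: two inode writes, a flush acting as the barrier,
# then the cylinder group bitmap write and an unrelated read.
stream = [
    Bio("write-inode-1"),
    Bio("write-inode-2"),
    Bio("flush", ordered=True),
    Bio("write-cg-bitmap"),
    Bio("read-block-9"),
]
print([b.name for b in completion_order(stream, random.Random(0))])
```

Whatever the seed, the two inode writes land before the flush and the
bitmap write lands after it; only the order inside each batch varies. That
is the property a client relies on when it marks a request as ordered.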