Date: Sun, 14 Jul 2024 15:14:44 -0700 From: Rick Macklem <rick.macklem@gmail.com> To: Garrett Wollman <wollman@bimajority.org> Cc: freebsd-stable@freebsd.org, Mark Johnston <markj@freebsd.org> Subject: Re: Possible bug in zfs send or pipe implementation? Message-ID: <CAM5tNy7hM66buzFUMTz6t8sMheSyOnwQ_R-S_CQ8Kbf9Bc1OHQ@mail.gmail.com> In-Reply-To: <26260.2984.961319.782123@hergotha.csail.mit.edu> References: <26259.12713.114036.564205@hergotha.csail.mit.edu> <CAM5tNy4pPF9mHdXM5W6gjztm4_TtFfXnOLu3cdkqvaRf3Ab5uA@mail.gmail.com> <26259.17366.276955.824313@hergotha.csail.mit.edu> <CAM5tNy5m9-rqNTWJP5wAs9FewB6=FW9XbAe4V7qLgnrYJkSFKA@mail.gmail.com> <26260.2984.961319.782123@hergotha.csail.mit.edu>
next in thread | previous in thread | raw e-mail | index | archive | help
--0000000000003b04a7061d3c7185 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sun, Jul 14, 2024 at 10:32=E2=80=AFAM Garrett Wollman <wollman@bimajorit= y.org> wrote: > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem <rick.macklem@gmail.co= m> > said: > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"? > > (There is also a msleep() for "pipbww" in pipe_write().) > > It is sleeping on pipewr, yes. > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva > kern.ipc.pipekva: 536576 > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva > kern.ipc.maxpipekva: 2144993280 > > It's not out of KVA, it's just waiting for the `pv` process to wake up > and read more data. `pv` is single-threaded and blocked on "select". > > It doesn't always get stuck in the same place, which is why I'm > suspecting a lost wakeup somewhere. > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me... /* * Direct copy, bypassing a kernel buffer. */ } else if ((size =3D rpipe->pipe_pages.cnt) !=3D 0) { if (size > uio->uio_resid) size =3D (u_int) uio->uio_resid; PIPE_UNLOCK(rpipe); error =3D uiomove_fromphys(rpipe->pipe_pages.ms, rpipe->pipe_pages.pos, size, uio); PIPE_LOCK(rpipe); if (error) break; nread +=3D size; rpipe->pipe_pages.pos +=3D size; rpipe->pipe_pages.cnt -=3D size; if (rpipe->pipe_pages.cnt =3D=3D 0) { rpipe->pipe_state &=3D ~PIPE_WANTW; wakeup(rpipe); } If it reads uio_resid bytes which is less than pipe_pages.cnt, no wakeup() occurs. I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt =3D=3D = 0)" and do the wakeup() unconditionally, to see if it helps? Because if the application ("pv" in this case) doesn't do another read() on the pipe before calling select(), no wakeup() is going to occur, because here's what pipe_write() does... /* * We have no more space and have something to offer, * wake up select/poll. */ pipeselwakeup(wpipe); wpipe->pipe_state |=3D PIPE_WANTW; pipeunlock(wpipe); error =3D msleep(wpipe, PIPE_MTX(rpipe), PRIBIO | PCATCH, "pipewr", 0); pipelock(wpipe, 0); if (error !=3D 0) break; continue; Note that, once in msleep(), no call to pipeselwakeup() will occur until it gets woken up. I think the current code assumes that the reader ("pv" in this case) will read all the data out of the pipe before calling select() again. Does it do that? rick ps: I've added markj@ as a cc, since he seems to have been the last guy involved in sys_pipe.c. > -GAWollman > > --0000000000003b04a7061d3c7185 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div cl= ass=3D"gmail_default" style=3D"font-family:monospace"><br></div></div><br><= div class=3D"gmail_quote"><div dir=3D"ltr" class=3D"gmail_attr">On Sun, Jul= 14, 2024 at 10:32=E2=80=AFAM Garrett Wollman <<a href=3D"mailto:wollman= @bimajority.org">wollman@bimajority.org</a>> wrote:<br></div><blockquote= class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px so= lid rgb(204,204,204);padding-left:1ex"><<On Sat, 13 Jul 2024 20:50:55= -0700, Rick Macklem <<a href=3D"mailto:rick.macklem@gmail.com" target= =3D"_blank">rick.macklem@gmail.com</a>> said:<br> <br> > Just to clarify it, are you saying zfs is sleeping on "pipewr&quo= t;?<br> > (There is also a msleep() for "pipbww" in pipe_write().)<br> <br> It is sleeping on pipewr, yes.<br> <br> [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva<br> kern.ipc.pipekva: 536576<br> [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva<br> kern.ipc.maxpipekva: 2144993280<br> <br> It's not out of KVA, it's just waiting for the `pv` process to wake= up<br> and read more data.=C2=A0 `pv` is single-threaded and blocked on "sele= ct".<br> <br> It doesn't always get stuck in the same place, which is why I'm<br> suspecting a lost wakeup somewhere.<br></blockquote><div><span class=3D"gma= il_default" style=3D"font-family:monospace">This snippet from sys/kern/sys_= pipe.c looks a little suspicious to me...</span>=C2=A0</div><div>=C2=A0<spa= n style=3D"white-space:normal"><span style=3D"white-space:pre"> </span>/*<= /span></div><div><span style=3D"white-space:normal"><span style=3D"white-sp= ace:pre"> </span> * Direct copy, bypassing a kernel buffer.</span></div><d= iv><span style=3D"white-space:normal"><span style=3D"white-space:pre"> </s= pan> */</span></div><div><span style=3D"white-space:normal"><span style=3D"= white-space:pre"> </span>} else if ((size =3D rpipe->pipe_pages.cnt) != =3D 0) {</span></div><div><span style=3D"white-space:normal"><span style=3D= "white-space:pre"> </span>if (size > uio->uio_resid)</span></div><d= iv><span style=3D"white-space:normal"><span style=3D"white-space:pre"> <= /span>size =3D (u_int) uio->uio_resid;</span></div><div><span style=3D"w= hite-space:normal"><span style=3D"white-space:pre"> </span>PIPE_UNLOCK(rp= ipe);</span></div><div><span style=3D"white-space:normal"><span style=3D"wh= ite-space:pre"> </span>error =3D uiomove_fromphys(rpipe-><a href=3D"ht= tp://pipe_pages.ms">pipe_pages.ms</a>,</span></div><div><span style=3D"whit= e-space:normal"><span style=3D"white-space:pre"> </span>=C2=A0 =C2=A0 rpi= pe->pipe_pages.pos, size, uio);</span></div><div><span style=3D"white-sp= ace:normal"><span style=3D"white-space:pre"> </span>PIPE_LOCK(rpipe);</sp= an></div><div><span style=3D"white-space:normal"><span style=3D"white-space= :pre"> </span>if (error)</span></div><div><span style=3D"white-space:norm= al"><span style=3D"white-space:pre"> </span>break;</span></div><div><spa= n style=3D"white-space:normal"><span style=3D"white-space:pre"> </span>nr= ead +=3D size;</span></div><div><span style=3D"white-space:normal"><span st= yle=3D"white-space:pre"> </span>rpipe->pipe_pages.pos +=3D size;</span= ></div><div><span style=3D"white-space:normal"><span style=3D"white-space:p= re"> </span>rpipe->pipe_pages.cnt -=3D size;</span></div><div><span st= yle=3D"white-space:normal"><span style=3D"white-space:pre"> </span>if (rp= ipe->pipe_pages.cnt =3D=3D 0) {</span></div><div><span style=3D"white-sp= ace:normal"><span style=3D"white-space:pre"> </span>rpipe->pipe_state= &=3D ~PIPE_WANTW;</span></div><div><span style=3D"white-space:normal">= <span style=3D"white-space:pre"> </span>wakeup(rpipe);</span></div><div>= <span style=3D"white-space:normal"><span style=3D"white-space:pre"> </spa= n>}</span></div><div><font face=3D"monospace">I<span class=3D"gmail_default= " style=3D"font-family:monospace">f it reads uio_resid bytes which is less = than pipe_pages.cnt, no</span></font></div><div><font face=3D"monospace"><s= pan class=3D"gmail_default" style=3D"font-family:monospace">wakeup() occurs= .</span></font></div><div><font face=3D"monospace"><span class=3D"gmail_def= ault" style=3D"font-family:monospace">I'd be tempted to try getting rid= of the "if (rpipe->pipe_pages.cnt =3D=3D 0)"</span></font></d= iv><div><font face=3D"monospace"><span class=3D"gmail_default" style=3D"fon= t-family:monospace">and do the wakeup() unconditionally, to see if it helps= ?</span></font></div><div><font face=3D"monospace"><span class=3D"gmail_def= ault" style=3D"font-family:monospace"><br></span></font></div><div><font fa= ce=3D"monospace"><span class=3D"gmail_default" style=3D"font-family:monospa= ce">Because if the application ("pv" in this case) doesn't do= another read() on the</span></font></div><div><font face=3D"monospace"><sp= an class=3D"gmail_default" style=3D"font-family:monospace">pipe before call= ing select(), no wakeup() is going to occur, because here's</span></fon= t></div><div><font face=3D"monospace"><span class=3D"gmail_default" style= =3D"font-family:monospace">what pipe_write() does...</span></font></div><di= v><span class=3D"gmail_default"><font face=3D"monospace"><div><span style= =3D"white-space:pre"> </span>/*</div><div><span style=3D"white-space:pre"= > </span> * We have no more space and have something to offer,</div><div>= <span style=3D"white-space:pre"> </span> * wake up select/poll.</div><div= ><span style=3D"white-space:pre"> </span> */</div><div><span style=3D"whi= te-space:pre"> </span>pipeselwakeup(wpipe);</div><div><br></div><div><spa= n style=3D"white-space:pre"> </span>wpipe->pipe_state |=3D PIPE_WANTW;= </div><div><span style=3D"white-space:pre"> </span>pipeunlock(wpipe);</di= v><div><span style=3D"white-space:pre"> </span>error =3D msleep(wpipe, PI= PE_MTX(rpipe),</div><div><span style=3D"white-space:pre"> </span>=C2=A0 = =C2=A0 PRIBIO | PCATCH, "pipewr", 0);</div><div><span style=3D"wh= ite-space:pre"> </span>pipelock(wpipe, 0);</div><div><span style=3D"white= -space:pre"> </span>if (error !=3D 0)</div><div><span style=3D"white-spac= e:pre"> </span>break;</div><div><span style=3D"white-space:pre"> </spa= n>continue;</div><div>Note that, once in msleep(), no call to pipeselwakeup= () will occur until</div><div>it gets woken up.</div><div><br></div><div>I = think the current code assumes that the reader ("pv" in this case= ) will</div><div>read all the data out of the pipe before calling select() = again.</div><div>Does it do that?</div><div><br></div><div>rick</div><div>p= s: I've added markj@ as a cc, since he seems to have been the last guy = involved</div><div>=C2=A0 =C2=A0 in sys_pipe.c.</div><div><br></div><div><b= r></div></font></span></div><blockquote class=3D"gmail_quote" style=3D"marg= in:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1e= x"> <br> -GAWollman<br> <br> </blockquote></div></div></div></div> --0000000000003b04a7061d3c7185--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAM5tNy7hM66buzFUMTz6t8sMheSyOnwQ_R-S_CQ8Kbf9Bc1OHQ>