Date: Mon, 15 Jul 2024 23:08:44 -0500 From: "Eric A. Borisch" <eborisch+FreeBSD@gmail.com> To: Mark Johnston <markj@freebsd.org> Cc: FreeBSD Stable <freebsd-stable@freebsd.org>, rick.macklem@gmail.com, Garrett Wollman <wollman@bimajority.org> Subject: Re: Possible bug in zfs send or pipe implementation? Message-ID: <CAASnNno-8vSkqx_WR7CUsMbL_inEhW8WL35pGrVaH1Edi65n7g@mail.gmail.com>
next in thread | raw e-mail | index | archive | help
--000000000000d91af5061d558015 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Jul 15, 2024 at 9:50=E2=80=AFAM Mark Johnston <markj@freebsd.org> w= rote: > On Sun, Jul 14, 2024 at 03:14:44PM -0700, Rick Macklem wrote: > > On Sun, Jul 14, 2024 at 10:32=E2=80=AFAM Garrett Wollman <wollman@bimaj= ority.org > > > > wrote: > > > > > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem < > rick.macklem@gmail.com> > > > said: > > > > > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"? > > > > (There is also a msleep() for "pipbww" in pipe_write().) > > > > > > It is sleeping on pipewr, yes. > > > > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva > > > kern.ipc.pipekva: 536576 > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva > > > kern.ipc.maxpipekva: 2144993280 > > > > > > It's not out of KVA, it's just waiting for the `pv` process to wake u= p > > > and read more data. `pv` is single-threaded and blocked on "select". > > > > > > It doesn't always get stuck in the same place, which is why I'm > > > suspecting a lost wakeup somewhere. > > > > > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me..= . > > /* > > * Direct copy, bypassing a kernel buffer. > > */ > > } else if ((size =3D rpipe->pipe_pages.cnt) !=3D 0) { > > if (size > uio->uio_resid) > > size =3D (u_int) uio->uio_resid; > > PIPE_UNLOCK(rpipe); > > error =3D uiomove_fromphys(rpipe->pipe_pages.ms, > > rpipe->pipe_pages.pos, size, uio); > > PIPE_LOCK(rpipe); > > if (error) > > break; > > nread +=3D size; > > rpipe->pipe_pages.pos +=3D size; > > rpipe->pipe_pages.cnt -=3D size; > > if (rpipe->pipe_pages.cnt =3D=3D 0) { > > rpipe->pipe_state &=3D ~PIPE_WANTW; > > wakeup(rpipe); > > } > > If it reads uio_resid bytes which is less than pipe_pages.cnt, no > > wakeup() occurs. > > I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt =3D= =3D > 0)" > > and do the wakeup() unconditionally, to see if it helps? > > I don't think that can help. pipe_write() will block if the "direct > write" buffer is non-empty. See the comment in pipe_write(), "Pipe > buffered writes cannot be coincidental with direct writes". > > select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the > read side, so I suspect that the problem is elsewhere, or else I'm > misunderstanding something. > > > Because if the application ("pv" in this case) doesn't do another read(= ) > on > > the > > pipe before calling select(), no wakeup() is going to occur, because > here's > > what pipe_write() does... > > /* > > * We have no more space and have something to offer, > > * wake up select/poll. > > */ > > pipeselwakeup(wpipe); > > > > wpipe->pipe_state |=3D PIPE_WANTW; > > pipeunlock(wpipe); > > error =3D msleep(wpipe, PIPE_MTX(rpipe), > > PRIBIO | PCATCH, "pipewr", 0); > > pipelock(wpipe, 0); > > if (error !=3D 0) > > break; > > continue; > > Note that, once in msleep(), no call to pipeselwakeup() will occur unti= l > > it gets woken up. > > > > I think the current code assumes that the reader ("pv" in this case) wi= ll > > read all the data out of the pipe before calling select() again. > > Does it do that? > > > > rick > > ps: I've added markj@ as a cc, since he seems to have been the last guy > > involved > > in sys_pipe.c. > I was running into this (in a repeatable fashion) today with pv in a ZFS send/recv pipeline. I was able to recompile pv with debugging, and look at what is happening within pv -- it is setting its internal EOF flag for the output pipe. This is happening when it has a write(2) call return 0 (not _technically_ an error?) at https://codeberg.org/a-j-wood/pv/src/commit/42b623e9b078a6799e081587204a54f= 41e26da37/src/pv/transfer.c#L209 . Once this happens during a __transfer_write_repeated() call when nothing (within the function) has been written yet, it sets *eof_out =3D true at li= ne 711, and _never_ tries to write again. It still keeps select()-ing (I haven't run down why it's not aborting at this point if it really thinks it has an output it can't write to) like mad, however. Editing the pv source to instead handle a 0 return from write(2) similar to EINTR and EAGAIN _makes it work without a hiccup_ on the same send|pv|recv that was previously freezing up every time in a matter of seconds. (In this particular case, it's an initial receive of a tree with a number of filesystems & snapshots). So _if_ write(2) should ever return 0 is a question I'll leave to the kernel developers here, but that appears to be what is tripping up pv here, at least for me. I've submitted a pull request (with a reference back to this thread) at https://codeberg.org/a-j-wood/pv/pulls/92 I've been unable to force this to happen with synthetic workloads; I don't know if there is something in the zfs recv code that is setting / interacting with the pipe in a particular way to expose this behavior. - Eric --000000000000d91af5061d558015 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><div dir=3D"ltr">On Mon, Jul 15, 2024 at 9:50=E2=80=AFAM M= ark Johnston <<a href=3D"mailto:markj@freebsd.org" target=3D"_blank">mar= kj@freebsd.org</a>> wrote:</div><div dir=3D"ltr"><div class=3D"gmail_quo= te"><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;bor= der-left:1px solid rgb(204,204,204);padding-left:1ex">On Sun, Jul 14, 2024 = at 03:14:44PM -0700, Rick Macklem wrote:<br> > On Sun, Jul 14, 2024 at 10:32=E2=80=AFAM Garrett Wollman <<a href= =3D"mailto:wollman@bimajority.org" target=3D"_blank">wollman@bimajority.org= </a>><br> > wrote:<br> > <br> > > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem <<a h= ref=3D"mailto:rick.macklem@gmail.com" target=3D"_blank">rick.macklem@gmail.= com</a>><br> > > said:<br> > ><br> > > > Just to clarify it, are you saying zfs is sleeping on "= pipewr"?<br> > > > (There is also a msleep() for "pipbww" in pipe_wri= te().)<br> > ><br> > > It is sleeping on pipewr, yes.<br> > ><br> > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva<br> > > kern.ipc.pipekva: 536576<br> > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva<br> > > kern.ipc.maxpipekva: 2144993280<br> > ><br> > > It's not out of KVA, it's just waiting for the `pv` proce= ss to wake up<br> > > and read more data.=C2=A0 `pv` is single-threaded and blocked on = "select".<br> > ><br> > > It doesn't always get stuck in the same place, which is why I= 'm<br> > > suspecting a lost wakeup somewhere.<br> > ><br> > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me.= ..<br> >=C2=A0 =C2=A0/*<br> > * Direct copy, bypassing a kernel buffer.<br> > */<br> > } else if ((size =3D rpipe->pipe_pages.cnt) !=3D 0) {<br> > if (size > uio->uio_resid)<br> > size =3D (u_int) uio->uio_resid;<br> > PIPE_UNLOCK(rpipe);<br> > error =3D uiomove_fromphys(rpipe-><a href=3D"http://pipe_pages.ms" = rel=3D"noreferrer" target=3D"_blank">pipe_pages.ms</a>,<br> >=C2=A0 =C2=A0 =C2=A0rpipe->pipe_pages.pos, size, uio);<br> > PIPE_LOCK(rpipe);<br> > if (error)<br> > break;<br> > nread +=3D size;<br> > rpipe->pipe_pages.pos +=3D size;<br> > rpipe->pipe_pages.cnt -=3D size;<br> > if (rpipe->pipe_pages.cnt =3D=3D 0) {<br> > rpipe->pipe_state &=3D ~PIPE_WANTW;<br> > wakeup(rpipe);<br> > }<br> > If it reads uio_resid bytes which is less than pipe_pages.cnt, no<br> > wakeup() occurs.<br> > I'd be tempted to try getting rid of the "if (rpipe->pipe_= pages.cnt =3D=3D 0)"<br> > and do the wakeup() unconditionally, to see if it helps?<br> <br> I don't think that can help.=C2=A0 pipe_write() will block if the "= ;direct<br> write" buffer is non-empty.=C2=A0 See the comment in pipe_write(), &qu= ot;Pipe<br> buffered writes cannot be coincidental with direct writes".<br> <br> select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the= <br> read side, so I suspect that the problem is elsewhere, or else I'm<br> misunderstanding something.<br> <br> > Because if the application ("pv" in this case) doesn't d= o another read() on<br> > the<br> > pipe before calling select(), no wakeup() is going to occur, because h= ere's<br> > what pipe_write() does...<br> > /*<br> > * We have no more space and have something to offer,<br> > * wake up select/poll.<br> > */<br> > pipeselwakeup(wpipe);<br> > <br> > wpipe->pipe_state |=3D PIPE_WANTW;<br> > pipeunlock(wpipe);<br> > error =3D msleep(wpipe, PIPE_MTX(rpipe),<br> >=C2=A0 =C2=A0 =C2=A0PRIBIO | PCATCH, "pipewr", 0);<br> > pipelock(wpipe, 0);<br> > if (error !=3D 0)<br> > break;<br> > continue;<br> > Note that, once in msleep(), no call to pipeselwakeup() will occur unt= il<br> > it gets woken up.<br> > <br> > I think the current code assumes that the reader ("pv" in th= is case) will<br> > read all the data out of the pipe before calling select() again.<br> > Does it do that?<br> > <br> > rick<br> > ps: I've added markj@ as a cc, since he seems to have been the las= t guy<br> > involved<br> >=C2=A0 =C2=A0 =C2=A0in sys_pipe.c.<br></blockquote><div><br></div><div>= I was running into this (in a repeatable fashion) today with pv in a ZFS se= nd/recv pipeline. I was able to recompile pv with debugging, and look at wh= at is happening within pv -- it is setting its internal EOF flag for the ou= tput pipe.</div><div><br></div><div>This is happening when it has a write(2= ) call return 0 (not _technically_ an error?) at <a href=3D"https://codeber= g.org/a-j-wood/pv/src/commit/42b623e9b078a6799e081587204a54f41e26da37/src/p= v/transfer.c#L209">https://codeberg.org/a-j-wood/pv/src/commit/42b623e9b078= a6799e081587204a54f41e26da37/src/pv/transfer.c#L209</a>.=C2=A0</div><div><b= r></div><div>Once this happens during a=C2=A0__transfer_write_repeated() ca= ll when nothing (within the function) has been written yet, it sets *eof_ou= t =3D true at line 711, and _never_ tries to write again. It still keeps se= lect()-ing (I haven't run down why it's not aborting at this point = if it really thinks it has an output it can't write to) like mad, howev= er.<br></div><div><br></div><div>Editing the pv source to instead handle a = 0 return from write(2) similar to EINTR and EAGAIN _makes it work without a= hiccup_ on the same send|pv|recv that was previously freezing up every tim= e in a matter of seconds. (In this particular case, it's an initial rec= eive of a tree with a number of filesystems & snapshots).</div><div><br= ></div><div>So _if_ write(2) should ever return 0 is a question I'll le= ave to the kernel developers here, but that appears to be what is tripping = up pv here, at least for me.</div><div><br></div><div>I've submitted a = pull request (with a reference back to this thread) at <a href=3D"https://c= odeberg.org/a-j-wood/pv/pulls/92">https://codeberg.org/a-j-wood/pv/pulls/92= </a></div><div><br></div><div>I've been unable to force this to happen = with synthetic workloads; I don't know if there is something in the zfs= recv code that is setting / interacting with the pipe in a particular way = to expose this behavior.<br></div><div><br></div><div>=C2=A0 - Eric<br></di= v><div>=C2=A0</div></div></div> </div> --000000000000d91af5061d558015--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAASnNno-8vSkqx_WR7CUsMbL_inEhW8WL35pGrVaH1Edi65n7g>