Date: Tue, 16 Jul 2024 09:36:19 -0400
From: Mark Johnston <markj@freebsd.org>
To: "Eric A. Borisch" <eborisch+FreeBSD@gmail.com>
Cc: FreeBSD Stable <freebsd-stable@freebsd.org>, rick.macklem@gmail.com,
    Garrett Wollman <wollman@bimajority.org>
Subject: Re: Possible bug in zfs send or pipe implementation?
Message-ID: <ZpZ3Uyg6nBhFP4i1@nuc>
In-Reply-To: <CAASnNno-8vSkqx_WR7CUsMbL_inEhW8WL35pGrVaH1Edi65n7g@mail.gmail.com>
References: <CAASnNno-8vSkqx_WR7CUsMbL_inEhW8WL35pGrVaH1Edi65n7g@mail.gmail.com>
On Mon, Jul 15, 2024 at 11:08:44PM -0500, Eric A. Borisch wrote:
> On Mon, Jul 15, 2024 at 9:50 AM Mark Johnston <markj@freebsd.org> wrote:
> > On Sun, Jul 14, 2024 at 03:14:44PM -0700, Rick Macklem wrote:
> > > On Sun, Jul 14, 2024 at 10:32 AM Garrett Wollman
> > > <wollman@bimajority.org> wrote:
> > > > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem
> > > > <rick.macklem@gmail.com> said:
> > > >
> > > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"?
> > > > > (There is also a msleep() for "pipbww" in pipe_write().)
> > > >
> > > > It is sleeping on pipewr, yes.
> > > >
> > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva
> > > > kern.ipc.pipekva: 536576
> > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva
> > > > kern.ipc.maxpipekva: 2144993280
> > > >
> > > > It's not out of KVA; it's just waiting for the `pv` process to wake up
> > > > and read more data.  `pv` is single-threaded and blocked on "select".
> > > >
> > > > It doesn't always get stuck in the same place, which is why I'm
> > > > suspecting a lost wakeup somewhere.
> > >
> > > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me...
> > >
> > > 	/*
> > > 	 * Direct copy, bypassing a kernel buffer.
> > > 	 */
> > > 	} else if ((size = rpipe->pipe_pages.cnt) != 0) {
> > > 		if (size > uio->uio_resid)
> > > 			size = (u_int) uio->uio_resid;
> > >
> > > 		PIPE_UNLOCK(rpipe);
> > > 		error = uiomove_fromphys(rpipe->pipe_pages.ms,
> > > 		    rpipe->pipe_pages.pos, size, uio);
> > > 		PIPE_LOCK(rpipe);
> > > 		if (error)
> > > 			break;
> > > 		nread += size;
> > > 		rpipe->pipe_pages.pos += size;
> > > 		rpipe->pipe_pages.cnt -= size;
> > > 		if (rpipe->pipe_pages.cnt == 0) {
> > > 			rpipe->pipe_state &= ~PIPE_WANTW;
> > > 			wakeup(rpipe);
> > > 		}
> > >
> > > If it reads uio_resid bytes, which is less than pipe_pages.cnt, no
> > > wakeup() occurs.
> > > I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt == 0)"
> > > and do the wakeup() unconditionally, to see if it helps?
> >
> > I don't think that can help.  pipe_write() will block if the "direct
> > write" buffer is non-empty.  See the comment in pipe_write(), "Pipe
> > buffered writes cannot be coincidental with direct writes".
> >
> > select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the
> > read side, so I suspect that the problem is elsewhere, or else I'm
> > misunderstanding something.
> >
> > > Because if the application ("pv" in this case) doesn't do another read()
> > > on the pipe before calling select(), no wakeup() is going to occur,
> > > because here's what pipe_write() does...
> > >
> > > 	/*
> > > 	 * We have no more space and have something to offer,
> > > 	 * wake up select/poll.
> > > 	 */
> > > 	pipeselwakeup(wpipe);
> > >
> > > 	wpipe->pipe_state |= PIPE_WANTW;
> > > 	pipeunlock(wpipe);
> > > 	error = msleep(wpipe, PIPE_MTX(rpipe),
> > > 	    PRIBIO | PCATCH, "pipewr", 0);
> > > 	pipelock(wpipe, 0);
> > > 	if (error != 0)
> > > 		break;
> > > 	continue;
> > >
> > > Note that, once in msleep(), no call to pipeselwakeup() will occur until
> > > it gets woken up.
> > >
> > > I think the current code assumes that the reader ("pv" in this case) will
> > > read all the data out of the pipe before calling select() again.
> > > Does it do that?
> > >
> > > rick
> > > ps: I've added markj@ as a cc, since he seems to have been the last guy
> > > involved in sys_pipe.c.
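As an aside, if anyone wants to poke at the partial-read-then-select()
theory above, a synthetic test along the following lines might be a
starting point.  This is an untested sketch, not a known reproducer:
the buffer sizes and iteration count are arbitrary, and whether the
writes are actually large enough to take the direct-write path is an
assumption.

	#include <sys/select.h>

	#include <err.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int
	main(void)
	{
		static char buf[128 * 1024];	/* large writes, hopefully direct */
		char rbuf[512];			/* deliberately tiny reads */
		fd_set rset;
		ssize_t n;
		int fds[2];
		pid_t pid;

		if (pipe(fds) != 0)
			err(1, "pipe");
		if ((pid = fork()) == -1)
			err(1, "fork");
		if (pid == 0) {
			/* Reader: never drain the pipe in one read(),
			 * and select() between reads, like pv does. */
			close(fds[1]);
			for (;;) {
				FD_ZERO(&rset);
				FD_SET(fds[0], &rset);
				if (select(fds[0] + 1, &rset, NULL, NULL,
				    NULL) == -1)
					err(1, "select");
				if ((n = read(fds[0], rbuf, sizeof(rbuf))) == -1)
					err(1, "read");
				if (n == 0)	/* writer closed its end */
					break;
			}
			_exit(0);
		}
		/* Writer: large blocking writes; flag the anomaly pv hit. */
		close(fds[0]);
		memset(buf, 'x', sizeof(buf));
		for (int i = 0; i < 10000; i++) {
			if ((n = write(fds[1], buf, sizeof(buf))) == -1)
				err(1, "write");
			if (n == 0)
				fprintf(stderr, "write(2) returned 0\n");
		}
		close(fds[1]);
		return (0);
	}

If the writer hangs in "pipewr" with the reader blocked in select(),
that would support the lost-wakeup theory; if it runs to completion,
the problem is presumably elsewhere.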
> I was running into this (in a repeatable fashion) today with pv in a ZFS
> send/recv pipeline.
>
> I was able to recompile pv with debugging and look at what is happening
> within pv -- it is setting its internal EOF flag for the output pipe.
>
> This is happening when it has a write(2) call return 0 (not _technically_
> an error?) at
> https://codeberg.org/a-j-wood/pv/src/commit/42b623e9b078a6799e081587204a54f41e26da37/src/pv/transfer.c#L209
>
> Once this happens during a __transfer_write_repeated() call when nothing
> (within the function) has been written yet, it sets *eof_out = true at
> line 711 and _never_ tries to write again.  It still keeps select()-ing
> like mad, however (I haven't run down why it's not aborting at this point
> if it really thinks it has an output it can't write to).
>
> Editing the pv source to instead handle a 0 return from write(2) similarly
> to EINTR and EAGAIN _makes it work without a hiccup_ on the same
> send|pv|recv pipeline that was previously freezing up every time within a
> matter of seconds.  (In this particular case, it's an initial receive of a
> tree with a number of filesystems & snapshots.)
>
> So _whether_ write(2) should ever return 0 is a question I'll leave to the
> kernel developers here, but that appears to be what is tripping up pv, at
> least for me.

Looking at the sources, it seems that this is a blocking file descriptor,
i.e., O_NONBLOCK is not set.  Is that correct?  If so, the kernel's
behaviour seems wrong to me.  It would be useful to see a ktrace of the
process.

> I've submitted a pull request (with a reference back to this thread) at
> https://codeberg.org/a-j-wood/pv/pulls/92
>
> I've been unable to force this to happen with synthetic workloads; I don't
> know if there is something in the zfs recv code that is setting /
> interacting with the pipe in a particular way to expose this behavior.
>
>   - Eric
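For reference, the fix Eric describes on the pv side amounts to retrying
when write(2) returns 0 instead of latching EOF, the same way EINTR and
EAGAIN are retried.  A minimal sketch of that pattern follows;
write_all() is a hypothetical helper for illustration, not pv's actual
code, and a real nonblocking consumer would select() before retrying on
EAGAIN rather than spinning.

	#include <errno.h>
	#include <unistd.h>

	/*
	 * Write the whole buffer, retrying on EINTR/EAGAIN and also when
	 * write(2) returns 0, rather than treating 0 as end-of-file.
	 * Hypothetical helper for illustration only.
	 */
	static ssize_t
	write_all(int fd, const char *buf, size_t len)
	{
		size_t off = 0;
		ssize_t n;

		while (off < len) {
			n = write(fd, buf + off, len - off);
			if (n > 0) {
				off += (size_t)n;
				continue;
			}
			if (n == 0 || errno == EINTR || errno == EAGAIN)
				continue;	/* transient; retry, not EOF */
			return (-1);		/* hard error */
		}
		return ((ssize_t)off);
	}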