Date:      Tue, 16 Jul 2024 09:36:19 -0400
From:      Mark Johnston <markj@freebsd.org>
To:        "Eric A. Borisch" <eborisch+FreeBSD@gmail.com>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>, rick.macklem@gmail.com, Garrett Wollman <wollman@bimajority.org>
Subject:   Re: Possible bug in zfs send or pipe implementation?
Message-ID:  <ZpZ3Uyg6nBhFP4i1@nuc>
In-Reply-To: <CAASnNno-8vSkqx_WR7CUsMbL_inEhW8WL35pGrVaH1Edi65n7g@mail.gmail.com>
References:  <CAASnNno-8vSkqx_WR7CUsMbL_inEhW8WL35pGrVaH1Edi65n7g@mail.gmail.com>

On Mon, Jul 15, 2024 at 11:08:44PM -0500, Eric A. Borisch wrote:
> On Mon, Jul 15, 2024 at 9:50 AM Mark Johnston <markj@freebsd.org> wrote:
> 
> > On Sun, Jul 14, 2024 at 03:14:44PM -0700, Rick Macklem wrote:
> > > On Sun, Jul 14, 2024 at 10:32 AM Garrett Wollman <wollman@bimajority.org>
> > > wrote:
> > >
> > > > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem <rick.macklem@gmail.com>
> > > > said:
> > > >
> > > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"?
> > > > > (There is also a msleep() for "pipbww" in pipe_write().)
> > > >
> > > > It is sleeping on pipewr, yes.
> > > >
> > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva
> > > > kern.ipc.pipekva: 536576
> > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva
> > > > kern.ipc.maxpipekva: 2144993280
> > > >
> > > > It's not out of KVA, it's just waiting for the `pv` process to wake up
> > > > and read more data.  `pv` is single-threaded and blocked on "select".
> > > >
> > > > It doesn't always get stuck in the same place, which is why I'm
> > > > suspecting a lost wakeup somewhere.
> > > >
> > > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me...
> > >     /*
> > >      * Direct copy, bypassing a kernel buffer.
> > >      */
> > >     } else if ((size = rpipe->pipe_pages.cnt) != 0) {
> > >         if (size > uio->uio_resid)
> > >             size = (u_int) uio->uio_resid;
> > >         PIPE_UNLOCK(rpipe);
> > >         error = uiomove_fromphys(rpipe->pipe_pages.ms,
> > >             rpipe->pipe_pages.pos, size, uio);
> > >         PIPE_LOCK(rpipe);
> > >         if (error)
> > >             break;
> > >         nread += size;
> > >         rpipe->pipe_pages.pos += size;
> > >         rpipe->pipe_pages.cnt -= size;
> > >         if (rpipe->pipe_pages.cnt == 0) {
> > >             rpipe->pipe_state &= ~PIPE_WANTW;
> > >             wakeup(rpipe);
> > >         }
> > > If it reads uio_resid bytes which is less than pipe_pages.cnt, no
> > > wakeup() occurs.
> > > I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt == 0)"
> > > and do the wakeup() unconditionally, to see if it helps?
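(For reference, the change being floated above would amount to something
like the following untested sketch, i.e., dropping the cnt == 0 check and
always waking the writer after a direct-copy read:)

        nread += size;
        rpipe->pipe_pages.pos += size;
        rpipe->pipe_pages.cnt -= size;
        /*
         * Untested sketch of the suggestion: always clear PIPE_WANTW and
         * wake a writer blocked in pipe_write(), even when some of the
         * direct-write data is still left in pipe_pages.
         */
        rpipe->pipe_state &= ~PIPE_WANTW;
        wakeup(rpipe);
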
> >
> > I don't think that can help.  pipe_write() will block if the "direct
> > write" buffer is non-empty.  See the comment in pipe_write(), "Pipe
> > buffered writes cannot be coincidental with direct writes".
> >
> > select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the
> > read side, so I suspect that the problem is elsewhere, or else I'm
> > misunderstanding something.
> >
> > > Because if the application ("pv" in this case) doesn't do another read() on
> > > the pipe before calling select(), no wakeup() is going to occur, because
> > > here's what pipe_write() does...
> > >     /*
> > >      * We have no more space and have something to offer,
> > >      * wake up select/poll.
> > >      */
> > >     pipeselwakeup(wpipe);
> > >
> > >     wpipe->pipe_state |= PIPE_WANTW;
> > >     pipeunlock(wpipe);
> > >     error = msleep(wpipe, PIPE_MTX(rpipe),
> > >         PRIBIO | PCATCH, "pipewr", 0);
> > >     pipelock(wpipe, 0);
> > >     if (error != 0)
> > >         break;
> > >     continue;
> > > Note that, once in msleep(), no call to pipeselwakeup() will occur until
> > > it gets woken up.
> > >
> > > I think the current code assumes that the reader ("pv" in this case) will
> > > read all the data out of the pipe before calling select() again.
> > > Does it do that?
> > >
> > > rick
> > > ps: I've added markj@ as a cc, since he seems to have been the last guy
> > > involved in sys_pipe.c.
> >
> 
> I was running into this (in a repeatable fashion) today with pv in a ZFS
> send/recv pipeline. I was able to recompile pv with debugging, and look at
> what is happening within pv -- it is setting its internal EOF flag for the
> output pipe.
> 
> This is happening when it has a write(2) call return 0 (not _technically_
> an error?) at
> https://codeberg.org/a-j-wood/pv/src/commit/42b623e9b078a6799e081587204a54f41e26da37/src/pv/transfer.c#L209
> .
> 
> Once this happens during a __transfer_write_repeated() call when nothing
> (within the function) has been written yet, it sets *eof_out = true at line
> 711, and _never_ tries to write again. It still keeps select()-ing like mad,
> however (I haven't run down why it's not aborting at this point if it really
> thinks it has an output it can't write to).
> 
> Editing the pv source to instead handle a 0 return from write(2) similar to
> EINTR and EAGAIN _makes it work without a hiccup_ on the same send|pv|recv
> that was previously freezing up every time in a matter of seconds. (In this
> particular case, it's an initial receive of a tree with a number of
> filesystems & snapshots).
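(For the archives, the workaround described above amounts to retrying when
write(2) reports 0 bytes written, the same way EINTR/EAGAIN are already
retried.  A self-contained sketch of that idea -- hypothetical code, not the
actual pv patch -- assuming a blocking descriptor:)

    #include <errno.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: write all of "buf", treating a 0 return from
     * write(2) like EINTR/EAGAIN, i.e. just retry.  Assumes a blocking
     * descriptor, so the retries do not busy-spin in practice.
     */
    static ssize_t
    write_all(int fd, const char *buf, size_t len)
    {
        size_t off = 0;

        while (off < len) {
            ssize_t n = write(fd, buf + off, len - off);

            if (n > 0)
                off += (size_t)n;
            else if (n == 0 || errno == EINTR || errno == EAGAIN)
                continue;               /* retry, including the odd 0 return */
            else
                return (-1);            /* real error; errno is set */
        }
        return ((ssize_t)off);
    }
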
> 
> So _whether_ write(2) should ever return 0 is a question I'll leave to the
> kernel developers here, but that appears to be what is tripping up pv,
> at least for me.

Looking at the sources, it seems that this is a blocking file
descriptor, i.e., O_NONBLOCK is not set.  Is that correct?  If so, the
kernel's behaviour seems wrong to me.  It would be useful to see a
ktrace of the process.
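(If it's easier than a full ktrace, something like the following debug
helper -- hypothetical, just for illustration, not pv code -- dropped next to
the failing write(2) call would also confirm whether the descriptor has
O_NONBLOCK set at that point:)

    #include <fcntl.h>
    #include <stdio.h>

    /*
     * Hypothetical debug helper: report whether "fd" currently has
     * O_NONBLOCK set, e.g. called just before the write(2) that returns 0.
     */
    static void
    report_nonblock(int fd)
    {
        int fl = fcntl(fd, F_GETFL);

        if (fl == -1)
            perror("fcntl(F_GETFL)");
        else
            fprintf(stderr, "fd %d: O_NONBLOCK %s\n", fd,
                (fl & O_NONBLOCK) != 0 ? "set" : "not set");
    }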

> I've submitted a pull request (with a reference back to this thread) at
> https://codeberg.org/a-j-wood/pv/pulls/92
> 
> I've been unable to force this to happen with synthetic workloads; I don't
> know if there is something in the zfs recv code that is setting /
> interacting with the pipe in a particular way to expose this behavior.
> 
>   - Eric


