Date: Mon, 15 Jul 2024 10:50:12 -0400 From: Mark Johnston <markj@freebsd.org> To: Rick Macklem <rick.macklem@gmail.com> Cc: Garrett Wollman <wollman@bimajority.org>, freebsd-stable@freebsd.org Subject: Re: Possible bug in zfs send or pipe implementation? Message-ID: <ZpU3JOtTL_-E5FY0@nuc> In-Reply-To: <CAM5tNy7hM66buzFUMTz6t8sMheSyOnwQ_R-S_CQ8Kbf9Bc1OHQ@mail.gmail.com> References: <26259.12713.114036.564205@hergotha.csail.mit.edu> <CAM5tNy4pPF9mHdXM5W6gjztm4_TtFfXnOLu3cdkqvaRf3Ab5uA@mail.gmail.com> <26259.17366.276955.824313@hergotha.csail.mit.edu> <CAM5tNy5m9-rqNTWJP5wAs9FewB6=FW9XbAe4V7qLgnrYJkSFKA@mail.gmail.com> <26260.2984.961319.782123@hergotha.csail.mit.edu> <CAM5tNy7hM66buzFUMTz6t8sMheSyOnwQ_R-S_CQ8Kbf9Bc1OHQ@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, Jul 14, 2024 at 03:14:44PM -0700, Rick Macklem wrote: > On Sun, Jul 14, 2024 at 10:32 AM Garrett Wollman <wollman@bimajority.org> > wrote: > > > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem <rick.macklem@gmail.com> > > said: > > > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"? > > > (There is also a msleep() for "pipbww" in pipe_write().) > > > > It is sleeping on pipewr, yes. > > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva > > kern.ipc.pipekva: 536576 > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva > > kern.ipc.maxpipekva: 2144993280 > > > > It's not out of KVA, it's just waiting for the `pv` process to wake up > > and read more data. `pv` is single-threaded and blocked on "select". > > > > It doesn't always get stuck in the same place, which is why I'm > > suspecting a lost wakeup somewhere. > > > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me... > /* > * Direct copy, bypassing a kernel buffer. > */ > } else if ((size = rpipe->pipe_pages.cnt) != 0) { > if (size > uio->uio_resid) > size = (u_int) uio->uio_resid; > PIPE_UNLOCK(rpipe); > error = uiomove_fromphys(rpipe->pipe_pages.ms, > rpipe->pipe_pages.pos, size, uio); > PIPE_LOCK(rpipe); > if (error) > break; > nread += size; > rpipe->pipe_pages.pos += size; > rpipe->pipe_pages.cnt -= size; > if (rpipe->pipe_pages.cnt == 0) { > rpipe->pipe_state &= ~PIPE_WANTW; > wakeup(rpipe); > } > If it reads uio_resid bytes which is less than pipe_pages.cnt, no > wakeup() occurs. > I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt == 0)" > and do the wakeup() unconditionally, to see if it helps? I don't think that can help. pipe_write() will block if the "direct write" buffer is non-empty. See the comment in pipe_write(), "Pipe buffered writes cannot be coincidental with direct writes". select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the read side, so I suspect that the problem is elsewhere, or else I'm misunderstanding something. > Because if the application ("pv" in this case) doesn't do another read() on > the > pipe before calling select(), no wakeup() is going to occur, because here's > what pipe_write() does... > /* > * We have no more space and have something to offer, > * wake up select/poll. > */ > pipeselwakeup(wpipe); > > wpipe->pipe_state |= PIPE_WANTW; > pipeunlock(wpipe); > error = msleep(wpipe, PIPE_MTX(rpipe), > PRIBIO | PCATCH, "pipewr", 0); > pipelock(wpipe, 0); > if (error != 0) > break; > continue; > Note that, once in msleep(), no call to pipeselwakeup() will occur until > it gets woken up. > > I think the current code assumes that the reader ("pv" in this case) will > read all the data out of the pipe before calling select() again. > Does it do that? > > rick > ps: I've added markj@ as a cc, since he seems to have been the last guy > involved > in sys_pipe.c.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?ZpU3JOtTL_-E5FY0>