From nobody Tue Jul 16 04:08:44 2024 X-Original-To: freebsd-stable@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4WNQYM3Hlgz5QvMX for ; Tue, 16 Jul 2024 04:08:55 +0000 (UTC) (envelope-from eborisch@gmail.com) Received: from mail-qv1-xf32.google.com (mail-qv1-xf32.google.com [IPv6:2607:f8b0:4864:20::f32]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "WR4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4WNQYL04gDz4bTN; Tue, 16 Jul 2024 04:08:54 +0000 (UTC) (envelope-from eborisch@gmail.com) Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20230601 header.b=DZ5Hzqvg; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (mx1.freebsd.org: domain of eborisch@gmail.com designates 2607:f8b0:4864:20::f32 as permitted sender) smtp.mailfrom=eborisch@gmail.com Received: by mail-qv1-xf32.google.com with SMTP id 6a1803df08f44-6b5e4466931so27567586d6.0; Mon, 15 Jul 2024 21:08:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1721102931; x=1721707731; darn=freebsd.org; h=cc:to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=Hu/h+y+p+l9ZHejB8i0XnxcNzOWNrin2+nZZ02IJ6Sc=; b=DZ5Hzqvg6DiLxAKk3LvwM9B+TxgUH4bTInfPwPPaZZdHVKlq5QjMYCRdXiQkP0puGh n8Jog261rcbD+J44SjskdUrObCGI0L5scXpVTuzBhNu4vwM78+ck1tOIeAJM0BKLP79K fLnzDORweT3rnhfgGoSIyuOcLTP2sbQMUVFwWatyEF9Orv9iWzxlKiN4av3oaLbw5aF7 sF8wOFWPqFZuZXI5hMPt7IXUUxN7XOqAP08A8/HocLPzmvpmPOurAk/tlR0A5NLLUP/D q5v2fdQ8fGbaSxF70UeamZWenqfPTuZbLxLjGSE9fgHaZfOMbsrPxUJkO0YzIn0DlEW8 oreg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1721102931; x=1721707731; h=cc:to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=Hu/h+y+p+l9ZHejB8i0XnxcNzOWNrin2+nZZ02IJ6Sc=; b=Vy/WUg++kRdLKbZJHjavGY+ZF2e5uX+R9h5lhKBYwlZb0rvEReZmlh1RLGk7bxUb12 d6XWR53gU2GM3qFrr/i4K9b1OkU9uOD+3z2sJARAvAUBSxeuf8ZwXX9kpORVtp6GObtN kOKCH8s+jttqz7OATmG/tH5Dv4kugaM19KLlhbBxPLSfzFYD+C8hSypE6vIx4IxE5MBK RAmUDCe5n3RlzATX/p1fBRsetICYD4n9Qfu7HuADvUYFloEWAP39v0kpRZtgX+kGG+rH j4xFSTdAC30tbBkHxmjgQusz7SFhmHFegDSELZBXfH34ZC3oYzF1mikoMGHLu1s1G1Ww 985A== X-Gm-Message-State: AOJu0YzNjVYvdvH7TDD/bZRXk8Vg0n8snK2gU1AR0O6iobYJcfuTxvjU MKdSdjpZ2TZAqWr6lBW3/g7VD3Tj068fboGZB/3ZGMC2+MTLXhHCLBnpCYBb1HV36HRpGIUHFRk c/Ba4kJGWwT3NRXZIdCtulf7gCiOezHWY X-Google-Smtp-Source: AGHT+IGwxVdcxN39umpm86a/4+S3k0TdGZCQTjs529oOpcMq9fwSDYq9ydonoEry33NTWZl3ncjaAFr3taanUjM0t7I= X-Received: by 2002:a05:6214:c44:b0:6b0:774b:38c2 with SMTP id 6a1803df08f44-6b77f3844cbmr15270266d6.6.1721102931398; Mon, 15 Jul 2024 21:08:51 -0700 (PDT) List-Id: Production branch of FreeBSD source code List-Archive: https://lists.freebsd.org/archives/freebsd-stable List-Help: List-Post: List-Subscribe: List-Unsubscribe: X-BeenThere: freebsd-stable@freebsd.org Sender: owner-freebsd-stable@FreeBSD.org MIME-Version: 1.0 From: "Eric A. Borisch" Date: Mon, 15 Jul 2024 23:08:44 -0500 Message-ID: Subject: Re: Possible bug in zfs send or pipe implementation? To: Mark Johnston Cc: FreeBSD Stable , rick.macklem@gmail.com, Garrett Wollman Content-Type: multipart/alternative; boundary="000000000000d91af5061d558015" X-Spamd-Bar: - X-Spamd-Result: default: False [-1.99 / 15.00]; SUBJECT_ENDS_QUESTION(1.00)[]; FAKE_REPLY(1.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-0.99)[-0.992]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20230601]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim]; TAGGED_FROM(0.00)[FreeBSD]; RCVD_TLS_LAST(0.00)[]; TO_DN_SOME(0.00)[]; FREEMAIL_FROM(0.00)[gmail.com]; FREEMAIL_CC(0.00)[freebsd.org,gmail.com,bimajority.org]; ARC_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; FROM_HAS_DN(0.00)[]; MISSING_XM_UA(0.00)[]; RCPT_COUNT_THREE(0.00)[4]; MLMMJ_DEST(0.00)[freebsd-stable@freebsd.org]; TO_MATCH_ENVRCPT_SOME(0.00)[]; FROM_EQ_ENVFROM(0.00)[]; DKIM_TRACE(0.00)[gmail.com:+]; MID_RHS_MATCH_FROMTLD(0.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; TAGGED_RCPT(0.00)[]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::f32:from]; RCVD_COUNT_ONE(0.00)[1]; FREEMAIL_ENVFROM(0.00)[gmail.com] X-Rspamd-Queue-Id: 4WNQYL04gDz4bTN --000000000000d91af5061d558015 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Jul 15, 2024 at 9:50=E2=80=AFAM Mark Johnston w= rote: > On Sun, Jul 14, 2024 at 03:14:44PM -0700, Rick Macklem wrote: > > On Sun, Jul 14, 2024 at 10:32=E2=80=AFAM Garrett Wollman > > > wrote: > > > > > < rick.macklem@gmail.com> > > > said: > > > > > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"? > > > > (There is also a msleep() for "pipbww" in pipe_write().) > > > > > > It is sleeping on pipewr, yes. > > > > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva > > > kern.ipc.pipekva: 536576 > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva > > > kern.ipc.maxpipekva: 2144993280 > > > > > > It's not out of KVA, it's just waiting for the `pv` process to wake u= p > > > and read more data. `pv` is single-threaded and blocked on "select". > > > > > > It doesn't always get stuck in the same place, which is why I'm > > > suspecting a lost wakeup somewhere. > > > > > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me..= . > > /* > > * Direct copy, bypassing a kernel buffer. > > */ > > } else if ((size =3D rpipe->pipe_pages.cnt) !=3D 0) { > > if (size > uio->uio_resid) > > size =3D (u_int) uio->uio_resid; > > PIPE_UNLOCK(rpipe); > > error =3D uiomove_fromphys(rpipe->pipe_pages.ms, > > rpipe->pipe_pages.pos, size, uio); > > PIPE_LOCK(rpipe); > > if (error) > > break; > > nread +=3D size; > > rpipe->pipe_pages.pos +=3D size; > > rpipe->pipe_pages.cnt -=3D size; > > if (rpipe->pipe_pages.cnt =3D=3D 0) { > > rpipe->pipe_state &=3D ~PIPE_WANTW; > > wakeup(rpipe); > > } > > If it reads uio_resid bytes which is less than pipe_pages.cnt, no > > wakeup() occurs. > > I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt =3D= =3D > 0)" > > and do the wakeup() unconditionally, to see if it helps? > > I don't think that can help. pipe_write() will block if the "direct > write" buffer is non-empty. See the comment in pipe_write(), "Pipe > buffered writes cannot be coincidental with direct writes". > > select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the > read side, so I suspect that the problem is elsewhere, or else I'm > misunderstanding something. > > > Because if the application ("pv" in this case) doesn't do another read(= ) > on > > the > > pipe before calling select(), no wakeup() is going to occur, because > here's > > what pipe_write() does... > > /* > > * We have no more space and have something to offer, > > * wake up select/poll. > > */ > > pipeselwakeup(wpipe); > > > > wpipe->pipe_state |=3D PIPE_WANTW; > > pipeunlock(wpipe); > > error =3D msleep(wpipe, PIPE_MTX(rpipe), > > PRIBIO | PCATCH, "pipewr", 0); > > pipelock(wpipe, 0); > > if (error !=3D 0) > > break; > > continue; > > Note that, once in msleep(), no call to pipeselwakeup() will occur unti= l > > it gets woken up. > > > > I think the current code assumes that the reader ("pv" in this case) wi= ll > > read all the data out of the pipe before calling select() again. > > Does it do that? > > > > rick > > ps: I've added markj@ as a cc, since he seems to have been the last guy > > involved > > in sys_pipe.c. > I was running into this (in a repeatable fashion) today with pv in a ZFS send/recv pipeline. I was able to recompile pv with debugging, and look at what is happening within pv -- it is setting its internal EOF flag for the output pipe. This is happening when it has a write(2) call return 0 (not _technically_ an error?) at https://codeberg.org/a-j-wood/pv/src/commit/42b623e9b078a6799e081587204a54f= 41e26da37/src/pv/transfer.c#L209 . Once this happens during a __transfer_write_repeated() call when nothing (within the function) has been written yet, it sets *eof_out =3D true at li= ne 711, and _never_ tries to write again. It still keeps select()-ing (I haven't run down why it's not aborting at this point if it really thinks it has an output it can't write to) like mad, however. Editing the pv source to instead handle a 0 return from write(2) similar to EINTR and EAGAIN _makes it work without a hiccup_ on the same send|pv|recv that was previously freezing up every time in a matter of seconds. (In this particular case, it's an initial receive of a tree with a number of filesystems & snapshots). So _if_ write(2) should ever return 0 is a question I'll leave to the kernel developers here, but that appears to be what is tripping up pv here, at least for me. I've submitted a pull request (with a reference back to this thread) at https://codeberg.org/a-j-wood/pv/pulls/92 I've been unable to force this to happen with synthetic workloads; I don't know if there is something in the zfs recv code that is setting / interacting with the pipe in a particular way to expose this behavior. - Eric --000000000000d91af5061d558015 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Mon, Jul 15, 2024 at 9:50=E2=80=AFAM M= ark Johnston <mar= kj@freebsd.org> wrote:
On Sun, Jul 14, 2024 = at 03:14:44PM -0700, Rick Macklem wrote:
> On Sun, Jul 14, 2024 at 10:32=E2=80=AFAM Garrett Wollman <wollman@bimajority.org= >
> wrote:
>
> > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem <rick.macklem@gmail.= com>
> > said:
> >
> > > Just to clarify it, are you saying zfs is sleeping on "= pipewr"?
> > > (There is also a msleep() for "pipbww" in pipe_wri= te().)
> >
> > It is sleeping on pipewr, yes.
> >
> > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva
> > kern.ipc.pipekva: 536576
> > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva
> > kern.ipc.maxpipekva: 2144993280
> >
> > It's not out of KVA, it's just waiting for the `pv` proce= ss to wake up
> > and read more data.=C2=A0 `pv` is single-threaded and blocked on = "select".
> >
> > It doesn't always get stuck in the same place, which is why I= 'm
> > suspecting a lost wakeup somewhere.
> >
> This snippet from sys/kern/sys_pipe.c looks a little suspicious to me.= ..
>=C2=A0 =C2=A0/*
> * Direct copy, bypassing a kernel buffer.
> */
> } else if ((size =3D rpipe->pipe_pages.cnt) !=3D 0) {
> if (size > uio->uio_resid)
> size =3D (u_int) uio->uio_resid;
> PIPE_UNLOCK(rpipe);
> error =3D uiomove_fromphys(rpipe->pipe_pages.ms,
>=C2=A0 =C2=A0 =C2=A0rpipe->pipe_pages.pos, size, uio);
> PIPE_LOCK(rpipe);
> if (error)
> break;
> nread +=3D size;
> rpipe->pipe_pages.pos +=3D size;
> rpipe->pipe_pages.cnt -=3D size;
> if (rpipe->pipe_pages.cnt =3D=3D 0) {
> rpipe->pipe_state &=3D ~PIPE_WANTW;
> wakeup(rpipe);
> }
> If it reads uio_resid bytes which is less than pipe_pages.cnt, no
> wakeup() occurs.
> I'd be tempted to try getting rid of the "if (rpipe->pipe_= pages.cnt =3D=3D 0)"
> and do the wakeup() unconditionally, to see if it helps?

I don't think that can help.=C2=A0 pipe_write() will block if the "= ;direct
write" buffer is non-empty.=C2=A0 See the comment in pipe_write(), &qu= ot;Pipe
buffered writes cannot be coincidental with direct writes".

select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the=
read side, so I suspect that the problem is elsewhere, or else I'm
misunderstanding something.

> Because if the application ("pv" in this case) doesn't d= o another read() on
> the
> pipe before calling select(), no wakeup() is going to occur, because h= ere's
> what pipe_write() does...
> /*
> * We have no more space and have something to offer,
> * wake up select/poll.
> */
> pipeselwakeup(wpipe);
>
> wpipe->pipe_state |=3D PIPE_WANTW;
> pipeunlock(wpipe);
> error =3D msleep(wpipe, PIPE_MTX(rpipe),
>=C2=A0 =C2=A0 =C2=A0PRIBIO | PCATCH, "pipewr", 0);
> pipelock(wpipe, 0);
> if (error !=3D 0)
> break;
> continue;
> Note that, once in msleep(), no call to pipeselwakeup() will occur unt= il
> it gets woken up.
>
> I think the current code assumes that the reader ("pv" in th= is case) will
> read all the data out of the pipe before calling select() again.
> Does it do that?
>
> rick
> ps: I've added markj@ as a cc, since he seems to have been the las= t guy
> involved
>=C2=A0 =C2=A0 =C2=A0in sys_pipe.c.

= I was running into this (in a repeatable fashion) today with pv in a ZFS se= nd/recv pipeline. I was able to recompile pv with debugging, and look at wh= at is happening within pv -- it is setting its internal EOF flag for the ou= tput pipe.

This is happening when it has a write(2= ) call return 0 (not _technically_ an error?) at https://codeberg.org/a-j-wood/pv/src/commit/42b623e9b078= a6799e081587204a54f41e26da37/src/pv/transfer.c#L209.=C2=A0
Once this happens during a=C2=A0__transfer_write_repeated() ca= ll when nothing (within the function) has been written yet, it sets *eof_ou= t =3D true at line 711, and _never_ tries to write again. It still keeps se= lect()-ing (I haven't run down why it's not aborting at this point = if it really thinks it has an output it can't write to) like mad, howev= er.

Editing the pv source to instead handle a = 0 return from write(2) similar to EINTR and EAGAIN _makes it work without a= hiccup_ on the same send|pv|recv that was previously freezing up every tim= e in a matter of seconds. (In this particular case, it's an initial rec= eive of a tree with a number of filesystems & snapshots).
So _if_ write(2) should ever return 0 is a question I'll le= ave to the kernel developers here, but that appears to be what is tripping = up pv here, at least for me.

I've submitted a = pull request (with a reference back to this thread) at https://codeberg.org/a-j-wood/pv/pulls/92=

I've been unable to force this to happen = with synthetic workloads; I don't know if there is something in the zfs= recv code that is setting / interacting with the pipe in a particular way = to expose this behavior.

=C2=A0 - Eric
=C2=A0
--000000000000d91af5061d558015--