Date: Tue, 5 Apr 2022 18:24:40 -0600
From: Warner Losh <imp@bsdimp.com>
To: Alan Somers <asomers@freebsd.org>
Cc: freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: Hour-long sleeps in the ZFS write throttle: fix for 13.1 ?
Message-ID: <CANCZdfpnE2S2uAdy81KL4mmJLAu_b2gjn59Eh%2BesOZswM8eX8A@mail.gmail.com>
In-Reply-To: <CAOtMX2j9_saonWpyUERdkKj-cPdWzsyWNGQSUcEDOa8nBF3r=w@mail.gmail.com>
References: <CAOtMX2j9_saonWpyUERdkKj-cPdWzsyWNGQSUcEDOa8nBF3r=w@mail.gmail.com>

On Tue, Apr 5, 2022 at 3:06 PM Alan Somers <asomers@freebsd.org> wrote:

> All year long I've occasionally seen my ZFS processes get blocked in
> dmu_tx_wait. They stay blocked for more than an hour but eventually
> recover. I finally found the cause: an integer overflow bug in
> ustosbt. The fix is simple enough, but my question is: should we try
> to commit this in time for 13.1-RELEASE? It's a very disruptive bug,
> but also very hard to trigger. It takes a pretty highly congested ZFS
> system to trigger it. In theory the bug could affect other
> subsystems, too.
>
> https://github.com/openzfs/zfs/issues/13289
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263073

These routines were originally not meant for large times (> 1s). However,
that was poorly documented and so I fixed it. But did so incorrectly.
If you look at the bug, I've posted what I think is the fix (it also matches
Alan's description).

Warner
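
For readers who want to see the class of overflow being discussed, here is a
minimal sketch. It is not the FreeBSD source: it assumes a signed 64-bit,
32.32 fixed-point time value and a scaled multiply by ceil(2^63 / 10^6), and
the names sbt_t, SBT_ONE_SEC, US_SCALE, us_to_sbt_naive and us_to_sbt_split
are hypothetical. It only illustrates why a single multiply is safe for
sub-second inputs and wraps for larger ones, and how splitting off whole
seconds first avoids that.

```c
#include <stdint.h>

/* Hypothetical 32.32 fixed-point time: upper 32 bits are whole seconds. */
typedef int64_t sbt_t;
#define SBT_ONE_SEC ((sbt_t)1 << 32)

/* Assumed scale factor: ceil(2^63 / 1000000). */
#define US_SCALE 9223372036855ULL

/*
 * Single scaled multiply. The product us * US_SCALE only fits in 64 bits
 * while us is at most roughly 2,000,000, so larger inputs wrap around.
 * Unsigned arithmetic is used so the wrap is well defined in C.
 */
static sbt_t
us_to_sbt_naive(int64_t us)
{
	return ((sbt_t)((((uint64_t)us * US_SCALE) + 0x7fffffff) >> 31));
}

/*
 * Split off whole seconds first, then scale only the sub-second
 * remainder, which is always below 1,000,000 and cannot overflow.
 */
static sbt_t
us_to_sbt_split(int64_t us)
{
	sbt_t sb = 0;

	if (us >= 1000000) {
		sb = (us / 1000000) * SBT_ONE_SEC;
		us %= 1000000;
	}
	sb += (sbt_t)((((uint64_t)us * US_SCALE) + 0x7fffffff) >> 31);
	return (sb);
}
```

With a 10 second input (10000000 us), the split version returns exactly
10 * SBT_ONE_SEC, while the naive version wraps and returns a much smaller
value. A caller that later computes a sleep interval from such a wrapped
value can end up with a wildly wrong duration.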