Date:      Thu, 30 May 2024 09:53:59 -0400
From:      Warner Losh <imp@bsdimp.com>
To:        Kumara Babu <nkumarababu@gmail.com>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Upperlimit for bwait()
Message-ID:  <CANCZdfqUtDvpgTpHx3P5ENmSQ%2Bo=W%2B9X5x-G5Zgu2UOkF_iiGQ@mail.gmail.com>
In-Reply-To: <CAG6t_XAcUDK%2BpPHiUZ9Bwu2fE5wg6vwK_zcuEYe94sb15HnUPg@mail.gmail.com>
References:  <CAG6t_XAcUDK%2BpPHiUZ9Bwu2fE5wg6vwK_zcuEYe94sb15HnUPg@mail.gmail.com>

--00000000000096b2160619ac3307
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thu, May 30, 2024 at 1:54 AM Kumara Babu <nkumarababu@gmail.com> wrote:

> Hello,
>
> There have been a few incidents reported on Juniper devices with FreeBSD,
> where buffer IO operations sleep for more than 30 mins. Theoretically, this
> can happen due to faulty hardware or in virtual platforms due to faulty
> connection between guest and host, filesystem corruption, too many buffer
> IO operations, and/or host not responding due to various reasons. When that
> happens, as this buffer IO writes hold a lock before going to sleep, the
> threads waiting for that lock would starve for so long. There is no upper
> limit for this bwait() as of now. If that wait goes beyond 30 mins for a
> sleeping thread OR 15 mins for a thread blocked on turnstile, deadlkres
> crashes the kernel assuming a possible deadlock.
>
Why isn't the I/O timing out? That's the real problem.

> We perhaps could gracefully handle such lengthy buffer IO operations by
> adding a timeout in bwait() - like say 10 minutes. If the buffer IO is not
> completed in a few mins, it probably would not complete forever and/or
> would be slowing down the entire system. So it is better to stop such
> faulty IO operations.
>
I think that's a terrible idea. Why aren't the I/Os timing out?

> For now, since we had seen these instances only with BIO operations, I
> have a patch to set this value only from bufwait(). Please find the patch
> attached. I am not very sure if 10 mins is a good upper limit for all the
> scenarios for bwait(). If it is, then we could just change msleep() in
> bwait() to set a 10 mins upper limit by default.
>
I have never seen this on any of the thousands of systems I've used.

> Please let me know if this approach works for all the usecases - If not,
> is there a better alternative ?  And is 10 mins okay for a timeout ?
>
Making sure that the I/Os time out.

And by that, I mean doing what we do in CAM. All the SIMs ensure that
transactions posted to the device will time out. Most of the SIMs create a
per-transaction timeout which, when it expires, completes the CCB with a
timeout status. The periph drivers then see this status and fail the I/O
with a timed-out status, or maybe retry it a couple of times, depending on
the hardware and its recovery methods (e.g. whether the timeout is due to
the drive, the link, the HBA, etc. will result in different recovery in the
face of timeouts). NVMe nvd does similar things: a timeout will cause the
NVMe card to be reset and we try again, but eventually fail the I/O.
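
A minimal user-space sketch of that pattern, not CAM code: every
transaction carries its own timeout, an expired timer completes the
transaction with a timed-out status, and the layer above retries a bounded
number of times before failing the I/O. All names here (run_transaction,
do_io, the status strings, the default of two retries) are hypothetical,
and time is simulated rather than slept.

```python
from typing import Callable, Optional, Tuple

# Hypothetical status values for this model; these are not CAM status codes.
OK = "ok"
TIMED_OUT = "timed out"


def run_transaction(latency_s: Optional[float], timeout_s: float) -> Tuple[str, float]:
    """Complete one transaction: either the device answers or the timer fires.

    latency_s is the simulated completion time; None models a device that
    never answers. Returns (status, seconds spent on this attempt).
    """
    if latency_s is not None and latency_s <= timeout_s:
        return OK, latency_s
    # The per-transaction timer expired, so the transaction completes anyway,
    # carrying a timed-out status instead of hanging forever.
    return TIMED_OUT, timeout_s


def do_io(device: Callable[[int], Optional[float]],
          timeout_s: float = 30.0,
          retries: int = 2) -> Tuple[str, float]:
    """Issue an I/O, retrying on timeout, and fail it once retries run out.

    device(attempt) returns the simulated latency for that attempt.
    Returns (final status, total seconds elapsed across all attempts).
    """
    elapsed = 0.0
    for attempt in range(retries + 1):
        status, spent = run_transaction(device(attempt), timeout_s)
        elapsed += spent
        if status == OK:
            return OK, elapsed
    # Every attempt timed out: report the failure upward so the waiter wakes.
    return TIMED_OUT, elapsed
```

With a device that never answers, do_io(lambda attempt: None) returns a
timed-out status after a bounded time, so whoever is waiting on the buffer
gets woken with an error instead of sleeping indefinitely.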

One might also wonder why 30s is the timeout for most of the commands. I
get that 'special' commands might need a longer timeout, but we likely
should look at lowering this somewhat. 15s is almost certainly safe. 10s is
probably safe. 5s will work, but you start to get P99.9999 outliers on
popular completely working spinning rust, and P99.9 on marginal drives, so
it can be a bit tricky to change (we'll have to phase it in). That could
make things a bit better in terms of worst-case recovery time.
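
To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch. It assumes the worst case is one full timeout per attempt, a
hypothetical two retries, and that "P99.9999 outliers" means about one I/O
in a million exceeds the timeout on a healthy drive; the function names are
made up for illustration.

```python
def worst_case_recovery_s(timeout_s: float, retries: int) -> float:
    """Longest time before an I/O is failed: every attempt runs to its timeout."""
    return timeout_s * (retries + 1)


def expected_outliers(n_ios: int, one_in: int) -> float:
    """Expected number of I/Os exceeding the timeout if 1 in `one_in` does.

    P99.9999 corresponds to one_in = 1_000_000; P99.9 to one_in = 1_000.
    """
    return n_ios / one_in


# Hypothetical workload: 100 I/Os per second for one day.
IOS_PER_DAY = 100 * 86_400

for timeout_s in (30, 15, 10, 5):
    print(timeout_s, "s timeout ->", worst_case_recovery_s(timeout_s, 2), "s worst case")

print("healthy drive, spurious timeouts/day:", expected_outliers(IOS_PER_DAY, 1_000_000))
print("marginal drive, spurious timeouts/day:", expected_outliers(IOS_PER_DAY, 1_000))
```

Under those assumptions, dropping from 30s to 5s cuts the worst case from
90s to 15s, at the cost of a handful of spurious timeouts per day on good
drives and thousands on marginal ones, which is why it would have to be
phased in.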

So why the I/Os aren't timing out is the real question here.

Warner


> Thanks and Regards,
>
> Kumara
>

--00000000000096b2160619ac3307--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfqUtDvpgTpHx3P5ENmSQ%2Bo=W%2B9X5x-G5Zgu2UOkF_iiGQ>