Date:      Wed, 1 Dec 2021 13:47:01 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Warner Losh <imp@bsdimp.com>
Cc:        FreeBSD <freebsd-stable@freebsd.org>
Subject:   Re: ZFS deadlocks triggered by HDD timeouts
Message-ID:  <CAOtMX2gnEgGn-h16UJHhrS79ypH357=r2R0DaYAa1J-TOGAKCQ@mail.gmail.com>
In-Reply-To: <CANCZdfqfcbObUUonrEdNViJ-5xvU%2BFeYT%2BapHwmTpiHmfBVaXg@mail.gmail.com>
References:  <CAOtMX2hMu7qXqHt5rhi9CBNDRERpWshcF%2BR9N_VQOrYvYFERQg@mail.gmail.com> <CANCZdfo7W-eFoQ6X4y0rY=k5in6T7Ledjhes39ToO9ZXLXyVbw@mail.gmail.com> <CAOtMX2jmppMTwnK_g4OiWSnGu=Vwxm1FMa-_izdNPTYaJPyiDA@mail.gmail.com> <CANCZdfqfcbObUUonrEdNViJ-5xvU%2BFeYT%2BapHwmTpiHmfBVaXg@mail.gmail.com>

On Wed, Dec 1, 2021 at 1:37 PM Warner Losh <imp@bsdimp.com> wrote:
>
>
>
> On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <asomers@freebsd.org> wrote:
>>
>> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> >
>> >
>> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <asomers@freebsd.org> wrote:
>> >>
>> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
>> >> triggered by HDD timeouts.  The timeouts are probably caused by
>> >> genuine hardware faults, but they didn't lead to deadlocks in
>> >> 12.2-RELEASE or 13.0-RELEASE.  Unfortunately I don't have much
>> >> additional information.  ZFS's stack traces aren't very informative,
>> >> and dmesg doesn't show anything besides the usual information about
>> >> the disk timeout.  I don't see anything obviously related in the
>> >> commit history for that time range, either.
>> >>
>> >> Has anybody else observed this phenomenon?  Or does anybody have a
>> >> good way to deliberately inject timeouts?  CAM makes it easy enough to
>> >> inject an error, but not a timeout.  If it did, then I could bisect
>> >> the problem.  As it is I can only reproduce it on production servers.
>> >
>> >
>> > What SIM? Timeouts are tricky because they have many sources, some of which are nonlocal...
>> >
>> > Warner
>>
>> mpr(4)
>
>
> Is this just a single drive that's acting up, or is the controller initialized as part of the error recovery?

I'm not doing anything fancy with mprutil or sas3flash, if that's what
you're asking.

> If a single drive,
> are there multiple timeouts that happen at the same time such that we time out a request while we're waiting for
> the abort command we send to the firmware to be acknowledged?

I don't know.

> Would you be able to run a kgdb script to see
> if you're hitting a situation that I fixed in mpr that would cause I/O to never complete in this rather odd circumstance?
> If you can, and if it is, then there's a change I can MFC :).

Possibly.  When would I run this kgdb script?  Before ZFS locks up,
after, or while the problematic timeout happens?

>
> Warner


