Date:      Fri, 3 Dec 2021 17:25:48 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Warner Losh <imp@bsdimp.com>
Cc:        FreeBSD <freebsd-stable@freebsd.org>
Subject:   Re: ZFS deadlocks triggered by HDD timeouts
Message-ID:  <CAOtMX2gi1ir7QauGu3H%2BdJZdPcj91SbypRQ53npwP1Xxf6Z_DA@mail.gmail.com>
In-Reply-To: <CANCZdfoVQvkM62WuUB4btjg14Vau0rsoaauEGPP_Qitqo8U_Fw@mail.gmail.com>
References:  <CAOtMX2hMu7qXqHt5rhi9CBNDRERpWshcF%2BR9N_VQOrYvYFERQg@mail.gmail.com> <CANCZdfo7W-eFoQ6X4y0rY=k5in6T7Ledjhes39ToO9ZXLXyVbw@mail.gmail.com> <CAOtMX2jmppMTwnK_g4OiWSnGu=Vwxm1FMa-_izdNPTYaJPyiDA@mail.gmail.com> <CANCZdfqfcbObUUonrEdNViJ-5xvU%2BFeYT%2BapHwmTpiHmfBVaXg@mail.gmail.com> <CAOtMX2gnEgGn-h16UJHhrS79ypH357=r2R0DaYAa1J-TOGAKCQ@mail.gmail.com> <CANCZdfr_s_10zePSWoaVyi7ExcG9yqK=v5oDjLnVCVZ05hDJAw@mail.gmail.com> <CAOtMX2hGODt0hiwzOrThOQ=Sm1V%2B9my27pWwzp1L-hz3XWAVeQ@mail.gmail.com> <CANCZdfrruAVxMvuN60b2a_70zD0Q5jNh31BKqVt%2BxX_eo4=nig@mail.gmail.com> <CAOtMX2j5kGy3Ef9dmJbhMhi4sYJ%2BSfYmBk6O4%2BVH-ZrTDdq0uw@mail.gmail.com> <CANCZdfqy=hLkBYLK8rJy2JOGvM0CwqMVpFYstchMp2JW49J2GQ@mail.gmail.com> <CAOtMX2js0dtvpZ9SJM6o3VfAr9-swWBt9725V2pJkZZrxUMh3Q@mail.gmail.com> <CANCZdfoVQvkM62WuUB4btjg14Vau0rsoaauEGPP_Qitqo8U_Fw@mail.gmail.com>

On Fri, Dec 3, 2021 at 5:19 PM Warner Losh <imp@bsdimp.com> wrote:
>
> Hey Alan,
>
> On Fri, Dec 3, 2021 at 8:38 AM Alan Somers <asomers@freebsd.org> wrote:
>>
>> On Wed, Dec 1, 2021 at 3:48 PM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> >
>> >
>> > On Wed, Dec 1, 2021, 3:36 PM Alan Somers <asomers@freebsd.org> wrote:
>> >>
>> >> On Wed, Dec 1, 2021 at 2:46 PM Warner Losh <imp@bsdimp.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Dec 1, 2021, 2:36 PM Alan Somers <asomers@freebsd.org> wrote:
>> >> >>
>> >> >> On Wed, Dec 1, 2021 at 1:56 PM Warner Losh <imp@bsdimp.com> wrote:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Dec 1, 2021 at 1:47 PM Alan Somers <asomers@freebsd.org> wrote:
>> >> >> >>
>> >> >> >> On Wed, Dec 1, 2021 at 1:37 PM Warner Losh <imp@bsdimp.com> wrote:
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <asomers@freebsd.org> wrote:
>> >> >> >> >>
>> >> >> >> >> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <imp@bsdimp.com> wrote:
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <asomers@freebsd.org> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
>> >> >> >> >> >> triggered by HDD timeouts.  The timeouts are probably caused by
>> >> >> >> >> >> genuine hardware faults, but they didn't lead to deadlocks in
>> >> >> >> >> >> 12.2-RELEASE or 13.0-RELEASE.  Unfortunately I don't have much
>> >> >> >> >> >> additional information.  ZFS's stack traces aren't very informative,
>> >> >> >> >> >> and dmesg doesn't show anything besides the usual information about
>> >> >> >> >> >> the disk timeout.  I don't see anything obviously related in the
>> >> >> >> >> >> commit history for that time range, either.
>> >> >> >> >> >>
>> >> >> >> >> >> Has anybody else observed this phenomenon?  Or does anybody have a
>> >> >> >> >> >> good way to deliberately inject timeouts?  CAM makes it easy enough to
>> >> >> >> >> >> inject an error, but not a timeout.  If it did, then I could bisect
>> >> >> >> >> >> the problem.  As it is I can only reproduce it on production servers.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > What SIM? Timeouts are tricky because they have many sources, some of which are nonlocal...
>> >> >> >> >> >
>> >> >> >> >> > Warner
>> >> >> >> >>
>> >> >> >> >> mpr(4)
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Is this just a single drive that's acting up, or is the controller initialized as part of the error recovery?
>> >> >> >>
>> >> >> >> I'm not doing anything fancy with mprutil or sas3flash, if that's what
>> >> >> >> you're asking.
>> >> >> >
>> >> >> >
>> >> >> > No. I'm asking if you've enabled debugging on the recovery messages and see that we enter any kind of
>> >> >> > controller reset when the timeouts occur.
>> >> >>
>> >> >> No.  My CAM setup is the default except that I enabled CAM_IO_STATS
>> >> >> and changed the following two sysctls:
>> >> >> kern.cam.da.retry_count=2
>> >> >> kern.cam.da.default_timeout=10
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >>
>> >> >> >> > If a single drive,
>> >> >> >> > are there multiple timeouts that happen at the same time such that we time out a request while we're waiting for
>> >> >> >> > the abort command we send to the firmware to be acknowledged?
>> >> >> >>
>> >> >> >> I don't know.
>> >> >> >
>> >> >> >
>> >> >> > OK.
>> >> >> >
>> >> >> >>
>> >> >> >> > Would you be able to run a kgdb script to see
>> >> >> >> > if you're hitting a situation that I fixed in mpr that would cause I/O to never complete in this rather odd circumstance?
>> >> >> >> > If you can, and if it is, then there's a change I can MFC :).
>> >> >> >>
>> >> >> >> Possibly.  When would I run this kgdb script?  Before ZFS locks up,
>> >> >> >> after, or while the problematic timeout happens?
>> >> >> >
>> >> >> >
>> >> >> > After the timeouts. I've been doing 'kgdb' followed by 'source mpr-hang.gdb' to run this.
>> >> >> >
>> >> >> > What you are looking for is anything with a qfrozen_cnt > 0. The script is imperfect and racy
>> >> >> > with normal operations (but not in a bad way), so you may need to run it a couple of times
>> >> >> > to get consistent data. On my systems, there'd be one or two devices with a frozen count > 1
>> >> >> > and no I/O happened on those drives and processes hung. That might not be any different than
>> >> >> > a deadlock :)
>> >> >> >
>> >> >> > Warner
>> >> >> >
>> >> >> > P.S. here's the mpr-hang.gdb script. Not sure if I can make an attachment survive the mailing lists :)
>> >> >>
>> >> >> Thanks, I'll try that.  If this is the problem, do you have any idea
>> >> >> why it wouldn't happen on 12.2-RELEASE?  (I haven't seen it on
>> >> >> 13.0-RELEASE, but maybe I just don't have enough runtime on that
>> >> >> version.)
>> >> >
>> >> >
>> >> > 9781c28c6d63 was merged to stable/13 as a996b55ab34c on Sept 2nd. I fixed a bug
>> >> > with that version in current as a8837c77efd0, but haven't merged it. I kinda expect that
>> >> > this might be the cause of the problem. But in Netflix's fleet we've seen this maybe a
>> >> > couple of times a week over many thousands of machines, so I've been a little cautious
>> >> > in merging it to make sure that it's really fixed. So far, the jury is out.
>> >> >
>> >> > Warner
>> >>
>> >> Well, I'm experiencing this error much more frequently than you are, then.
>> >> I've seen it on about 10% of similarly-configured servers and they've
>> >> only been running that release for 1 week.
>> >
>> >
>> > You can run my script soon then to see if it's the same thing.
>> >
>> > Warner
>> >
>> >> -Alan
>>
>> That confirms it.  I hit the deadlock again, and qfrozen_cnt was
>> between 1 and 3 for four devices: two da devices (we use multipath)
>> and their accompanying pass devices.  So I should try merging
>> a8837c77efd0 next?
>
>
> Yes. I'd planned on merging it this weekend, but if you wanted a jump
> on me, that's the next step.
>
> Warner

It merged without conflict, and I'm testing it now.  But without a way
to inject timeouts I can't tell whether it's working.
-Alan
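
For reference, the check described earlier in the thread (look for any
device with qfrozen_cnt > 0, and repeat the run because the script is racy)
could be sketched roughly as below.  This is only an illustration: the
input format (one "device count" pair per line) and all function names are
assumptions, not the actual output of mpr-hang.gdb, which is not reproduced
in this message.

```python
# Hypothetical helper: the "device count" line format is an assumption,
# not the real output of mpr-hang.gdb.

def parse_counts(text):
    """Parse lines of the form 'device count' into a dict of name -> int."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unrelated lines
        name, value = parts
        try:
            counts[name] = int(value)
        except ValueError:
            continue  # skip lines whose second field is not a number
    return counts


def frozen_devices(counts):
    """Return the sorted names of devices whose frozen count is > 0."""
    return sorted(name for name, cnt in counts.items() if cnt > 0)


def consistently_frozen(samples):
    """Return devices that are frozen in every sample.

    Because a single run can race with normal I/O, only devices that
    show a frozen count > 0 across all runs are reported.
    """
    frozen_sets = [set(frozen_devices(sample)) for sample in samples]
    if not frozen_sets:
        return []
    return sorted(set.intersection(*frozen_sets))
```

Feeding it the parsed results of several runs and keeping only the devices
that show up every time should filter out counts that were momentarily
nonzero during normal operation.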


