From nobody Sat Dec  4 00:28:03 2021
X-Original-To: freebsd-stable@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 03A7818D205D
	for <freebsd-stable@mlmmj.nyi.freebsd.org>; Sat,  4 Dec 2021 00:28:22 +0000 (UTC)
	(envelope-from wlosh@bsdimp.com)
Received: from mail-vk1-xa2d.google.com (mail-vk1-xa2d.google.com [IPv6:2607:f8b0:4864:20::a2d])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4J5Vsd6PcDz4rFs
	for <freebsd-stable@freebsd.org>; Sat,  4 Dec 2021 00:28:21 +0000 (UTC)
	(envelope-from wlosh@bsdimp.com)
Received: by mail-vk1-xa2d.google.com with SMTP id e27so2948624vkd.4
        for <freebsd-stable@freebsd.org>; Fri, 03 Dec 2021 16:28:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=bsdimp-com.20210112.gappssmtp.com; s=20210112;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=k/lkbrMjeUAzojxbYVWvs9t7CSmgUBOw5UOgXJDGwF8=;
        b=1oE6Q/p3cPJ9iTVZ8OzqVG5sYaBFTS8TueKk1OAnW3l00O9nWTXDn1KN/CyBxGeovJ
         YvM47u+dW3Gx0wPvg9zDMj9swuVgDmVrd91Rq645yrsgIT2A8lq0Ef1XPDr6nv+dEUQ5
         mHC57YU9lEpiWcKZScxtA/dCLsUP8eSpeNl5MWR4E3gLhTcCOoJRCdfs75o4gjZzz/1E
         DNoI0OS3GGDtFaBQgWRUn+LKCIzl+3VOWm3yE8DqCN05Rvs7wUUmgcb4OfWghRVqIkD8
         N9dAHBEJb2j9W20LmtfmqAVdsWTeqXT7ovF48r/7YNWtlHW0YEbVrPDOOcfKZ9gBV2sz
         2huQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=k/lkbrMjeUAzojxbYVWvs9t7CSmgUBOw5UOgXJDGwF8=;
        b=lXkwGyIHENUC5fhSi5IVAm9jUn0oQ/BJZ7x9raT93d/DWwtK8m6+TtUy1bIiBc1W9D
         40TZgMWqyWIHKfHF5AQjsON52XApEXWXoKrwLtsTkR160xqt6CDKE+3TeQfBrBWReoZx
         UCVMkwOEt8sX/FGYIpgVUlCZdsw+eL/MhJ9EaGQJ7KxOEEsJYyLknHDOjX1doaUAPiMo
         PKMbE3mMvvsjejWQ1NMv4pkwPpNb5+VN20gi8OBvwzB1MC4qN7yglpJRtNs7zDryK9Zi
         6AEc8f5eEhxy5CbKOSpXkHCbXkk33pBrIVWcQ4FXm5jumgkunmrGthr48GQ7m4Yo+KRR
         fXEA==
X-Gm-Message-State: AOAM531/vbLtKp4vsNAldRfmHtD/kRSpnIEoBMD3SALeif2F5JWfPCiI
	Iir3BJmzO+gKofs+aODmZs0xf5KyKyjANqjtgtsSzHKxv04hqVRf
X-Google-Smtp-Source: ABdhPJzFregj6US8TmFe2MumxXm8n2+fwcTAPgpZS/TDjJkfXRz9WtcN1fYDSoTORE3aPckoqxdn2cJ0rkvQdMqeMIc=
X-Received: by 2002:a1f:c9c2:: with SMTP id z185mr27953595vkf.26.1638577694824;
 Fri, 03 Dec 2021 16:28:14 -0800 (PST)
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-stable
List-Help: <mailto:stable+help@freebsd.org>
List-Post: <mailto:stable@freebsd.org>
List-Subscribe: <mailto:stable+subscribe@freebsd.org>
List-Unsubscribe: <mailto:stable+unsubscribe@freebsd.org>
Sender: owner-freebsd-stable@freebsd.org
X-BeenThere: freebsd-stable@freebsd.org
MIME-Version: 1.0
References: <CAOtMX2hMu7qXqHt5rhi9CBNDRERpWshcF+R9N_VQOrYvYFERQg@mail.gmail.com>
 <CANCZdfo7W-eFoQ6X4y0rY=k5in6T7Ledjhes39ToO9ZXLXyVbw@mail.gmail.com>
 <CAOtMX2jmppMTwnK_g4OiWSnGu=Vwxm1FMa-_izdNPTYaJPyiDA@mail.gmail.com>
 <CANCZdfqfcbObUUonrEdNViJ-5xvU+FeYT+apHwmTpiHmfBVaXg@mail.gmail.com>
 <CAOtMX2gnEgGn-h16UJHhrS79ypH357=r2R0DaYAa1J-TOGAKCQ@mail.gmail.com>
 <CANCZdfr_s_10zePSWoaVyi7ExcG9yqK=v5oDjLnVCVZ05hDJAw@mail.gmail.com>
 <CAOtMX2hGODt0hiwzOrThOQ=Sm1V+9my27pWwzp1L-hz3XWAVeQ@mail.gmail.com>
 <CANCZdfrruAVxMvuN60b2a_70zD0Q5jNh31BKqVt+xX_eo4=nig@mail.gmail.com>
 <CAOtMX2j5kGy3Ef9dmJbhMhi4sYJ+SfYmBk6O4+VH-ZrTDdq0uw@mail.gmail.com>
 <CANCZdfqy=hLkBYLK8rJy2JOGvM0CwqMVpFYstchMp2JW49J2GQ@mail.gmail.com>
 <CAOtMX2js0dtvpZ9SJM6o3VfAr9-swWBt9725V2pJkZZrxUMh3Q@mail.gmail.com>
 <CANCZdfoVQvkM62WuUB4btjg14Vau0rsoaauEGPP_Qitqo8U_Fw@mail.gmail.com> <CAOtMX2gi1ir7QauGu3H+dJZdPcj91SbypRQ53npwP1Xxf6Z_DA@mail.gmail.com>
In-Reply-To: <CAOtMX2gi1ir7QauGu3H+dJZdPcj91SbypRQ53npwP1Xxf6Z_DA@mail.gmail.com>
From: Warner Losh <imp@bsdimp.com>
Date: Fri, 3 Dec 2021 17:28:03 -0700
Message-ID: <CANCZdfr7TA7H1pb3CTe5P0qkA=uXCNFsSD7a1docWn+8+N=ksg@mail.gmail.com>
Subject: Re: ZFS deadlocks triggered by HDD timeouts
To: Alan Somers <asomers@freebsd.org>
Cc: FreeBSD <freebsd-stable@freebsd.org>
Content-Type: multipart/alternative; boundary="0000000000006fef3905d2471928"
X-Rspamd-Queue-Id: 4J5Vsd6PcDz4rFs
X-Spamd-Bar: ----
Authentication-Results: mx1.freebsd.org;
	none
X-Spamd-Result: default: False [-4.00 / 15.00];
	 REPLY(-4.00)[]
X-ThisMailContainsUnwantedMimeParts: Y

--0000000000006fef3905d2471928
Content-Type: text/plain; charset="UTF-8"

Hey Alan,

On Fri, Dec 3, 2021 at 5:26 PM Alan Somers <asomers@freebsd.org> wrote:

> On Fri, Dec 3, 2021 at 5:19 PM Warner Losh <imp@bsdimp.com> wrote:
> >
> > Hey Alan,
> >
> > On Fri, Dec 3, 2021 at 8:38 AM Alan Somers <asomers@freebsd.org> wrote:
> >>
> >> On Wed, Dec 1, 2021 at 3:48 PM Warner Losh <imp@bsdimp.com> wrote:
> >> >
> >> >
> >> >
> >> > On Wed, Dec 1, 2021, 3:36 PM Alan Somers <asomers@freebsd.org> wrote:
> >> >>
> >> >> On Wed, Dec 1, 2021 at 2:46 PM Warner Losh <imp@bsdimp.com> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Dec 1, 2021, 2:36 PM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >>
> >> >> >> On Wed, Dec 1, 2021 at 1:56 PM Warner Losh <imp@bsdimp.com> wrote:
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > On Wed, Dec 1, 2021 at 1:47 PM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >> >>
> >> >> >> >> On Wed, Dec 1, 2021 at 1:37 PM Warner Losh <imp@bsdimp.com> wrote:
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >> >> >>
> >> >> >> >> >> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <imp@bsdimp.com> wrote:
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
> >> >> >> >> >> >> triggered by HDD timeouts.  The timeouts are probably caused by
> >> >> >> >> >> >> genuine hardware faults, but they didn't lead to deadlocks in
> >> >> >> >> >> >> 12.2-RELEASE or 13.0-RELEASE.  Unfortunately I don't have much
> >> >> >> >> >> >> additional information.  ZFS's stack traces aren't very informative,
> >> >> >> >> >> >> and dmesg doesn't show anything besides the usual information about
> >> >> >> >> >> >> the disk timeout.  I don't see anything obviously related in the
> >> >> >> >> >> >> commit history for that time range, either.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Has anybody else observed this phenomenon?  Or does anybody have a
> >> >> >> >> >> >> good way to deliberately inject timeouts?  CAM makes it easy enough to
> >> >> >> >> >> >> inject an error, but not a timeout.  If it did, then I could bisect
> >> >> >> >> >> >> the problem.  As it is I can only reproduce it on production servers.
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > What SIM? Timeouts are tricky because they have many sources, some of which are nonlocal...
> >> >> >> >> >> >
> >> >> >> >> >> > Warner
> >> >> >> >> >>
> >> >> >> >> >> mpr(4)
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > Is this just a single drive that's acting up, or is the controller initialized as part of the error recovery?
> >> >> >> >>
> >> >> >> >> I'm not doing anything fancy with mprutil or sas3flash, if that's what
> >> >> >> >> you're asking.
> >> >> >> >
> >> >> >> >
> >> >> >> > No. I'm asking if you've enabled debugging on the recovery messages and see that we enter any kind of
> >> >> >> > controller reset when the timeouts occur.
> >> >> >>
> >> >> >> No.  My CAM setup is the default except that I enabled CAM_IO_STATS
> >> >> >> and changed the following two sysctls:
> >> >> >> kern.cam.da.retry_count=2
> >> >> >> kern.cam.da.default_timeout=10
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > If a single drive,
> >> >> >> >> > are there multiple timeouts that happen at the same time such that we timeout a request while we're waiting for
> >> >> >> >> > the abort command we send to the firmware to be acknowledged?
> >> >> >> >>
> >> >> >> >> I don't know.
> >> >> >> >
> >> >> >> >
> >> >> >> > OK.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > Would you be able to run a kgdb script to see
> >> >> >> >> > if you're hitting a situation that I fixed in mpr that would cause I/O to never complete in this rather odd circumstance?
> >> >> >> >> > If you can, and if it is, then there's a change I can MFC :).
> >> >> >> >>
> >> >> >> >> Possibly.  When would I run this kgdb script?  Before ZFS locks up,
> >> >> >> >> after, or while the problematic timeout happens?
> >> >> >> >
> >> >> >> >
> >> >> >> > After the timeouts. I've been doing 'kgdb' followed by 'source mpr-hang.gdb' to run this.
> >> >> >> >
> >> >> >> > What you are looking for is anything with a qfrozen_cnt > 0. The script is imperfect and racy
> >> >> >> > with normal operations (but not in a bad way), so you may need to run it a couple of times
> >> >> >> > to get consistent data. On my systems, there'd be one or two devices with a frozen count > 1
> >> >> >> > and no I/O happened on those drives and processes hung. That might not be any different than
> >> >> >> > a deadlock :)
> >> >> >> >
> >> >> >> > Warner
> >> >> >> >
> >> >> >> > P.S. here's the mpr-hang.gdb script. Not sure if I can make an attachment survive the mailing lists :)
> >> >> >>
> >> >> >> Thanks, I'll try that.  If this is the problem, do you have any idea
> >> >> >> why it wouldn't happen on 12.2-RELEASE (I haven't seen it on
> >> >> >> 13.0-RELEASE, but maybe I just don't have enough runtime on that
> >> >> >> version)?
> >> >> >
> >> >> >
> >> >> > 9781c28c6d63 was merged to stable/13 as a996b55ab34c on Sept 2nd. I fixed a bug
> >> >> > with that version in current as a8837c77efd0, but haven't merged it. I kinda expect that
> >> >> > this might be the cause of the problem. But in Netflix's fleet we've seen this maybe a
> >> >> > couple of times a week over many thousands of machines, so I've been a little cautious
> >> >> > in merging it to make sure that it's really fixed. So far, the jury is out.
> >> >> >
> >> >> > Warner
> >> >>
> >> >> Well, I'm experiencing this error much more frequently than you then.
> >> >> I've seen it on about 10% of similarly-configured servers and they've
> >> >> only been running that release for 1 week.
> >> >
> >> >
> >> > You can run my script soon then to see if it's the same thing.
> >> >
> >> > Warner
> >> >
> >> >> -Alan
> >>
> >> That confirms it.  I hit the deadlock again, and qfrozen_cnt was
> >> between 1 and 3 for four devices: two da devices (we use multipath)
> >> and their accompanying pass devices.  So I should try merging
> >> a8837c77efd0 next?
> >
> >
> > Yes. I'd planned on merging it this weekend, but if you wanted a jump
> > on me, that's the next step.
> >
> > Warner
>
> It merged without conflict, and I'm testing it now.  But without a way
> to inject timeouts I can't tell whether it's working.
>

You can enable, at runtime, the 'recovery' messages from the mpr driver.
From those you'll know whether you are hitting the case where a request
times out while an earlier timeout is still being handled.

dev.mpr.0.debug_level=info,fault,recovery

is what I think I use.
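
As a rough sketch, that setting could be applied at runtime with sysctl(8)
along the following lines. It assumes the controller is unit 0, that the
level names are the ones quoted above, and that your mpr(4) accepts a
comma-separated list of names for debug_level. The mpr_debug_oid and
mpr_set_debug helpers and the DRY_RUN switch are made up for illustration.
By default the script only prints the command, so it can be checked before
anything is changed on a production box.

```sh
#!/bin/sh
# Sketch only: unit number, level list, and helper names are assumptions.

# Hypothetical helper: print the debug_level OID for an mpr unit (default 0).
mpr_debug_oid() {
    echo "dev.mpr.${1:-0}.debug_level"
}

# Hypothetical helper: print the sysctl command that sets the debug levels
# for a unit, or run it (as root) when DRY_RUN=0 is set in the environment.
mpr_set_debug() {
    cmd="sysctl $(mpr_debug_oid "$1")=$2"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Levels as quoted in this message; adjust the unit number to your system.
mpr_set_debug 0 info,fault,recovery
```

It is probably worth noting the current value first (sysctl
dev.mpr.0.debug_level) so it can be restored once testing is done.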

Warner

--0000000000006fef3905d2471928--