From: Warner Losh
Date: Wed, 1 Dec 2021 13:36:58 -0700
Subject: Re: ZFS deadlocks triggered by HDD timeouts
To: Alan Somers
Cc: FreeBSD
List-Id: Production branch of FreeBSD source code
List-Archive: https://lists.freebsd.org/archives/freebsd-stable

On Wed, Dec 1, 2021 at 1:28 PM Alan Somers wrote:

> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh wrote:
> >
> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers wrote:
> >>
> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
> >> triggered by HDD timeouts. The timeouts are probably caused by
> >> genuine hardware faults, but they didn't lead to deadlocks in
> >> 12.2-RELEASE or 13.0-RELEASE. Unfortunately I don't have much
> >> additional information. ZFS's stack traces aren't very informative,
> >> and dmesg doesn't show anything besides the usual information about
> >> the disk timeout. I don't see anything obviously related in the
> >> commit history for that time range, either.
> >>
> >> Has anybody else observed this phenomenon? Or does anybody have a
> >> good way to deliberately inject timeouts? CAM makes it easy enough to
> >> inject an error, but not a timeout. If it did, then I could bisect
> >> the problem. As it is I can only reproduce it on production servers.
> >
> > What SIM? Timeouts are tricky because they have many sources, some of
> > which are nonlocal...
> >
> > Warner
>
> mpr(4)
>

Is this just a single drive that's acting up, or is the controller
initialized as part of the error recovery? If a single drive, are there
multiple timeouts that happen at the same time such that we timeout a
request while we're waiting for the abort command we send to the firmware
to be acknowledged?

Would you be able to run a kgdb script to see if you're hitting a
situation that I fixed in mpr that would cause I/O to never complete in
this rather odd circumstance? If you can, and if it is, then there's a
change I can MFC :).

Warner