Date:      Wed, 28 Oct 2015 13:58:21 +0100
From:      Fabian Keil <freebsd-listen@fabiankeil.de>
To:        freebsd-current@freebsd.org
Cc:        "Steven Hartland" <killing@multiplay.co.uk>, Xin Li <delphij@delphij.net>, "Alexander Motin" <mav@ixsystems.com>
Subject:   Re: ZFS-related panic: "possible" spa->spa_errlog_lock deadlock
Message-ID:  <20151028135821.0d375ec5@fabiankeil.de>
In-Reply-To: <540C8039.7010309@delphij.net>
References:  <492dbacb.5942cc9b@fabiankeil.de> <540C66AC.8070809@delphij.net> <4fa875ba.3cc970d7@fabiankeil.de> <540C8039.7010309@delphij.net>

Xin Li <delphij@delphij.net> wrote:

> On 9/7/14 11:23 PM, Fabian Keil wrote:
> > Xin Li <delphij@delphij.net> wrote:
> >
> >> On 9/7/14 9:02 PM, Fabian Keil wrote:
> >>> Using a kernel built from FreeBSD 11.0-CURRENT r271182 I got
> >>> the following panic yesterday:
> >>>
> >>> [...] Unread portion of the kernel message buffer: [6880]
> >>> panic: deadlkres: possible deadlock detected for
> >>> 0xfffff80015289490, blocked for 1800503 ticks
> >>
> >> Any chance to get all backtraces (e.g. thread apply all bt full
> >> 16)? I think a different thread that held the lock have been
> >> blocked, probably related to your disconnected vdev.
> >
> > Output of "thread apply all bt full 16" is available at:
> > http://www.fabiankeil.de/tmp/freebsd/kgdb-output-spa_errlog_lock-deadlock.txt
> >
> > A lot of the backtraces prematurely end with "Cannot access memory
> > at address", therefore I also added "thread apply all bt" output.
> >
> > Apparently there are at least two additional threads blocking below
> > spa_get_stats():
[...]
> Yes, thread 1182 owned the lock and is waiting for the zio be done.
> Other threads that wanted the lock would have to wait.
>
> I don't have much clue why the system entered this state, however, as
> the operations should have errored out (the GELI device is gone on
> 21:44:56 based on your log, which suggests all references were closed)
> instead of waiting.

Thanks for the responses.

I finally found the time to analyse the problem. It seems to be
that spa_sync() requires at least one writeable vdev to complete,
but holds the lock(s) required to remove or bring back vdevs.

Letting spa_sync() drop the lock and wait for at least one vdev
to become writeable again seems to make the problem unreproducible
for me, but it probably merely shrinks the race window and thus is
not a complete solution.
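
As a rough userland illustration of the "drop the lock and wait" idea
(not the code from the patch, and not ZFS code), the pattern can be
modelled with a condition variable. All names below (Pool, config_lock,
writable_vdevs, attach_vdev, sync) are hypothetical stand-ins. The toy
does not reproduce the remaining race window mentioned above, it only
shows why waiting with the lock released lets a vdev come back.

```python
import threading


class Pool:
    """Toy model of a pool; none of these names exist in ZFS."""

    def __init__(self):
        # Hypothetical lock that both sync() and attach_vdev() need.
        self.config_lock = threading.Lock()
        # Condition bound to that lock; waiting on it releases the lock.
        self.vdev_cond = threading.Condition(self.config_lock)
        self.writable_vdevs = 0
        self.synced_txgs = []

    def attach_vdev(self):
        """Bring a vdev back; requires the same lock sync() takes."""
        with self.vdev_cond:
            self.writable_vdevs += 1
            self.vdev_cond.notify_all()

    def sync(self, txg, timeout=5.0):
        """Write out txg once at least one vdev is writable.

        wait_for() releases config_lock while blocked, so attach_vdev()
        can make progress instead of queueing behind the syncing thread.
        Returns False if no vdev became writable within the timeout.
        """
        with self.vdev_cond:
            ready = self.vdev_cond.wait_for(
                lambda: self.writable_vdevs > 0, timeout=timeout
            )
            if not ready:
                return False
            self.synced_txgs.append(txg)
            return True


def demo():
    pool = Pool()
    syncer = threading.Thread(target=pool.sync, args=(1,))
    syncer.start()
    # If sync() blocked while still holding config_lock, this call
    # could never acquire the lock and both threads would hang.
    pool.attach_vdev()
    syncer.join(timeout=5.0)
    return pool.synced_txgs
```

Calling demo() starts a syncing thread against a pool with no writable
vdev and then attaches one from the main thread. The timeout in sync()
is only there so the toy cannot hang forever.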

For details see:
https://www.fabiankeil.de/sourcecode/electrobsd/ZFS-Optionally-let-spa_sync-wait-for-writable-vdev.diff
(Experimental, only lightly tested)

Fabian



