From nobody Wed May 22 20:45:33 2024
X-Original-To: current@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
    by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4Vl3HK4nB4z5Ky0x
    for ; Wed, 22 May 2024 20:46:05 +0000 (UTC)
    (envelope-from Alexander@Leidinger.net)
Received: from mailgate.Leidinger.net (mailgate.leidinger.net [IPv6:2a00:1828:2000:313::1:5])
    (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
    key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
    client-signature ECDSA (P-256) client-digest SHA256)
    (Client CN "mailgate.leidinger.net", Issuer "R3" (verified OK))
    by mx1.freebsd.org (Postfix) with ESMTPS id 4Vl3HK1lJQz49V4;
    Wed, 22 May 2024 20:46:05 +0000 (UTC)
    (envelope-from Alexander@Leidinger.net)
Authentication-Results: mx1.freebsd.org; none
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
List-Help:
List-Post:
List-Subscribe:
List-Unsubscribe:
Sender: owner-freebsd-current@FreeBSD.org
MIME-Version: 1.0
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=leidinger.net; s=outgoing-alex; t=1716410757;
    h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
    to:to:cc:cc:mime-version:mime-version:content-type:content-type:
    in-reply-to:in-reply-to:references:references;
    bh=t9miOa2nfZWnWnYIOj8cpaa4O24McQLrq0UFsnAImEQ=;
    b=V7ameBnKHC8s+1mGlV2qhzZ6hNWHO67feoDE49fcrBGT1fj4WnJItKT2RuO1iLdIGBN1TV
    LYSrmN2ZMj9cqlyRrQf+8s2KeGS9kxh6g3MJfcwKHtTatGGuJYOsYWAvHxJnPP2KcJKctL
    8aY3EwdR5B4/I6ZhkPxT7FSq09gO5hES9cmGFeceYypDXUxunWU1H2kMCPNw/fd6/A/MKS
    N+X+VYenw9UDIc4sOCG8sKX0DY76HW+XonsC7VW1ZcWmPLOIEPVjs2VzfN/GDh3OVy5hDk
    4v8IbRl08IuPBQlSDCPL0OolXw5yDpM9VMzRqYqnUs3S/utmIDLunBOb/xl9Cw==
Date: Wed, 22 May 2024 22:45:33 +0200
From: Alexander Leidinger
To: Warner Losh
Cc: Current, Alexander Motin
Subject: Re: _mtx_lock_sleep: recursed on non-recursive mutex CAM device lock @ /..../sys/cam/nvme/nvme_da.c:469
In-Reply-To:
References: <730565997ef678bbfe87d7861075edae@Leidinger.net>
Message-ID:
Organization: No organization, this is a private message.
Content-Type: multipart/signed; protocol="application/pgp-signature";
    boundary="=_dc0f07b562c2d8155c3c32185a8aac04"; micalg=pgp-sha256
X-Spamd-Bar: ----
X-Rspamd-Pre-Result: action=no action; module=replies; Message is reply to one we originated
X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[]; ASN(0.00)[asn:34240, ipnet:2a00:1828::/32, country:DE]
X-Rspamd-Queue-Id: 4Vl3HK1lJQz49V4

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)

--=_dc0f07b562c2d8155c3c32185a8aac04
Content-Type: multipart/alternative;
    boundary="=_b3cd9b8a523236327cf34f2d64752b6f"

--=_b3cd9b8a523236327cf34f2d64752b6f
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII; format=flowed

On 2024-05-22 20:53, Warner Losh wrote:

> First order:
>
> Looks like we're trying to schedule a trim, but that fails due to a
> malloc issue. So then, since it's a
> malloc issue, we wind up trying to automatically reschedule this I/O,
> which recurses into the driver
> with a bad lock held and boop.
>
> Can you reproduce this?

So far I have had it only once; at least I have only one crashdump. I
had one more reboot/crash, but no dump. I also have a watchdog running
on this system, so I am not sure what caused the (unusual) reboot. A
poudriere build was running both times. Since the crashdump I haven't
run poudriere anymore.

> If so, can you test this patch?

I'll give it a try tomorrow anyway, and I will try to stress the system
again with poudriere.

The NVMe is a cache and also a log device for a zpool, so there is no
really deterministic way to trigger access to it.

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF

--=_b3cd9b8a523236327cf34f2d64752b6f
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

--=_b3cd9b8a523236327cf34f2d64752b6f--

--=_dc0f07b562c2d8155c3c32185a8aac04
Content-Type: application/pgp-signature;
    name=signature.asc
Content-Disposition: attachment;
    filename=signature.asc;
    size=833
Content-Description: OpenPGP digital signature

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEER9UlYXp1PSd08nWXEg2wmwP42IYFAmZOWYAACgkQEg2wmwP4
2IY4wA/9GhlwJBIeQvQaGnoH632EzWeZR8d3/tOkGxFYUoid9gSW4KDkxElE/i92
3RL2axaAKzhqnIMUo4R7qbJ5TImQqQn4Eh60NAPqm/IdkZoUcAno7Q8npzFiSyMc
MZV3t9cY+OnxLfA9FAR628Zx1k8u0nNz4VG5xT2QIa7FtRjxxpfw7VVJOIcNQsPV
kMmh4IJ4JbVc4N41VgGfOiLcihbh+6RVu4Yj0GaHSaeexV6knIe1g7jkCoo7vlwf
OtKEu8Ua67yiB/VfpFTHcxljFUmOXeadXqw5TVHTAQJXdtJ4No0NK4RbcmVGojEh
0viPxTr1CPlk7sFjFtEPtKTQhHyD5Mpeq8OGDTVKabkROK1iY/4YQeIr2NuzTLyr
hRygUld7Wnt2jhEiRbAXuIP3Mp5PRvNVSAZ+txNwMCHLveCVMtGuxvBDMYHI9lni
mEmQxo9yJb85A4J7MQNRBJkohfAR/4kxIrP83xJj5lhaNI/DgVe0JrBXXJH6Q5gI
+Muq4n327h/mYGy8SQfSebOnq4Mbsusi9eLurGs7gjAbPqf5SyGteFlJ2fIGQOmc
gRSYUpw0ZcyogKHBqHZ2tvhoRTbBVBjX1z4cEEhMYwaebNRkYofyL8oN5YT+arkf
dpJX7q6ls/ChwVAFNmn5n6qq0t2heuCLIXCH0YdpRbFU9n0tQG8=
=BqXp
-----END PGP SIGNATURE-----

--=_dc0f07b562c2d8155c3c32185a8aac04--