From: Alexander Leidinger
To: Warner Losh
Cc: Current, Alexander Motin
Date: Sat, 25 May 2024 10:34:32 +0200
Subject: Re: _mtx_lock_sleep: recursed on non-recursive mutex CAM device lock @ /..../sys/cam/nvme/nvme_da.c:469
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
Message-ID: <4e7ebc2b51104ade3ee2a86859c9fb9a@Leidinger.net>
References: <730565997ef678bbfe87d7861075edae@Leidinger.net>

On 2024-05-22 22:45, Alexander Leidinger wrote:
> On 2024-05-22 20:53, Warner Losh wrote:
>
>> First order:
>>
>> Looks like we're trying to schedule a trim, but that fails due to a
>> malloc issue. So then, since it's a malloc issue, we wind up trying
>> to automatically reschedule this I/O, which recurses into the driver
>> with a bad lock held and boop.
>>
>> Can you reproduce this?
>
> So far I have had it once. At least I have only one crashdump. I had
> one more reboot/crash, but no dump. I also have a watchdog running on
> this system, so I'm not sure what caused the (unusual) reboot. A
> poudriere build was running both times. Since the crashdump I haven't
> run poudriere anymore.
>
>> If so, can you test this patch?
>
> I'll give it a try tomorrow anyway, and I will try to stress the
> system again with poudriere.
>
> The NVMe device is a cache and also a log device for a zpool, so
> there is no really deterministic way to trigger access to it.

I've run a lot of poudriere builds together with other load (about 30
jails with mysql, postgresql, redis, webmail, postfix, imap, java stuff,
...) on this system since Thursday. So far no panic in the nvme part.

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF
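
For reference, the failure mode described in the quoted analysis (an allocation failure whose retry path re-enters the driver while the device lock is still held) can be modeled in userspace. The Python sketch below is an illustration only: `ToyDevice`, `schedule_trim`, `schedule_trim_deferred` and `RecursedOnNonRecursiveLock` are hypothetical names, not FreeBSD CAM or nda code, and the deferred variant is not the patch under test, just one generic way to avoid the re-entry.

```python
import threading


class RecursedOnNonRecursiveLock(RuntimeError):
    """Hypothetical stand-in for the kernel panic in the subject line."""


class ToyDevice:
    """Hypothetical model of a driver instance guarded by one non-recursive lock."""

    def __init__(self, fail_allocations):
        self._lock = threading.Lock()   # non-recursive lock
        self._owner = None              # thread id of the current holder
        self._fail_allocations = fail_allocations
        self.trace = []

    def _acquire(self):
        # Detect the same thread taking the lock twice instead of deadlocking.
        if self._owner == threading.get_ident():
            raise RecursedOnNonRecursiveLock("recursed on non-recursive lock")
        self._lock.acquire()
        self._owner = threading.get_ident()

    def _release(self):
        self._owner = None
        self._lock.release()

    def _alloc(self):
        # Simulated allocator: fails the first `fail_allocations` times.
        if self._fail_allocations > 0:
            self._fail_allocations -= 1
            return None
        return object()

    def schedule_trim(self):
        """Buggy shape: the retry re-enters while the lock is still held."""
        self._acquire()
        try:
            self.trace.append("schedule")
            if self._alloc() is None:
                self.trace.append("alloc failed, rescheduling")
                self.schedule_trim()    # re-entry with the lock held
        finally:
            self._release()

    def schedule_trim_deferred(self):
        """One generic alternative: drop the lock before retrying."""
        while True:
            self._acquire()
            try:
                self.trace.append("schedule")
                if self._alloc() is not None:
                    return "done"
                self.trace.append("alloc failed, retry after unlock")
            finally:
                self._release()
```

Calling `ToyDevice(fail_allocations=1).schedule_trim()` raises the stand-in exception on the retry, while `schedule_trim_deferred()` on a fresh instance completes because the lock is released before the second attempt.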
