To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-branches@FreeBSD.org
From: Mark Johnston
Subject: git: 1630af436a61 - stable/14 - vmm: Fix a deadlock between vm_smp_rendezvous() and vcpu_lock_all()
List-Id: Commits to the stable branches of the FreeBSD src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-branches
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: markj
X-Git-Repository: src
X-Git-Refname: refs/heads/stable/14
X-Git-Reftype: branch
X-Git-Commit: 1630af436a6145e90f7b1aac1a91fbbfe9b22f57
Auto-Submitted: auto-generated
Date: Tue, 02 Dec 2025 14:45:59 +0000
Message-Id: <692efba7.2f02f.33327be2@gitrepo.freebsd.org>

The branch stable/14 has been updated by markj:

URL: https://cgit.FreeBSD.org/src/commit/?id=1630af436a6145e90f7b1aac1a91fbbfe9b22f57

commit 1630af436a6145e90f7b1aac1a91fbbfe9b22f57
Author:     Mark Johnston
AuthorDate: 2025-10-17 13:09:38 +0000
Commit:     Mark Johnston
CommitDate: 2025-12-02 14:44:36 +0000

    vmm: Fix a deadlock between vm_smp_rendezvous() and vcpu_lock_all()

    vm_smp_rendezvous() invokes a callback on all vCPUs, blocking the
    initiator until all vCPUs have responded.  vcpu_lock_all() blocks each
    vCPU by waiting for it to go idle and setting the vCPU state to frozen.
    These two operations can deadlock on each other, particularly when
    booting a Windows guest, when vcpu_lock_all() blocks waiting for a
    rendezvous initiator, and the initiator is blocked waiting for the vCPU
    thread which called vcpu_lock_all() to invoke the rendezvous callback.

    Implement vcpu_lock_all() in a way that avoids deadlocks with
    vm_smp_rendezvous().  In particular, when traversing vCPUs, invoke the
    rendezvous callback on the vCPU's behalf to help the initiator finish.
    We can only safely do so when the vCPU is IDLE or we have already
    locked it, otherwise we may be racing with the target vCPU thread.
    Thus:
    - Use an exclusive lock to serialize vcpu_lock_all() callers, which
      lets us lock vCPUs out of order without fear of deadlock with
      parallel vcpu_lock_all() callers.
    - If a rendezvous is pending, lock all idle vCPUs and invoke the
      callback on their behalf.  If the vcpu_lock_all() caller is itself a
      vCPU thread, this will handle that thread.
    - Block waiting for all non-idle vCPUs to idle, or until one of them
      initiates a rendezvous, in which case we go back and invoke callbacks
      on behalf of already-locked vCPUs.

    Note that on !amd64 no changes are needed since there is no rendezvous
    mechanism, so there is a separate vcpu_set_state_all() for them based
    on the previous vcpu_lock_all().  These will be merged together once
    vcpu state handling is consolidated into sys/dev/vmm.

    Reviewed by:    corvink (previous version)
    MFC after:      3 weeks
    Differential Revision:  https://reviews.freebsd.org/D52968

    (cherry picked from commit f39768e52e513264da60add0ca2412bddda271ff)
---
 sys/amd64/include/vmm.h |   3 +-
 sys/amd64/vmm/vmm.c     | 177 ++++++++++++++++++++++++++++++++++++++++--------
 sys/amd64/vmm/vmm_dev.c |  30 ++------
 3 files changed, 157 insertions(+), 53 deletions(-)

diff --git a/sys/amd64/include/vmm.h b/sys/amd64/include/vmm.h
index 368c3a6cb95b..09f2ea7cb0e5 100644
--- a/sys/amd64/include/vmm.h
+++ b/sys/amd64/include/vmm.h
@@ -225,7 +225,7 @@ extern u_int vm_maxcpu;    /* maximum virtual cpus */
 int vm_create(const char *name, struct vm **retvm);
 struct vcpu *vm_alloc_vcpu(struct vm *vm, int vcpuid);
 void vm_disable_vcpu_creation(struct vm *vm);
-void vm_slock_vcpus(struct vm *vm);
+void vm_lock_vcpus(struct vm *vm);
 void vm_unlock_vcpus(struct vm *vm);
 void vm_destroy(struct vm *vm);
 int vm_reinit(struct vm *vm);
@@ -378,6 +378,7 @@ enum vcpu_state {
 };
 
 int vcpu_set_state(struct vcpu *vcpu, enum vcpu_state state, bool from_idle);
+int vcpu_set_state_all(struct vm *vm, enum vcpu_state state);
 enum vcpu_state vcpu_get_state(struct vcpu *vcpu, int *hostcpu);
 
 static int __inline
diff --git a/sys/amd64/vmm/vmm.c b/sys/amd64/vmm/vmm.c
index 677dcd05b9b2..787599bcfc97 100644
--- a/sys/amd64/vmm/vmm.c
+++ b/sys/amd64/vmm/vmm.c
@@ -557,9 +557,9 @@ vm_alloc_vcpu(struct vm *vm, int vcpuid)
 }
 
 void
-vm_slock_vcpus(struct vm *vm)
+vm_lock_vcpus(struct vm *vm)
 {
-        sx_slock(&vm->vcpus_init_lock);
+        sx_xlock(&vm->vcpus_init_lock);
 }
 
 void
@@ -1342,6 +1342,54 @@ save_guest_fpustate(struct vcpu *vcpu)
 
 static VMM_STAT(VCPU_IDLE_TICKS, "number of ticks vcpu was idle");
 
+/*
+ * Invoke the rendezvous function on the specified vcpu if applicable.  Return
+ * true if the rendezvous is finished, false otherwise.
+ */
+static bool
+vm_rendezvous(struct vcpu *vcpu)
+{
+        struct vm *vm = vcpu->vm;
+        int vcpuid;
+
+        mtx_assert(&vcpu->vm->rendezvous_mtx, MA_OWNED);
+        KASSERT(vcpu->vm->rendezvous_func != NULL,
+            ("vm_rendezvous: no rendezvous pending"));
+
+        /* 'rendezvous_req_cpus' must be a subset of 'active_cpus' */
+        CPU_AND(&vm->rendezvous_req_cpus, &vm->rendezvous_req_cpus,
+            &vm->active_cpus);
+
+        vcpuid = vcpu->vcpuid;
+        if (CPU_ISSET(vcpuid, &vm->rendezvous_req_cpus) &&
+            !CPU_ISSET(vcpuid, &vm->rendezvous_done_cpus)) {
+                VMM_CTR0(vcpu, "Calling rendezvous func");
+                (*vm->rendezvous_func)(vcpu, vm->rendezvous_arg);
+                CPU_SET(vcpuid, &vm->rendezvous_done_cpus);
+        }
+        if (CPU_CMP(&vm->rendezvous_req_cpus,
+            &vm->rendezvous_done_cpus) == 0) {
+                VMM_CTR0(vcpu, "Rendezvous completed");
+                CPU_ZERO(&vm->rendezvous_req_cpus);
+                vm->rendezvous_func = NULL;
+                wakeup(&vm->rendezvous_func);
+                return (true);
+        }
+        return (false);
+}
+
+static void
+vcpu_wait_idle(struct vcpu *vcpu)
+{
+        KASSERT(vcpu->state != VCPU_IDLE, ("vcpu already idle"));
+
+        vcpu->reqidle = 1;
+        vcpu_notify_event_locked(vcpu, false);
+        VMM_CTR1(vcpu, "vcpu state change from %s to "
+            "idle requested", vcpu_state2str(vcpu->state));
+        msleep_spin(&vcpu->state, &vcpu->mtx, "vmstat", hz);
+}
+
 static int
 vcpu_set_state_locked(struct vcpu *vcpu, enum vcpu_state newstate,
     bool from_idle)
@@ -1356,13 +1404,8 @@ vcpu_set_state_locked(struct vcpu *vcpu, enum vcpu_state newstate,
          * ioctl() operating on a vcpu at any point.
          */
         if (from_idle) {
-                while (vcpu->state != VCPU_IDLE) {
-                        vcpu->reqidle = 1;
-                        vcpu_notify_event_locked(vcpu, false);
-                        VMM_CTR1(vcpu, "vcpu state change from %s to "
-                            "idle requested", vcpu_state2str(vcpu->state));
-                        msleep_spin(&vcpu->state, &vcpu->mtx, "vmstat", hz);
-                }
+                while (vcpu->state != VCPU_IDLE)
+                        vcpu_wait_idle(vcpu);
         } else {
                 KASSERT(vcpu->state != VCPU_IDLE, ("invalid transition from "
                     "vcpu idle state"));
         }
@@ -1414,6 +1457,95 @@ vcpu_set_state_locked(struct vcpu *vcpu, enum vcpu_state newstate,
         return (0);
 }
 
+/*
+ * Try to lock all of the vCPUs in the VM while taking care to avoid deadlocks
+ * with vm_smp_rendezvous().
+ *
+ * The complexity here suggests that the rendezvous mechanism needs a rethink.
+ */
+int
+vcpu_set_state_all(struct vm *vm, enum vcpu_state newstate)
+{
+        cpuset_t locked;
+        struct vcpu *vcpu;
+        int error, i;
+        uint16_t maxcpus;
+
+        KASSERT(newstate != VCPU_IDLE,
+            ("vcpu_set_state_all: invalid target state %d", newstate));
+
+        error = 0;
+        CPU_ZERO(&locked);
+        maxcpus = vm->maxcpus;
+
+        mtx_lock(&vm->rendezvous_mtx);
+restart:
+        if (vm->rendezvous_func != NULL) {
+                /*
+                 * If we have a pending rendezvous, then the initiator may be
+                 * blocked waiting for other vCPUs to execute the callback.  The
+                 * current thread may be a vCPU thread so we must not block
+                 * waiting for the initiator, otherwise we get a deadlock.
+                 * Thus, execute the callback on behalf of any idle vCPUs.
+                 */
+                for (i = 0; i < maxcpus; i++) {
+                        vcpu = vm_vcpu(vm, i);
+                        if (vcpu == NULL)
+                                continue;
+                        vcpu_lock(vcpu);
+                        if (vcpu->state == VCPU_IDLE) {
+                                (void)vcpu_set_state_locked(vcpu, VCPU_FROZEN,
+                                    true);
+                                CPU_SET(i, &locked);
+                        }
+                        if (CPU_ISSET(i, &locked)) {
+                                /*
+                                 * We can safely execute the callback on this
+                                 * vCPU's behalf.
+                                 */
+                                vcpu_unlock(vcpu);
+                                (void)vm_rendezvous(vcpu);
+                                vcpu_lock(vcpu);
+                        }
+                        vcpu_unlock(vcpu);
+                }
+        }
+
+        /*
+         * Now wait for remaining vCPUs to become idle.  This may include the
+         * initiator of a rendezvous that is currently blocked on the rendezvous
+         * mutex.
+         */
+        CPU_FOREACH_ISCLR(i, &locked) {
+                if (i >= maxcpus)
+                        break;
+                vcpu = vm_vcpu(vm, i);
+                if (vcpu == NULL)
+                        continue;
+                vcpu_lock(vcpu);
+                while (vcpu->state != VCPU_IDLE) {
+                        mtx_unlock(&vm->rendezvous_mtx);
+                        vcpu_wait_idle(vcpu);
+                        vcpu_unlock(vcpu);
+                        mtx_lock(&vm->rendezvous_mtx);
+                        if (vm->rendezvous_func != NULL)
+                                goto restart;
+                        vcpu_lock(vcpu);
+                }
+                error = vcpu_set_state_locked(vcpu, newstate, true);
+                vcpu_unlock(vcpu);
+                if (error != 0) {
+                        /* Roll back state changes. */
+                        CPU_FOREACH_ISSET(i, &locked)
+                                (void)vcpu_set_state(vcpu, VCPU_IDLE, false);
+                        break;
+                }
+                CPU_SET(i, &locked);
+        }
+        mtx_unlock(&vm->rendezvous_mtx);
+        return (error);
+}
+
 static void
 vcpu_require_state(struct vcpu *vcpu, enum vcpu_state newstate)
 {
@@ -1435,36 +1567,23 @@ vcpu_require_state_locked(struct vcpu *vcpu, enum vcpu_state newstate)
 static int
 vm_handle_rendezvous(struct vcpu *vcpu)
 {
-        struct vm *vm = vcpu->vm;
+        struct vm *vm;
         struct thread *td;
-        int error, vcpuid;
 
-        error = 0;
-        vcpuid = vcpu->vcpuid;
         td = curthread;
+        vm = vcpu->vm;
+
         mtx_lock(&vm->rendezvous_mtx);
         while (vm->rendezvous_func != NULL) {
-                /* 'rendezvous_req_cpus' must be a subset of 'active_cpus' */
-                CPU_AND(&vm->rendezvous_req_cpus, &vm->rendezvous_req_cpus, &vm->active_cpus);
-
-                if (CPU_ISSET(vcpuid, &vm->rendezvous_req_cpus) &&
-                    !CPU_ISSET(vcpuid, &vm->rendezvous_done_cpus)) {
-                        VMM_CTR0(vcpu, "Calling rendezvous func");
-                        (*vm->rendezvous_func)(vcpu, vm->rendezvous_arg);
-                        CPU_SET(vcpuid, &vm->rendezvous_done_cpus);
-                }
-                if (CPU_CMP(&vm->rendezvous_req_cpus,
-                    &vm->rendezvous_done_cpus) == 0) {
-                        VMM_CTR0(vcpu, "Rendezvous completed");
-                        CPU_ZERO(&vm->rendezvous_req_cpus);
-                        vm->rendezvous_func = NULL;
-                        wakeup(&vm->rendezvous_func);
+                if (vm_rendezvous(vcpu))
                         break;
-                }
+
                 VMM_CTR0(vcpu, "Wait for rendezvous completion");
                 mtx_sleep(&vm->rendezvous_func, &vm->rendezvous_mtx, 0,
                     "vmrndv", hz);
                 if (td_ast_pending(td, TDA_SUSPEND)) {
+                        int error;
+
                         mtx_unlock(&vm->rendezvous_mtx);
                         error = thread_check_susp(td, true);
                         if (error != 0)
diff --git a/sys/amd64/vmm/vmm_dev.c b/sys/amd64/vmm/vmm_dev.c
index 5214cd3f1447..032b7471664e 100644
--- a/sys/amd64/vmm/vmm_dev.c
+++ b/sys/amd64/vmm/vmm_dev.c
@@ -179,32 +179,16 @@ vcpu_unlock_one(struct vmmdev_softc *sc, int vcpuid, struct vcpu *vcpu)
 static int
 vcpu_lock_all(struct vmmdev_softc *sc)
 {
-        struct vcpu *vcpu;
         int error;
-        uint16_t i, j, maxcpus;
 
-        error = 0;
-        vm_slock_vcpus(sc->vm);
-        maxcpus = vm_get_maxcpus(sc->vm);
-        for (i = 0; i < maxcpus; i++) {
-                vcpu = vm_vcpu(sc->vm, i);
-                if (vcpu == NULL)
-                        continue;
-                error = vcpu_lock_one(vcpu);
-                if (error)
-                        break;
-        }
-
-        if (error) {
-                for (j = 0; j < i; j++) {
-                        vcpu = vm_vcpu(sc->vm, j);
-                        if (vcpu == NULL)
-                                continue;
-                        vcpu_unlock_one(sc, j, vcpu);
-                }
+        /*
+         * Serialize vcpu_lock_all() callers.  Individual vCPUs are not locked
+         * in a consistent order so we need to serialize to avoid deadlocks.
+         */
+        vm_lock_vcpus(sc->vm);
+        error = vcpu_set_state_all(sc->vm, VCPU_FROZEN);
+        if (error != 0)
                 vm_unlock_vcpus(sc->vm);
-        }
-
         return (error);
 }
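
[Illustrative note, not part of the commit.]  The change above relies on a
"helping" pattern: a thread that wants every vCPU quiesced must not sleep
behind a rendezvous initiator that is, in turn, waiting on that thread, so it
runs the pending callback on behalf of vCPUs that cannot run it themselves.
The sketch below is a minimal user-space model of that idea in plain C with
POSIX threads; all names (freeze_all(), initiator(), flush_cb(), NVCPUS) are
invented for the example and none of this is the vmm code from the diff.

/*
 * toy_freeze_help.c -- hypothetical user-space model of the "help the
 * rendezvous along" idea.  Build with: cc -o toy toy_freeze_help.c -lpthread
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NVCPUS 4

enum vstate { IDLE, FROZEN };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static enum vstate state[NVCPUS];       /* all "vCPUs" start out IDLE */
static void (*pending_cb)(int);         /* non-NULL while a rendezvous is pending */
static bool cb_done[NVCPUS];            /* which vCPUs have run the callback */

static void
flush_cb(int vcpu)
{
        printf("ran callback for vcpu %d\n", vcpu);
}

/* Rendezvous initiator: post a callback and wait until every vCPU has run it. */
static void *
initiator(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        for (int i = 0; i < NVCPUS; i++)
                cb_done[i] = false;
        pending_cb = flush_cb;
        pthread_cond_broadcast(&cv);
        while (pending_cb != NULL)      /* blocked until someone helps */
                pthread_cond_wait(&cv, &lock);
        pthread_mutex_unlock(&lock);
        printf("rendezvous complete\n");
        return (NULL);
}

/* Freeze every vCPU, running a pending callback on behalf of idle ones. */
static void
freeze_all(void)
{
        bool all_done;

        pthread_mutex_lock(&lock);
        /* For a deterministic demo, wait until the rendezvous is posted. */
        while (pending_cb == NULL)
                pthread_cond_wait(&cv, &lock);
        all_done = true;
        for (int i = 0; i < NVCPUS; i++) {
                if (state[i] == IDLE && !cb_done[i]) {
                        pending_cb(i);  /* the "help": run it on the vCPU's behalf */
                        cb_done[i] = true;
                }
                if (!cb_done[i])
                        all_done = false;
        }
        if (all_done) {
                pending_cb = NULL;      /* let the blocked initiator finish */
                pthread_cond_broadcast(&cv);
        }
        for (int i = 0; i < NVCPUS; i++)
                state[i] = FROZEN;
        pthread_mutex_unlock(&lock);
        printf("all vcpus frozen\n");
}

int
main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, initiator, NULL);
        freeze_all();
        pthread_join(t, NULL);  /* without the help in freeze_all(), this would hang */
        return (0);
}

Without the helping loop in freeze_all(), the initiator would wait forever for
vCPUs that are never going to run its callback, which is the shape of the
deadlock the commit removes.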