From owner-svn-src-head@freebsd.org  Thu Jul  5 03:34:00 2018
Message-Id: <201807050333.w653XwbR044932@repo.freebsd.org>
X-Authentication-Warning: repo.freebsd.org: araujo set sender to
 araujo@FreeBSD.org using -f
From: Marcelo Araujo <araujo@FreeBSD.org>
Date: Thu, 5 Jul 2018 03:33:58 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
 svn-src-head@freebsd.org
Subject: svn commit: r335974 - head/usr.sbin/bhyve
X-SVN-Group: head
X-SVN-Commit-Author: araujo
X-SVN-Commit-Paths: head/usr.sbin/bhyve
X-SVN-Commit-Revision: 335974
X-SVN-Commit-Repository: base
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Author: araujo
Date: Thu Jul  5 03:33:58 2018
New Revision: 335974
URL: https://svnweb.freebsd.org/changeset/base/335974

Log:
  - Add bhyve NVMe device emulation.
  
  The initial work on bhyve NVMe device emulation was done by the GSoC student
  Shunsuke Mie and was heavily modified in performance, functionality and
  guest support by Leon Dang.
  
  bhyve:
  	-s <n>,nvme,devpath,maxq=#,qsz=#,ioslots=#,sectsz=#,ser=A-Z
  
  	accepted devpath:
  		/dev/blockdev
  		/path/to/image
  		ram=size_in_MiB
  
  Tested with guest OS: FreeBSD Head, Linux Fedora fc27, Ubuntu 18.04,
                        OpenSuse 15.0, Windows Server 2016 Datacenter.
  Tested with all accepted device paths: Real nvme, zdev and also with ram.
  Tested on: AMD Ryzen Threadripper 1950X 16-Core Processor and
             Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz.
  
  Tests at: https://people.freebsd.org/~araujo/bhyve_nvme/nvme.txt
  
  Submitted by:	Shunsuke Mie <sux2mfgj_gmail.com>,
  		Leon Dang <leon_digitalmsx.com>
  Reviewed by:	chuck (early version), grehan
  Relnotes:	Yes
  Sponsored by:	iXsystems Inc.
  Differential Revision:	https://reviews.freebsd.org/D14022

Added:
  head/usr.sbin/bhyve/pci_nvme.c   (contents, props changed)
Modified:
  head/usr.sbin/bhyve/Makefile
  head/usr.sbin/bhyve/bhyve.8
  head/usr.sbin/bhyve/block_if.h

Modified: head/usr.sbin/bhyve/Makefile
==============================================================================
--- head/usr.sbin/bhyve/Makefile	Thu Jul  5 02:43:10 2018	(r335973)
+++ head/usr.sbin/bhyve/Makefile	Thu Jul  5 03:33:58 2018	(r335974)
@@ -41,6 +41,7 @@ SRCS=	\
 	pci_hostbridge.c	\
 	pci_irq.c		\
 	pci_lpc.c		\
+	pci_nvme.c		\
 	pci_passthru.c		\
 	pci_virtio_block.c	\
 	pci_virtio_console.c	\

Modified: head/usr.sbin/bhyve/bhyve.8
==============================================================================
--- head/usr.sbin/bhyve/bhyve.8	Thu Jul  5 02:43:10 2018	(r335973)
+++ head/usr.sbin/bhyve/bhyve.8	Thu Jul  5 03:33:58 2018	(r335974)
@@ -24,7 +24,7 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd Jun 11, 2018
+.Dd Jul 05, 2018
 .Dt BHYVE 8
 .Os
 .Sh NAME
@@ -241,6 +241,8 @@ The LPC bridge emulation can only be configured on bus
 Raw framebuffer device attached to VNC server.
 .It Li xhci
 eXtensible Host Controller Interface (xHCI) USB controller.
+.It Li nvme
+NVM Express (NVMe) controller.
 .El
 .It Op Ar conf
 This optional parameter describes the backend for device emulations.
@@ -432,6 +434,27 @@ xHCI USB devices:
 .It Li tablet
 A USB tablet device which provides precise cursor synchronization
 when using VNC.
+.El
+.Pp
+NVMe devices:
+.Bl -tag -width 10n
+.It Li devpath
+Accepted device paths are:
+.Ar /dev/blockdev
+or
+.Ar /path/to/image
+or
+.Ar ram=size_in_MiB .
+.It Li maxq
+Max number of queues.
+.It Li qsz
+Max elements in each queue.
+.It Li ioslots
+Max number of concurrent I/O requests.
+.It Li sectsz
+Sector size (defaults to blockif sector size).
+.It Li ser
+Serial number with maximum 20 characters.
 .El
 .El
 .It Fl S
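
The NVMe options documented in the hunk above form a comma-separated list that starts with the device path and continues with key=value pairs. The sketch below shows one way such a string could be split; it is only an illustration, not the parser in pci_nvme.c (that part of the diff is truncated). struct nvme_opts and parse_nvme_opts() are hypothetical names, and the defaults mirror the NVME_QUEUES, NVME_MAX_QENTRIES and NVME_IOSLOTS defines from the added file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the documented NVMe options. */
struct nvme_opts {
	char		devpath[256];
	unsigned	maxq;
	unsigned	qsz;
	unsigned	ioslots;
	unsigned	sectsz;		/* 0 = use the backing store's size */
	char		ser[21];	/* 20 characters plus NUL */
};

/* Returns 0 on success, -1 on a missing devpath or an unknown key. */
static int
parse_nvme_opts(const char *arg, struct nvme_opts *o)
{
	const char *p = arg;
	int first = 1;

	memset(o, 0, sizeof(*o));
	o->maxq = 16;		/* NVME_QUEUES */
	o->qsz = 2048;		/* NVME_MAX_QENTRIES */
	o->ioslots = 8;		/* NVME_IOSLOTS */

	while (*p != '\0') {
		const char *end = strchr(p, ',');
		size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);
		char tok[256];

		if (len >= sizeof(tok))
			return (-1);
		memcpy(tok, p, len);
		tok[len] = '\0';

		if (first) {
			/* First field is always devpath, e.g. ram=1024. */
			memcpy(o->devpath, tok, len + 1);
			first = 0;
		} else if (strncmp(tok, "maxq=", 5) == 0)
			o->maxq = (unsigned)strtoul(tok + 5, NULL, 10);
		else if (strncmp(tok, "qsz=", 4) == 0)
			o->qsz = (unsigned)strtoul(tok + 4, NULL, 10);
		else if (strncmp(tok, "ioslots=", 8) == 0)
			o->ioslots = (unsigned)strtoul(tok + 8, NULL, 10);
		else if (strncmp(tok, "sectsz=", 7) == 0)
			o->sectsz = (unsigned)strtoul(tok + 7, NULL, 10);
		else if (strncmp(tok, "ser=", 4) == 0) {
			if (strlen(tok + 4) > 20)
				return (-1);
			strcpy(o->ser, tok + 4);
		} else
			return (-1);

		p += len;
		if (*p == ',')
			p++;
	}
	return (first ? -1 : 0);
}
```

With this sketch, a string such as "ram=1024,maxq=4,ser=BHYVE0001" yields devpath "ram=1024", maxq 4, and the default values for the keys that were not given.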

Modified: head/usr.sbin/bhyve/block_if.h
==============================================================================
--- head/usr.sbin/bhyve/block_if.h	Thu Jul  5 02:43:10 2018	(r335973)
+++ head/usr.sbin/bhyve/block_if.h	Thu Jul  5 03:33:58 2018	(r335974)
@@ -44,12 +44,12 @@
 #define BLOCKIF_IOV_MAX		33	/* not practical to be IOV_MAX */
 
 struct blockif_req {
-	struct iovec	br_iov[BLOCKIF_IOV_MAX];
 	int		br_iovcnt;
 	off_t		br_offset;
 	ssize_t		br_resid;
 	void		(*br_callback)(struct blockif_req *req, int err);
 	void		*br_param;
+	struct iovec	br_iov[BLOCKIF_IOV_MAX];
 };
 
 struct blockif_ctxt;
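
The hunk above moves br_iov to the end of struct blockif_req. The log does not say why; a plausible reading, inferred from the iovpadding member that pci_nvme_ioreq places right after its blockif_req, is that a trailing array lets a containing structure append extra iovec slots directly behind it. The sketch below checks that layout with offsetof(). The structures are hypothetical stand-ins, and indexing past the declared array bound relies on layout behaviour that ISO C does not strictly guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#define	IOV_BASE_SLOTS	33	/* mirrors BLOCKIF_IOV_MAX */
#define	IOV_TOTAL_SLOTS	512	/* mirrors NVME_MAX_BLOCKIOVS */

/* Hypothetical stand-in for struct blockif_req after the change. */
struct req {
	int		iovcnt;
	off_t		offset;
	ssize_t		resid;
	void		*param;
	struct iovec	iov[IOV_BASE_SLOTS];	/* must stay last */
};

/* Hypothetical container that pads the array out to 512 slots. */
struct big_req {
	struct req	r;
	struct iovec	padding[IOV_TOTAL_SLOTS - IOV_BASE_SLOTS];
};

/* True when padding[] starts exactly where r.iov[] ends. */
static int
slots_are_contiguous(void)
{
	size_t iov_end = offsetof(struct big_req, r) +
	    offsetof(struct req, iov) +
	    sizeof(struct iovec) * IOV_BASE_SLOTS;

	return (iov_end == offsetof(struct big_req, padding));
}

/* Number of iovec-sized slots from r.iov[0] to the end of big_req. */
static size_t
total_slots(void)
{
	return ((sizeof(struct big_req) - offsetof(struct req, iov)) /
	    sizeof(struct iovec));
}
```

Had the array stayed at the start of the structure, the other members would sit between it and any padding, and the two regions could not be treated as one run of descriptors.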

Added: head/usr.sbin/bhyve/pci_nvme.c
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/usr.sbin/bhyve/pci_nvme.c	Thu Jul  5 03:33:58 2018	(r335974)
@@ -0,0 +1,1853 @@
+/*-
+ * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
+ *
+ * Copyright (c) 2017 Shunsuke Mie
+ * Copyright (c) 2018 Leon Dang
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * bhyve PCIe-NVMe device emulation.
+ *
+ * options:
+ *  -s <n>,nvme,devpath,maxq=#,qsz=#,ioslots=#,sectsz=#,ser=A-Z
+ *
+ *  accepted devpath:
+ *    /dev/blockdev
+ *    /path/to/image
+ *    ram=size_in_MiB
+ *
+ *  maxq    = max number of queues
+ *  qsz     = max elements in each queue
+ *  ioslots = max number of concurrent io requests
+ *  sectsz  = sector size (defaults to blockif sector size)
+ *  ser     = serial number (20-chars max)
+ *
+ */
+
+/* TODO:
+    - create async event for smart and log
+    - intr coalesce
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/types.h>
+
+#include <assert.h>
+#include <pthread.h>
+#include <semaphore.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <machine/atomic.h>
+#include <machine/vmm.h>
+#include <vmmapi.h>
+
+#include <dev/nvme/nvme.h>
+
+#include "bhyverun.h"
+#include "block_if.h"
+#include "pci_emul.h"
+
+
+static int nvme_debug = 0;
+#define	DPRINTF(params) if (nvme_debug) printf params
+#define	WPRINTF(params) printf params
+
+/* defaults; can be overridden */
+#define	NVME_MSIX_BAR		4
+
+#define	NVME_IOSLOTS		8
+
+#define	NVME_QUEUES		16
+#define	NVME_MAX_QENTRIES	2048
+
+#define	NVME_PRP2_ITEMS		(PAGE_SIZE/sizeof(uint64_t))
+#define	NVME_MAX_BLOCKIOVS	512
+
+/* helpers */
+
+#define	NVME_DOORBELL_OFFSET	offsetof(struct nvme_registers, doorbell)
+
+enum nvme_controller_register_offsets {
+	NVME_CR_CAP_LOW = 0x00,
+	NVME_CR_CAP_HI  = 0x04,
+	NVME_CR_VS      = 0x08,
+	NVME_CR_INTMS   = 0x0c,
+	NVME_CR_INTMC   = 0x10,
+	NVME_CR_CC      = 0x14,
+	NVME_CR_CSTS    = 0x1c,
+	NVME_CR_NSSR    = 0x20,
+	NVME_CR_AQA     = 0x24,
+	NVME_CR_ASQ_LOW = 0x28,
+	NVME_CR_ASQ_HI  = 0x2c,
+	NVME_CR_ACQ_LOW = 0x30,
+	NVME_CR_ACQ_HI  = 0x34,
+};
+
+enum nvme_cmd_cdw11 {
+	NVME_CMD_CDW11_PC  = 0x0001,
+	NVME_CMD_CDW11_IEN = 0x0002,
+	NVME_CMD_CDW11_IV  = 0xFFFF0000,
+};
+
+#define	NVME_CMD_GET_OPC(opc) \
+	((opc) >> NVME_CMD_OPC_SHIFT & NVME_CMD_OPC_MASK)
+
+#define	NVME_CQ_INTEN	0x01
+#define	NVME_CQ_INTCOAL	0x02
+
+struct nvme_completion_queue {
+	struct nvme_completion *qbase;
+	uint32_t	size;
+	uint16_t	tail; /* nvme progress */
+	uint16_t	head; /* guest progress */
+	uint16_t	intr_vec;
+	uint32_t	intr_en;
+	pthread_mutex_t	mtx;
+};
+
+struct nvme_submission_queue {
+	struct nvme_command *qbase;
+	uint32_t	size;
+	uint16_t	head; /* nvme progress */
+	uint16_t	tail; /* guest progress */
+	uint16_t	cqid; /* completion queue id */
+	int		busy; /* queue is being processed */
+	int		qpriority;
+};
+
+enum nvme_storage_type {
+	NVME_STOR_BLOCKIF = 0,
+	NVME_STOR_RAM = 1,
+};
+
+struct pci_nvme_blockstore {
+	enum nvme_storage_type type;
+	void		*ctx;
+	uint64_t	size;
+	uint32_t	sectsz;
+	uint32_t	sectsz_bits;
+};
+
+struct pci_nvme_ioreq {
+	struct pci_nvme_softc *sc;
+	struct pci_nvme_ioreq *next;
+	struct nvme_submission_queue *nvme_sq;
+	uint16_t	sqid;
+
+	/* command information */
+	uint16_t	opc;
+	uint16_t	cid;
+	uint32_t	nsid;
+
+	uint64_t	prev_gpaddr;
+	size_t		prev_size;
+
+	/*
+	 * lock if all iovs consumed (big IO);
+	 * complete transaction before continuing
+	 */
+	pthread_mutex_t	mtx;
+	pthread_cond_t	cv;
+
+	struct blockif_req io_req;
+
+	/* pad to fit up to 512 page descriptors from guest IO request */
+	struct iovec	iovpadding[NVME_MAX_BLOCKIOVS-BLOCKIF_IOV_MAX];
+};
+
+struct pci_nvme_softc {
+	struct pci_devinst *nsc_pi;
+
+	pthread_mutex_t	mtx;
+
+	struct nvme_registers regs;
+
+	struct nvme_namespace_data  nsdata;
+	struct nvme_controller_data ctrldata;
+
+	struct pci_nvme_blockstore nvstore;
+
+	uint16_t	max_qentries; /* max entries per queue */
+	uint32_t	max_queues;
+	uint32_t	num_cqueues;
+	uint32_t	num_squeues;
+
+	struct pci_nvme_ioreq *ioreqs;
+	struct pci_nvme_ioreq *ioreqs_free; /* free list of ioreqs */
+	uint32_t	pending_ios;
+	uint32_t	ioslots;
+	sem_t		iosemlock;
+
+	/* status and guest memory mapped queues */
+	struct nvme_completion_queue *compl_queues;
+	struct nvme_submission_queue *submit_queues;
+
+	/* controller features */
+	uint32_t	intr_coales_aggr_time;   /* 0x08: uS to delay intr */
+	uint32_t	intr_coales_aggr_thresh; /* 0x08: compl-Q entries */
+	uint32_t	async_ev_config;         /* 0x0B: async event config */
+};
+
+
+static void pci_nvme_io_partial(struct blockif_req *br, int err);
+
+/* Controller Configuration utils */
+#define	NVME_CC_GET_EN(cc) \
+	((cc) >> NVME_CC_REG_EN_SHIFT & NVME_CC_REG_EN_MASK)
+#define	NVME_CC_GET_CSS(cc) \
+	((cc) >> NVME_CC_REG_CSS_SHIFT & NVME_CC_REG_CSS_MASK)
+#define	NVME_CC_GET_SHN(cc) \
+	((cc) >> NVME_CC_REG_SHN_SHIFT & NVME_CC_REG_SHN_MASK)
+#define	NVME_CC_GET_IOSQES(cc) \
+	((cc) >> NVME_CC_REG_IOSQES_SHIFT & NVME_CC_REG_IOSQES_MASK)
+#define	NVME_CC_GET_IOCQES(cc) \
+	((cc) >> NVME_CC_REG_IOCQES_SHIFT & NVME_CC_REG_IOCQES_MASK)
+
+#define	NVME_CC_WRITE_MASK \
+	((NVME_CC_REG_EN_MASK << NVME_CC_REG_EN_SHIFT) | \
+	 (NVME_CC_REG_IOSQES_MASK << NVME_CC_REG_IOSQES_SHIFT) | \
+	 (NVME_CC_REG_IOCQES_MASK << NVME_CC_REG_IOCQES_SHIFT))
+
+#define	NVME_CC_NEN_WRITE_MASK \
+	((NVME_CC_REG_CSS_MASK << NVME_CC_REG_CSS_SHIFT) | \
+	 (NVME_CC_REG_MPS_MASK << NVME_CC_REG_MPS_SHIFT) | \
+	 (NVME_CC_REG_AMS_MASK << NVME_CC_REG_AMS_SHIFT))
+
+/* Controller Status utils */
+#define	NVME_CSTS_GET_RDY(sts) \
+	((sts) >> NVME_CSTS_REG_RDY_SHIFT & NVME_CSTS_REG_RDY_MASK)
+
+#define	NVME_CSTS_RDY	(1 << NVME_CSTS_REG_RDY_SHIFT)
+
+/* Completion Queue status word utils */
+#define	NVME_STATUS_P	(1 << NVME_STATUS_P_SHIFT)
+#define	NVME_STATUS_MASK \
+	((NVME_STATUS_SCT_MASK << NVME_STATUS_SCT_SHIFT) |\
+	 (NVME_STATUS_SC_MASK << NVME_STATUS_SC_SHIFT))
+
+static __inline void
+pci_nvme_status_tc(uint16_t *status, uint16_t type, uint16_t code)
+{
+
+	*status &= ~NVME_STATUS_MASK;
+	*status |= (type & NVME_STATUS_SCT_MASK) << NVME_STATUS_SCT_SHIFT |
+		(code & NVME_STATUS_SC_MASK) << NVME_STATUS_SC_SHIFT;
+}
+
+static __inline void
+pci_nvme_status_genc(uint16_t *status, uint16_t code)
+{
+
+	pci_nvme_status_tc(status, NVME_SCT_GENERIC, code);
+}
+
+static __inline void
+pci_nvme_toggle_phase(uint16_t *status, int prev)
+{
+
+	if (prev)
+		*status &= ~NVME_STATUS_P;
+	else
+		*status |= NVME_STATUS_P;
+}
+
+static void
+pci_nvme_init_ctrldata(struct pci_nvme_softc *sc)
+{
+	struct nvme_controller_data *cd = &sc->ctrldata;
+
+	cd->vid = 0xFB5D;
+	cd->ssvid = 0x0000;
+
+	cd->mn[0] = 'b';
+	cd->mn[1] = 'h';
+	cd->mn[2] = 'y';
+	cd->mn[3] = 'v';
+	cd->mn[4] = 'e';
+	cd->mn[5] = '-';
+	cd->mn[6] = 'N';
+	cd->mn[7] = 'V';
+	cd->mn[8] = 'M';
+	cd->mn[9] = 'e';
+
+	cd->fr[0] = '1';
+	cd->fr[1] = '.';
+	cd->fr[2] = '0';
+
+	/* Num of submission commands that we can handle at a time (2^rab) */
+	cd->rab   = 4;
+
+	/* FreeBSD OUI */
+	cd->ieee[0] = 0x58;
+	cd->ieee[1] = 0x9c;
+	cd->ieee[2] = 0xfc;
+
+	cd->mic = 0;
+
+	cd->mdts = 9;	/* max data transfer size (2^mdts * CAP.MPSMIN) */
+
+	cd->ver = 0x00010300;
+
+	cd->oacs = 1 << NVME_CTRLR_DATA_OACS_FORMAT_SHIFT;
+	cd->acl = 2;
+	cd->aerl = 4;
+
+	cd->lpa = 0;	/* TODO: support some simple things like SMART */
+	cd->elpe = 0;	/* max error log page entries */
+	cd->npss = 1;	/* number of power states supported */
+
+	/* Warning Composite Temperature Threshold */
+	cd->wctemp = 0x0157;
+
+	cd->sqes = (6 << NVME_CTRLR_DATA_SQES_MAX_SHIFT) |
+	    (6 << NVME_CTRLR_DATA_SQES_MIN_SHIFT);
+	cd->cqes = (4 << NVME_CTRLR_DATA_CQES_MAX_SHIFT) |
+	    (4 << NVME_CTRLR_DATA_CQES_MIN_SHIFT);
+	cd->nn = 1;	/* number of namespaces */
+
+	cd->fna = 0x03;
+
+	cd->power_state[0].mp = 10;
+}
+
+static void
+pci_nvme_init_nsdata(struct pci_nvme_softc *sc)
+{
+	struct nvme_namespace_data *nd;
+
+	nd = &sc->nsdata;
+
+	nd->nsze = sc->nvstore.size / sc->nvstore.sectsz;
+	nd->ncap = nd->nsze;
+	nd->nuse = nd->nsze;
+
+	/* Get LBA and backstore information from backing store */
+	nd->nlbaf = 1;
+	/* LBA data-sz = 2^lbads */
+	nd->lbaf[0] = sc->nvstore.sectsz_bits << NVME_NS_DATA_LBAF_LBADS_SHIFT;
+
+	nd->flbas = 0;
+}
+
+static void
+pci_nvme_reset(struct pci_nvme_softc *sc)
+{
+	DPRINTF(("%s\r\n", __func__));
+
+	sc->regs.cap_lo = (sc->max_qentries & NVME_CAP_LO_REG_MQES_MASK) |
+	    (1 << NVME_CAP_LO_REG_CQR_SHIFT) |
+	    (60 << NVME_CAP_LO_REG_TO_SHIFT);
+
+	sc->regs.cap_hi = 1 << NVME_CAP_HI_REG_CSS_NVM_SHIFT;
+
+	sc->regs.vs = 0x00010300;	/* NVMe v1.3 */
+
+	sc->regs.cc = 0;
+	sc->regs.csts = 0;
+
+	if (sc->submit_queues != NULL) {
+		pthread_mutex_lock(&sc->mtx);
+		sc->num_cqueues = sc->num_squeues = sc->max_queues;
+
+		for (int i = 0; i <= sc->max_queues; i++) {
+			/*
+			 * The Admin Submission Queue is at index 0.
+			 * It must not be changed at reset otherwise the
+			 * emulation will be out of sync with the guest.
+			 */
+			if (i != 0) {
+				sc->submit_queues[i].qbase = NULL;
+				sc->submit_queues[i].size = 0;
+				sc->submit_queues[i].cqid = 0;
+
+				sc->compl_queues[i].qbase = NULL;
+				sc->compl_queues[i].size = 0;
+			}
+			sc->submit_queues[i].tail = 0;
+			sc->submit_queues[i].head = 0;
+			sc->submit_queues[i].busy = 0;
+
+			sc->compl_queues[i].tail = 0;
+			sc->compl_queues[i].head = 0;
+		}
+
+		pthread_mutex_unlock(&sc->mtx);
+	} else
+		sc->submit_queues = calloc(sc->max_queues + 1,
+		                        sizeof(struct nvme_submission_queue));
+
+	if (sc->compl_queues == NULL) {
+		sc->compl_queues = calloc(sc->max_queues + 1,
+		                        sizeof(struct nvme_completion_queue));
+
+		for (int i = 0; i <= sc->max_queues; i++)
+			pthread_mutex_init(&sc->compl_queues[i].mtx, NULL);
+	}
+}
+
+static void
+pci_nvme_init_controller(struct vmctx *ctx, struct pci_nvme_softc *sc)
+{
+	uint16_t acqs, asqs;
+
+	DPRINTF(("%s\r\n", __func__));
+
+	asqs = (sc->regs.aqa & NVME_AQA_REG_ASQS_MASK) + 1;
+	sc->submit_queues[0].size = asqs;
+	sc->submit_queues[0].qbase = vm_map_gpa(ctx, sc->regs.asq,
+	            sizeof(struct nvme_command) * asqs);
+
+	DPRINTF(("%s mapping Admin-SQ guest 0x%lx, host: %p\r\n",
+	        __func__, sc->regs.asq, sc->submit_queues[0].qbase));
+
+	acqs = ((sc->regs.aqa >> NVME_AQA_REG_ACQS_SHIFT) & 
+	    NVME_AQA_REG_ACQS_MASK) + 1;
+	sc->compl_queues[0].size = acqs;
+	sc->compl_queues[0].qbase = vm_map_gpa(ctx, sc->regs.acq,
+	         sizeof(struct nvme_completion) * acqs);
+	DPRINTF(("%s mapping Admin-CQ guest 0x%lx, host: %p\r\n",
+	        __func__, sc->regs.acq, sc->compl_queues[0].qbase));
+}
+
+static int
+nvme_opc_delete_io_sq(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	uint16_t qid = command->cdw10 & 0xffff;
+
+	DPRINTF(("%s DELETE_IO_SQ %u\r\n", __func__, qid));
+	if (qid == 0 || qid > sc->num_squeues) {
+		WPRINTF(("%s NOT PERMITTED queue id %u / num_squeues %u\r\n",
+		        __func__, qid, sc->num_squeues));
+		pci_nvme_status_tc(&compl->status, NVME_SCT_COMMAND_SPECIFIC,
+		    NVME_SC_INVALID_QUEUE_IDENTIFIER);
+		return (1);
+	}
+
+	sc->submit_queues[qid].qbase = NULL;
+	pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+	return (1);
+}
+
+static int
+nvme_opc_create_io_sq(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	if (command->cdw11 & NVME_CMD_CDW11_PC) {
+		uint16_t qid = command->cdw10 & 0xffff;
+		struct nvme_submission_queue *nsq;
+
+		if (qid > sc->num_squeues) {
+			WPRINTF(("%s queue index %u > num_squeues %u\r\n",
+			        __func__, qid, sc->num_squeues));
+			pci_nvme_status_tc(&compl->status,
+			    NVME_SCT_COMMAND_SPECIFIC,
+			    NVME_SC_INVALID_QUEUE_IDENTIFIER);
+			return (1);
+		}
+
+		nsq = &sc->submit_queues[qid];
+		nsq->size = ((command->cdw10 >> 16) & 0xffff) + 1;
+
+		nsq->qbase = vm_map_gpa(sc->nsc_pi->pi_vmctx, command->prp1,
+		              sizeof(struct nvme_command) * (size_t)nsq->size);
+		nsq->cqid = (command->cdw11 >> 16) & 0xffff;
+		nsq->qpriority = (command->cdw11 >> 1) & 0x03;
+
+		DPRINTF(("%s sq %u size %u gaddr %p cqid %u\r\n", __func__,
+		        qid, nsq->size, nsq->qbase, nsq->cqid));
+
+		pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+
+		DPRINTF(("%s completed creating IOSQ qid %u\r\n",
+		         __func__, qid));
+	} else {
+		/* 
+		 * Guest sent non-contiguous submission queue request.
+		 * This setting is unsupported by this emulation.
+		 */
+		WPRINTF(("%s unsupported non-contig (list-based) "
+		         "create i/o submission queue\r\n", __func__));
+
+		pci_nvme_status_genc(&compl->status, NVME_SC_INVALID_FIELD);
+	}
+	return (1);
+}
+
+static int
+nvme_opc_delete_io_cq(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	uint16_t qid = command->cdw10 & 0xffff;
+
+	DPRINTF(("%s DELETE_IO_CQ %u\r\n", __func__, qid));
+	if (qid == 0 || qid > sc->num_cqueues) {
+		WPRINTF(("%s queue index %u / num_cqueues %u\r\n",
+		        __func__, qid, sc->num_cqueues));
+		pci_nvme_status_tc(&compl->status, NVME_SCT_COMMAND_SPECIFIC,
+		    NVME_SC_INVALID_QUEUE_IDENTIFIER);
+		return (1);
+	}
+
+	sc->compl_queues[qid].qbase = NULL;
+	pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+	return (1);
+}
+
+static int
+nvme_opc_create_io_cq(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	if (command->cdw11 & NVME_CMD_CDW11_PC) {
+		uint16_t qid = command->cdw10 & 0xffff;
+		struct nvme_completion_queue *ncq;
+
+		if (qid > sc->num_cqueues) {
+			WPRINTF(("%s queue index %u > num_cqueues %u\r\n",
+			        __func__, qid, sc->num_cqueues));
+			pci_nvme_status_tc(&compl->status,
+			    NVME_SCT_COMMAND_SPECIFIC,
+			    NVME_SC_INVALID_QUEUE_IDENTIFIER);
+			return (1);
+		}
+
+		ncq = &sc->compl_queues[qid];
+		ncq->intr_en = (command->cdw11 & NVME_CMD_CDW11_IEN) >> 1;
+		ncq->intr_vec = (command->cdw11 >> 16) & 0xffff;
+		ncq->size = ((command->cdw10 >> 16) & 0xffff) + 1;
+
+		ncq->qbase = vm_map_gpa(sc->nsc_pi->pi_vmctx,
+		    command->prp1,
+		    sizeof(struct nvme_completion) * (size_t)ncq->size);
+
+		pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+	} else {
+		/* 
+		 * Non-contig completion queue unsupported.
+		 */
+		WPRINTF(("%s unsupported non-contig (list-based) "
+		         "create i/o completion queue\r\n",
+		         __func__));
+
+		/* 0x12 = Invalid Use of Controller Memory Buffer */
+		pci_nvme_status_genc(&compl->status, 0x12);
+	}
+
+	return (1);
+}
+
+static int
+nvme_opc_get_log_page(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	uint32_t logsize = (1 + ((command->cdw10 >> 16) & 0xFFF)) * 2;
+	uint8_t logpage = command->cdw10 & 0xFF;
+	void *data;
+
+	DPRINTF(("%s log page %u len %u\r\n", __func__, logpage, logsize));
+
+	if (logpage >= 1 && logpage <= 3)
+		data = vm_map_gpa(sc->nsc_pi->pi_vmctx, command->prp1,
+		                  PAGE_SIZE);
+
+	pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+
+	switch (logpage) {
+	case 0x01: /* Error information */
+		memset(data, 0, logsize > PAGE_SIZE ? PAGE_SIZE : logsize);
+		break;
+	case 0x02: /* SMART/Health information */
+		/* TODO: present some smart info */
+		memset(data, 0, logsize > PAGE_SIZE ? PAGE_SIZE : logsize);
+		break;
+	case 0x03: /* Firmware slot information */
+		memset(data, 0, logsize > PAGE_SIZE ? PAGE_SIZE : logsize);
+		break;
+	default:
+		WPRINTF(("%s get log page %x command not supported\r\n",
+		        __func__, logpage));
+
+		pci_nvme_status_tc(&compl->status, NVME_SCT_COMMAND_SPECIFIC,
+		    NVME_SC_INVALID_LOG_PAGE);
+	}
+
+	return (1);
+}
+
+static int
+nvme_opc_identify(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	void *dest;
+
+	DPRINTF(("%s identify 0x%x nsid 0x%x\r\n", __func__,
+	        command->cdw10 & 0xFF, command->nsid));
+
+	switch (command->cdw10 & 0xFF) {
+	case 0x00: /* return Identify Namespace data structure */
+		dest = vm_map_gpa(sc->nsc_pi->pi_vmctx, command->prp1,
+		                  sizeof(sc->nsdata));
+		memcpy(dest, &sc->nsdata, sizeof(sc->nsdata));
+		break;
+	case 0x01: /* return Identify Controller data structure */
+		dest = vm_map_gpa(sc->nsc_pi->pi_vmctx, command->prp1,
+		                  sizeof(sc->ctrldata));
+		memcpy(dest, &sc->ctrldata, sizeof(sc->ctrldata));
+		break;
+	case 0x02: /* list of 1024 active NSIDs > CDW1.NSID */
+		dest = vm_map_gpa(sc->nsc_pi->pi_vmctx, command->prp1,
+		                  sizeof(uint32_t) * 1024);
+		((uint32_t *)dest)[0] = 1;
+		((uint32_t *)dest)[1] = 0;
+		break;
+	case 0x11:
+		pci_nvme_status_genc(&compl->status,
+		    NVME_SC_INVALID_NAMESPACE_OR_FORMAT);
+		return (1);
+	case 0x03: /* list of NSID structures in CDW1.NSID, 4096 bytes */
+	case 0x10:
+	case 0x12:
+	case 0x13:
+	case 0x14:
+	case 0x15:
+	default:
+		DPRINTF(("%s unsupported identify command requested 0x%x\r\n",
+		         __func__, command->cdw10 & 0xFF));
+		pci_nvme_status_genc(&compl->status, NVME_SC_INVALID_FIELD);
+		return (1);
+	}
+
+	pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+	return (1);
+}
+
+static int
+nvme_opc_set_features(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	int feature = command->cdw10 & 0xFF;
+	uint32_t iv;
+
+	DPRINTF(("%s feature 0x%x\r\n", __func__, feature));
+	compl->cdw0 = 0;
+
+	switch (feature) {
+	case NVME_FEAT_ARBITRATION:
+		DPRINTF(("  arbitration 0x%x\r\n", command->cdw11));
+		break;
+	case NVME_FEAT_POWER_MANAGEMENT:
+		DPRINTF(("  power management 0x%x\r\n", command->cdw11));
+		break;
+	case NVME_FEAT_LBA_RANGE_TYPE:
+		DPRINTF(("  lba range 0x%x\r\n", command->cdw11));
+		break;
+	case NVME_FEAT_TEMPERATURE_THRESHOLD:
+		DPRINTF(("  temperature threshold 0x%x\r\n", command->cdw11));
+		break;
+	case NVME_FEAT_ERROR_RECOVERY:
+		DPRINTF(("  error recovery 0x%x\r\n", command->cdw11));
+		break;
+	case NVME_FEAT_VOLATILE_WRITE_CACHE:
+		DPRINTF(("  volatile write cache 0x%x\r\n", command->cdw11));
+		break;
+	case NVME_FEAT_NUMBER_OF_QUEUES:
+		sc->num_squeues = command->cdw11 & 0xFFFF;
+		sc->num_cqueues = (command->cdw11 >> 16) & 0xFFFF;
+		DPRINTF(("  number of queues (submit %u, completion %u)\r\n",
+		        sc->num_squeues, sc->num_cqueues));
+
+		if (sc->num_squeues == 0 || sc->num_squeues > sc->max_queues)
+			sc->num_squeues = sc->max_queues;
+		if (sc->num_cqueues == 0 || sc->num_cqueues > sc->max_queues)
+			sc->num_cqueues = sc->max_queues;
+
+		compl->cdw0 = (sc->num_squeues & 0xFFFF) |
+		              ((sc->num_cqueues & 0xFFFF) << 16);
+
+		break;
+	case NVME_FEAT_INTERRUPT_COALESCING:
+		DPRINTF(("  interrupt coalescing 0x%x\r\n", command->cdw11));
+
+		/* in uS */
+		sc->intr_coales_aggr_time = ((command->cdw11 >> 8) & 0xFF)*100;
+
+		sc->intr_coales_aggr_thresh = command->cdw11 & 0xFF;
+		break;
+	case NVME_FEAT_INTERRUPT_VECTOR_CONFIGURATION:
+		iv = command->cdw11 & 0xFFFF;
+
+		DPRINTF(("  interrupt vector configuration 0x%x\r\n",
+		        command->cdw11));
+
+		for (uint32_t i = 0; i <= sc->num_cqueues; i++) {
+			if (sc->compl_queues[i].intr_vec == iv) {
+				if (command->cdw11 & (1 << 16))
+					sc->compl_queues[i].intr_en |=
+					                      NVME_CQ_INTCOAL;  
+				else
+					sc->compl_queues[i].intr_en &=
+					                     ~NVME_CQ_INTCOAL;  
+			}
+		}
+		break;
+	case NVME_FEAT_WRITE_ATOMICITY:
+		DPRINTF(("  write atomicity 0x%x\r\n", command->cdw11));
+		break;
+	case NVME_FEAT_ASYNC_EVENT_CONFIGURATION:
+		DPRINTF(("  async event configuration 0x%x\r\n",
+		        command->cdw11));
+		sc->async_ev_config = command->cdw11;
+		break;
+	case NVME_FEAT_SOFTWARE_PROGRESS_MARKER:
+		DPRINTF(("  software progress marker 0x%x\r\n",
+		        command->cdw11));
+		break;
+	case 0x0C:
+		DPRINTF(("  autonomous power state transition 0x%x\r\n",
+		        command->cdw11));
+		break;
+	default:
+		WPRINTF(("%s invalid feature\r\n", __func__));
+		pci_nvme_status_genc(&compl->status, NVME_SC_INVALID_FIELD);
+		return (1);
+	}
+
+	pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+	return (1);
+}
+
+static int
+nvme_opc_get_features(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	int feature = command->cdw10 & 0xFF;
+
+	DPRINTF(("%s feature 0x%x\r\n", __func__, feature));
+
+	compl->cdw0 = 0;
+
+	switch (feature) {
+	case NVME_FEAT_ARBITRATION:
+		DPRINTF(("  arbitration\r\n"));
+		break;
+	case NVME_FEAT_POWER_MANAGEMENT:
+		DPRINTF(("  power management\r\n"));
+		break;
+	case NVME_FEAT_LBA_RANGE_TYPE:
+		DPRINTF(("  lba range\r\n"));
+		break;
+	case NVME_FEAT_TEMPERATURE_THRESHOLD:
+		DPRINTF(("  temperature threshold\r\n"));
+		switch ((command->cdw11 >> 20) & 0x3) {
+		case 0:
+			/* Over temp threshold */
+			compl->cdw0 = 0xFFFF;
+			break;
+		case 1:
+			/* Under temp threshold */
+			compl->cdw0 = 0;
+			break;
+		default:
+			WPRINTF(("  invalid threshold type select\r\n"));
+			pci_nvme_status_genc(&compl->status,
+			    NVME_SC_INVALID_FIELD);
+			return (1);
+		}
+		break;
+	case NVME_FEAT_ERROR_RECOVERY:
+		DPRINTF(("  error recovery\r\n"));
+		break;
+	case NVME_FEAT_VOLATILE_WRITE_CACHE:
+		DPRINTF(("  volatile write cache\r\n"));
+		break;
+	case NVME_FEAT_NUMBER_OF_QUEUES:
+		compl->cdw0 = 0;
+		if (sc->num_squeues == 0)
+			compl->cdw0 |= sc->max_queues & 0xFFFF;
+		else
+			compl->cdw0 |= sc->num_squeues & 0xFFFF;
+
+		if (sc->num_cqueues == 0)
+			compl->cdw0 |= (sc->max_queues & 0xFFFF) << 16;
+		else
+			compl->cdw0 |= (sc->num_cqueues & 0xFFFF) << 16;
+
+		DPRINTF(("  number of queues (submit %u, completion %u)\r\n",
+		        compl->cdw0 & 0xFFFF,
+		        (compl->cdw0 >> 16) & 0xFFFF));
+
+		break;
+	case NVME_FEAT_INTERRUPT_COALESCING:
+		DPRINTF(("  interrupt coalescing\r\n"));
+		break;
+	case NVME_FEAT_INTERRUPT_VECTOR_CONFIGURATION:
+		DPRINTF(("  interrupt vector configuration\r\n"));
+		break;
+	case NVME_FEAT_WRITE_ATOMICITY:
+		DPRINTF(("  write atomicity\r\n"));
+		break;
+	case NVME_FEAT_ASYNC_EVENT_CONFIGURATION:
+		DPRINTF(("  async event configuration\r\n"));
+		compl->cdw0 = sc->async_ev_config;
+		break;
+	case NVME_FEAT_SOFTWARE_PROGRESS_MARKER:
+		DPRINTF(("  software progress marker\r\n"));
+		break;
+	case 0x0C:
+		DPRINTF(("  autonomous power state transition\r\n"));
+		break;
+	default:
+		WPRINTF(("%s invalid feature 0x%x\r\n", __func__, feature));
+		pci_nvme_status_genc(&compl->status, NVME_SC_INVALID_FIELD);
+		return (1);
+	}
+
+	pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+	return (1);
+}
+
+static int
+nvme_opc_abort(struct pci_nvme_softc* sc, struct nvme_command* command,
+	struct nvme_completion* compl)
+{
+	DPRINTF(("%s submission queue %u, command ID 0x%x\r\n", __func__,
+	        command->cdw10 & 0xFFFF, (command->cdw10 >> 16) & 0xFFFF));
+
+	/* TODO: search for the command ID and abort it */
+
+	compl->cdw0 = 1;
+	pci_nvme_status_genc(&compl->status, NVME_SC_SUCCESS);
+	return (1);
+}
+
+static int
+nvme_opc_async_event_req(struct pci_nvme_softc* sc,
+	struct nvme_command* command, struct nvme_completion* compl)
+{
+	DPRINTF(("%s async event request 0x%x\r\n", __func__, command->cdw11));
+
+	/*
+	 * TODO: raise events when they happen based on the Set Features cmd.
+	 * These events happen async, so only set completion successful if
+	 * there is an event reflective of the request to get event.
+	 */
+	pci_nvme_status_tc(&compl->status, NVME_SCT_COMMAND_SPECIFIC,
+	    NVME_SC_ASYNC_EVENT_REQUEST_LIMIT_EXCEEDED);
+	return (0);
+}
+
+static void
+pci_nvme_handle_admin_cmd(struct pci_nvme_softc* sc, uint64_t value)
+{
+	struct nvme_completion compl;
+	struct nvme_command *cmd;
+	struct nvme_submission_queue *sq;
+	struct nvme_completion_queue *cq;
+	int do_intr = 0;
+	uint16_t sqhead;
+
+	DPRINTF(("%s index %u\r\n", __func__, (uint32_t)value));
+
+	sq = &sc->submit_queues[0];
+
+	sqhead = atomic_load_acq_short(&sq->head);
+
+	if (atomic_testandset_int(&sq->busy, 1)) {
+		DPRINTF(("%s SQ busy, head %u, tail %u\r\n",
+		        __func__, sqhead, sq->tail));
+		return;
+	}
+
+	DPRINTF(("sqhead %u, tail %u\r\n", sqhead, sq->tail));
+	
+	while (sqhead != atomic_load_acq_short(&sq->tail)) {
+		cmd = &(sq->qbase)[sqhead];
+		compl.status = 0;
+
+		switch (NVME_CMD_GET_OPC(cmd->opc_fuse)) {
+		case NVME_OPC_DELETE_IO_SQ:
+			DPRINTF(("%s command DELETE_IO_SQ\r\n", __func__));
+			do_intr |= nvme_opc_delete_io_sq(sc, cmd, &compl);
+			break;
+		case NVME_OPC_CREATE_IO_SQ:
+			DPRINTF(("%s command CREATE_IO_SQ\r\n", __func__));
+			do_intr |= nvme_opc_create_io_sq(sc, cmd, &compl);
+			break;
+		case NVME_OPC_DELETE_IO_CQ:
+			DPRINTF(("%s command DELETE_IO_CQ\r\n", __func__));
+			do_intr |= nvme_opc_delete_io_cq(sc, cmd, &compl);
+			break;
+		case NVME_OPC_CREATE_IO_CQ:
+			DPRINTF(("%s command CREATE_IO_CQ\r\n", __func__));
+			do_intr |= nvme_opc_create_io_cq(sc, cmd, &compl);
+			break;
+		case NVME_OPC_GET_LOG_PAGE:
+			DPRINTF(("%s command GET_LOG_PAGE\r\n", __func__));
+			do_intr |= nvme_opc_get_log_page(sc, cmd, &compl);
+			break;
+		case NVME_OPC_IDENTIFY:
+			DPRINTF(("%s command IDENTIFY\r\n", __func__));
+			do_intr |= nvme_opc_identify(sc, cmd, &compl);
+			break;
+		case NVME_OPC_ABORT:
+			DPRINTF(("%s command ABORT\r\n", __func__));
+			do_intr |= nvme_opc_abort(sc, cmd, &compl);
+			break;

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
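
The admin handler above consumes submission queue entries from head until it reaches the tail written by the guest; the listing is cut off before the loop ends. The sketch below illustrates the general head/tail pattern, assuming the head advances modulo the queue size (the usual NVMe ring convention). struct ring and ring_drain() are hypothetical names, and the atomics and guest-memory mapping used by the real code are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ring {
	uint32_t	size;	/* number of slots in the queue */
	uint16_t	head;	/* consumer (controller) progress */
	uint16_t	tail;	/* producer (guest) progress */
};

/*
 * Hand every slot between head and tail to the optional callback,
 * wrapping at the end of the ring. Returns the number of slots consumed.
 */
static unsigned
ring_drain(struct ring *q, void (*handle)(uint16_t slot, void *arg),
    void *arg)
{
	unsigned n = 0;

	while (q->head != q->tail) {
		if (handle != NULL)
			handle(q->head, arg);
		q->head = (uint16_t)((q->head + 1) % q->size);
		n++;
	}
	return (n);
}
```

The ring is empty when head equals tail, so a queue of N slots holds at most N - 1 outstanding entries.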