Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 10 Apr 2022 14:27:07 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Ravi Pokala <rpokala@freebsd.org>
Cc:        Warner Losh <imp@freebsd.org>, src-committers <src-committers@freebsd.org>,  "<dev-commits-src-all@freebsd.org>" <dev-commits-src-all@freebsd.org>, dev-commits-src-main@freebsd.org
Subject:   Re: 214df80a9cb3 - main - nvme: new define for size of host memory buffer sizes
Message-ID:  <CANCZdfrJQXM6U0M_Kxhgt8hdr7-1aQj_o4444SmEhy8QmhkFWA@mail.gmail.com>
In-Reply-To: <CANCZdfq9t=3MUA03H3ANE_VU=BpWJqZgC8hgZZh0Q8FtfPm0Lw@mail.gmail.com>
References:  <202204090506.239567Ag038413@gitrepo.freebsd.org> <3BC728DB-39BB-43BC-BECE-720FECB5B20D@panasas.com> <CANCZdfq9t=3MUA03H3ANE_VU=BpWJqZgC8hgZZh0Q8FtfPm0Lw@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
--000000000000785a2f05dc52a7de
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, Apr 9, 2022 at 5:15 PM Warner Losh <imp@bsdimp.com> wrote:

>
>
> On Sat, Apr 9, 2022 at 4:50 PM Ravi Pokala <rpokala@freebsd.org> wrote:
>
>> -----Original Message-----
>> From: <owner-src-committers@freebsd.org> on behalf of Warner Losh
>> <imp@FreeBSD.org>
>> Date: 2022-04-08, Friday at 22:06
>> To: <src-committers@FreeBSD.org>, <dev-commits-src-all@FreeBSD.org>,
>> <dev-commits-src-main@FreeBSD.org>
>> Subject: git: 214df80a9cb3 - main - nvme: new define for size of host
>> memory buffer sizes
>>
>>     The branch main has been updated by imp:
>>
>>     URL:
>> https://cgit.FreeBSD.org/src/commit/?id=3D214df80a9cb3e95a140b13af7d19de=
ec2bbfae76
>>
>>     commit 214df80a9cb3e95a140b13af7d19deec2bbfae76
>>     Author:     Warner Losh <imp@FreeBSD.org>
>>     AuthorDate: 2022-04-09 05:01:06 +0000
>>     Commit:     Warner Losh <imp@FreeBSD.org>
>>     CommitDate: 2022-04-09 05:05:25 +0000
>>
>>         nvme: new define for size of host memory buffer sizes
>>
>>         The nvme spec defines the various fields that specify sizes for
>> host
>>         memory buffers in terms of 4096 chunks. So, rather than use a
>> bare 4096
>>         here, use NVME_HMB_UNITS. This is explicitly not the host page
>> size of
>>         4096, nor the default memory page size (mps) of the NVMe drive,
>> but its
>>         own thing and needs its own define.
>>
>> Hi Warner,
>>
>> Are you sure about that?
>>
>
> NVMe-1.4, Figure 297: Host Memory Buffer =E2=80=93 Host Memory Buffer Des=
criptor
>> Entry
>>
>> | Buffer Size (BSIZE): Indicates the number of contiguous
>> | memory page size (CC.MPS) units for this descriptor.
>> |
>> | Buffer Address (BADD): Indicates the host memory address for
>> | this descriptor aligned to the memory page size (CC.MPS).
>> | The lower bits (n:0) of this field indicate the offset
>> | within the memory page is 0h (e.g., if the memory page size
>> | is 4 KiB, then bits 11:00 shall be 0h; if the memory page
>> | size is 8 KiB, then bits 12:00 shall be 0h).
>>
>> They both reference mps, not 4096 bytes.
>>
>
> So, some I'm 100% sure about. There's one that was previously incorrectly
> hard wired to
> 4k. More below.
>
> From Table 275 of Rev 2.0:
>
> Host Memory Buffer Preferred Size (HMPRE): This field indicates
> the preferred size that the host is requested to allocate for the
> Host Memory Buffer feature in 4 KiB units. This value shall be
> greater than or equal to the Host Memory Buffer Minimum Size.
> If this field is non-zero, then the Host Memory Buffer feature is
> supported. If this field is cleared to 0h, then the Host Memory
> Buffer feature is not supported.
>
> Host Memory Buffer Minimum Size (HMMIN): This field indicates
> the minimum size that the host is requested to allocate for the
> Host Memory Buffer feature in 4 KiB units. If this field is cleared
> to 0h, then the host is requested to allocate any amount of host
> memory possible up to the HMPRE value.
>
> Host Memory Buffer Minimum Descriptor Entry Size (HMMINDS):
> This field indicates the minimum usable size of a Host Memory
> Buffer Descriptor Entry in 4 KiB units. If this field is cleared to 0h,
> then the controller does not indicate any limitations on the Host
> Memory Buffer Descriptor Entry size.
>
> These are the hmmin, hmminds and hmpre fields of 'cdata' in the
> driver.
>
>     diff --git a/sys/dev/nvme/nvme_ctrlr.c b/sys/dev/nvme/nvme_ctrlr.c
>>     index 95a2b5c4285d..6996b3151b0d 100644
>>     --- a/sys/dev/nvme/nvme_ctrlr.c
>>     +++ b/sys/dev/nvme/nvme_ctrlr.c
>>     @@ -936,11 +936,11 @@ nvme_ctrlr_hmb_alloc(struct nvme_controller
>> *ctrlr)
>>         max =3D (uint64_t)physmem * PAGE_SIZE / 20;
>>         TUNABLE_UINT64_FETCH("hw.nvme.hmb_max", &max);
>>
>
> max is a percent of available memory...
>
>
>>     -   min =3D (long long unsigned)ctrlr->cdata.hmmin * 4096;
>>     +   min =3D (long long unsigned)ctrlr->cdata.hmmin * NVME_HMB_UNITS;
>>         if (max =3D=3D 0 || max < min)
>>                 return;
>>     -   pref =3D MIN((long long unsigned)ctrlr->cdata.hmpre * 4096, max)=
;
>>     -   minc =3D MAX(ctrlr->cdata.hmminds * 4096, PAGE_SIZE);
>>     +   pref =3D MIN((long long unsigned)ctrlr->cdata.hmpre *
>> NVME_HMB_UNITS, max);
>>     +   minc =3D MAX(ctrlr->cdata.hmminds * NVME_HMB_UNITS, PAGE_SIZE);
>>
>
> So for all of these, we're good. They are in 4kiB chunks.
>
>
>>         if (min > 0 && ctrlr->cdata.hmmaxd > 0)
>>                 minc =3D MAX(minc, min / ctrlr->cdata.hmmaxd);
>>         ctrlr->hmb_chunk =3D pref;
>>     @@ -1023,7 +1023,7 @@ again:
>>         for (i =3D 0; i < ctrlr->hmb_nchunks; i++) {
>>                 ctrlr->hmb_desc_vaddr[i].addr =3D
>>                     htole64(ctrlr->hmb_chunks[i].hmbc_paddr);
>>     -           ctrlr->hmb_desc_vaddr[i].size =3D htole32(ctrlr->hmb_chu=
nk
>> / 4096);
>>     +           ctrlr->hmb_desc_vaddr[i].size =3D htole32(ctrlr->hmb_chu=
nk
>> / NVME_HMB_UNITS);
>>
>
> This one, you are correct is wrong. I'll fix it when I bring in the
> changes to fully support
> MPS !=3D 0. For MPS =3D=3D 0, which are the only drives this driver suppo=
rts
> correctly, this
> is a nop. It was wrong before, though. The text you quoted is correct
> about this. There
> are a couple of PAGE_SIZEs tinbetween these two chunks hat should be
> ctrlr->min_page_size instead (since that's the page size we're really
> using, not the
> host's). But PAGE_SIZE !=3D 4k doesn't currently work with the nvme drive=
r
> due to
> confusion like this.
>

https://reviews.freebsd.org/D34871 should address that.

https://reviews.freebsd.org/D34865 through
https://reviews.freebsd.org/D34873 are my current
series. With them applied, I'm able to boot with 16k pages on a ZFS-based
nvme arm64 system.
I believe it would also allow us to use different drive page sizes too, but
I haven't tried to figure
that out :)...

Warner


> We also currently set the MPS field in the CC incorrectly when it isn't 0=
.
>
> Good catch. I'll update my pending changes.
>
> Warner
>
>
>>         }
>>         bus_dmamap_sync(ctrlr->hmb_desc_tag, ctrlr->hmb_desc_map,
>>             BUS_DMASYNC_PREWRITE);
>>
>>
>>

--000000000000785a2f05dc52a7de
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote">=
<div dir=3D"ltr" class=3D"gmail_attr">On Sat, Apr 9, 2022 at 5:15 PM Warner=
 Losh &lt;<a href=3D"mailto:imp@bsdimp.com">imp@bsdimp.com</a>&gt; wrote:<b=
r></div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex=
;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir=3D"ltr">=
<div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote"><div dir=3D"ltr" =
class=3D"gmail_attr">On Sat, Apr 9, 2022 at 4:50 PM Ravi Pokala &lt;<a href=
=3D"mailto:rpokala@freebsd.org" target=3D"_blank">rpokala@freebsd.org</a>&g=
t; wrote:<br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0p=
x 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">-----O=
riginal Message-----<br>
From: &lt;<a href=3D"mailto:owner-src-committers@freebsd.org" target=3D"_bl=
ank">owner-src-committers@freebsd.org</a>&gt; on behalf of Warner Losh &lt;=
imp@FreeBSD.org&gt;<br>
Date: 2022-04-08, Friday at 22:06<br>
To: &lt;src-committers@FreeBSD.org&gt;, &lt;dev-commits-src-all@FreeBSD.org=
&gt;, &lt;dev-commits-src-main@FreeBSD.org&gt;<br>
Subject: git: 214df80a9cb3 - main - nvme: new define for size of host memor=
y buffer sizes<br>
<br>
=C2=A0 =C2=A0 The branch main has been updated by imp:<br>
<br>
=C2=A0 =C2=A0 URL: <a href=3D"https://cgit.FreeBSD.org/src/commit/?id=3D214=
df80a9cb3e95a140b13af7d19deec2bbfae76" rel=3D"noreferrer" target=3D"_blank"=
>https://cgit.FreeBSD.org/src/commit/?id=3D214df80a9cb3e95a140b13af7d19deec=
2bbfae76</a><br>
<br>
=C2=A0 =C2=A0 commit 214df80a9cb3e95a140b13af7d19deec2bbfae76<br>
=C2=A0 =C2=A0 Author:=C2=A0 =C2=A0 =C2=A0Warner Losh &lt;imp@FreeBSD.org&gt=
;<br>
=C2=A0 =C2=A0 AuthorDate: 2022-04-09 05:01:06 +0000<br>
=C2=A0 =C2=A0 Commit:=C2=A0 =C2=A0 =C2=A0Warner Losh &lt;imp@FreeBSD.org&gt=
;<br>
=C2=A0 =C2=A0 CommitDate: 2022-04-09 05:05:25 +0000<br>
<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 nvme: new define for size of host memory buffer=
 sizes<br>
<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 The nvme spec defines the various fields that s=
pecify sizes for host<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 memory buffers in terms of 4096 chunks. So, rat=
her than use a bare 4096<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 here, use NVME_HMB_UNITS. This is explicitly no=
t the host page size of<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 4096, nor the default memory page size (mps) of=
 the NVMe drive, but its<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 own thing and needs its own define.<br>
<br>
Hi Warner,<br>
<br>
Are you sure about that?<br></blockquote><div><br></div><blockquote class=
=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rg=
b(204,204,204);padding-left:1ex">NVMe-1.4, Figure 297: Host Memory Buffer =
=E2=80=93 Host Memory Buffer Descriptor Entry<br>
<br>
| Buffer Size (BSIZE): Indicates the number of contiguous<br>
| memory page size (CC.MPS) units for this descriptor.<br>
| <br>
| Buffer Address (BADD): Indicates the host memory address for<br>
| this descriptor aligned to the memory page size (CC.MPS). <br>
| The lower bits (n:0) of this field indicate the offset<br>
| within the memory page is 0h (e.g., if the memory page size<br>
| is 4 KiB, then bits 11:00 shall be 0h; if the memory page<br>
| size is 8 KiB, then bits 12:00 shall be 0h).<br>
<br>
They both reference mps, not 4096 bytes.<br></blockquote><div><br></div><di=
v>So, some I&#39;m 100% sure about. There&#39;s one that was previously inc=
orrectly hard wired to</div><div>4k. More below.</div><div><br></div><div>F=
rom Table 275 of Rev 2.0:</div><div><br></div><div>Host Memory Buffer Prefe=
rred Size (HMPRE): This field indicates</div><div>the preferred size that t=
he host is requested to allocate for the</div><div>Host Memory Buffer featu=
re in 4 KiB units. This value shall be</div><div>greater than or equal to t=
he Host Memory Buffer Minimum Size.</div><div>If this field is non-zero, th=
en the Host Memory Buffer feature is</div><div>supported. If this field is =
cleared to 0h, then the Host Memory</div><div>Buffer feature is not support=
ed.<br></div><div><br></div><div>Host Memory Buffer Minimum Size (HMMIN): T=
his field indicates</div><div>the minimum size that the host is requested t=
o allocate for the</div><div>Host Memory Buffer feature in 4 KiB units. If =
this field is cleared</div><div>to 0h, then the host is requested to alloca=
te any amount of host</div><div>memory possible up to the HMPRE value.<br><=
/div><div><br></div><div>Host Memory Buffer Minimum Descriptor Entry Size (=
HMMINDS):</div><div>This field indicates the minimum usable size of a Host =
Memory</div><div>Buffer Descriptor Entry in 4 KiB units. If this field is c=
leared to 0h,</div><div>then the controller does not indicate any limitatio=
ns on the Host</div><div>Memory Buffer Descriptor Entry size.<br></div><div=
><br></div><div>These are the hmmin, hmminds and hmpre fields of &#39;cdata=
&#39; in the</div><div>driver.</div><div><br></div><blockquote class=3D"gma=
il_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,2=
04,204);padding-left:1ex">=C2=A0 =C2=A0 diff --git a/sys/dev/nvme/nvme_ctrl=
r.c b/sys/dev/nvme/nvme_ctrlr.c<br>
=C2=A0 =C2=A0 index 95a2b5c4285d..6996b3151b0d 100644<br>
=C2=A0 =C2=A0 --- a/sys/dev/nvme/nvme_ctrlr.c<br>
=C2=A0 =C2=A0 +++ b/sys/dev/nvme/nvme_ctrlr.c<br>
=C2=A0 =C2=A0 @@ -936,11 +936,11 @@ nvme_ctrlr_hmb_alloc(struct nvme_contro=
ller *ctrlr)<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 max =3D (uint64_t)physmem * PAGE_SIZE / 20;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 TUNABLE_UINT64_FETCH(&quot;hw.nvme.hmb_max&quot=
;, &amp;max);<br></blockquote><div><br></div><div>max is a percent of avail=
able memory...</div><div>=C2=A0</div><blockquote class=3D"gmail_quote" styl=
e=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);paddin=
g-left:1ex">
=C2=A0 =C2=A0 -=C2=A0 =C2=A0min =3D (long long unsigned)ctrlr-&gt;cdata.hmm=
in * 4096;<br>
=C2=A0 =C2=A0 +=C2=A0 =C2=A0min =3D (long long unsigned)ctrlr-&gt;cdata.hmm=
in * NVME_HMB_UNITS;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 if (max =3D=3D 0 || max &lt; min)<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return;<br>
=C2=A0 =C2=A0 -=C2=A0 =C2=A0pref =3D MIN((long long unsigned)ctrlr-&gt;cdat=
a.hmpre * 4096, max);<br>
=C2=A0 =C2=A0 -=C2=A0 =C2=A0minc =3D MAX(ctrlr-&gt;cdata.hmminds * 4096, PA=
GE_SIZE);<br>
=C2=A0 =C2=A0 +=C2=A0 =C2=A0pref =3D MIN((long long unsigned)ctrlr-&gt;cdat=
a.hmpre * NVME_HMB_UNITS, max);<br>
=C2=A0 =C2=A0 +=C2=A0 =C2=A0minc =3D MAX(ctrlr-&gt;cdata.hmminds * NVME_HMB=
_UNITS, PAGE_SIZE);<br></blockquote><div><br></div><div>So for all of these=
, we&#39;re good. They are in 4kiB chunks.</div><div>=C2=A0</div><blockquot=
e class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px s=
olid rgb(204,204,204);padding-left:1ex">
=C2=A0 =C2=A0 =C2=A0 =C2=A0 if (min &gt; 0 &amp;&amp; ctrlr-&gt;cdata.hmmax=
d &gt; 0)<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 minc =3D MAX(minc, =
min / ctrlr-&gt;cdata.hmmaxd);<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ctrlr-&gt;hmb_chunk =3D pref;<br>
=C2=A0 =C2=A0 @@ -1023,7 +1023,7 @@ again:<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 for (i =3D 0; i &lt; ctrlr-&gt;hmb_nchunks; i++=
) {<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ctrlr-&gt;hmb_desc_=
vaddr[i].addr =3D<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 htole=
64(ctrlr-&gt;hmb_chunks[i].hmbc_paddr);<br>
=C2=A0 =C2=A0 -=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctrlr-&gt;hmb_desc_=
vaddr[i].size =3D htole32(ctrlr-&gt;hmb_chunk / 4096);<br>
=C2=A0 =C2=A0 +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctrlr-&gt;hmb_desc_=
vaddr[i].size =3D htole32(ctrlr-&gt;hmb_chunk / NVME_HMB_UNITS);<br></block=
quote><div><br></div><div>This one, you are correct is wrong. I&#39;ll fix =
it when I bring in the changes to fully support</div><div>MPS !=3D 0. For M=
PS =3D=3D 0, which are the only drives this driver supports correctly, this=
</div><div>is a nop. It was wrong before, though. The text you quoted=C2=A0=
is correct about this. There</div><div>are a couple of PAGE_SIZEs tinbetwee=
n=C2=A0these two chunks hat should be</div><div>ctrlr-&gt;min_page_size ins=
tead (since that&#39;s the page size we&#39;re really using, not the</div><=
div>host&#39;s). But PAGE_SIZE !=3D 4k doesn&#39;t currently work with the =
nvme driver due to</div><div>confusion like this.</div></div></div></blockq=
uote><div><br></div><div><a href=3D"https://reviews.freebsd.org/D34871">htt=
ps://reviews.freebsd.org/D34871</a> should address that.<br></div><div><br>=
</div><div><a href=3D"https://reviews.freebsd.org/D34865">https://reviews.f=
reebsd.org/D34865</a> through=C2=A0<a href=3D"https://reviews.freebsd.org/D=
34873">https://reviews.freebsd.org/D34873</a>; are my current<br></div><div>=
series. With them applied, I&#39;m able to boot with 16k pages on a ZFS-bas=
ed nvme arm64 system.</div><div>I believe it would also allow us to use dif=
ferent drive page sizes too, but I haven&#39;t tried to figure</div><div>th=
at out :)...</div><div><br></div><div>Warner</div><div>=C2=A0</div><blockqu=
ote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px=
 solid rgb(204,204,204);padding-left:1ex"><div dir=3D"ltr"><div class=3D"gm=
ail_quote"><div>We also currently set the MPS field in the CC incorrectly w=
hen it isn&#39;t 0.</div><div><br></div><div>Good catch. I&#39;ll update my=
 pending changes.</div><div><br></div><div>Warner</div><div>=C2=A0</div><bl=
ockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-lef=
t:1px solid rgb(204,204,204);padding-left:1ex">
=C2=A0 =C2=A0 =C2=A0 =C2=A0 }<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 bus_dmamap_sync(ctrlr-&gt;hmb_desc_tag, ctrlr-&=
gt;hmb_desc_map,<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 BUS_DMASYNC_PREWRITE);<br>
<br>
<br>
</blockquote></div></div>
</blockquote></div></div>

--000000000000785a2f05dc52a7de--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfrJQXM6U0M_Kxhgt8hdr7-1aQj_o4444SmEhy8QmhkFWA>