Date: Sun, 10 Apr 2022 14:27:07 -0600 From: Warner Losh <imp@bsdimp.com> To: Ravi Pokala <rpokala@freebsd.org> Cc: Warner Losh <imp@freebsd.org>, src-committers <src-committers@freebsd.org>, "<dev-commits-src-all@freebsd.org>" <dev-commits-src-all@freebsd.org>, dev-commits-src-main@freebsd.org Subject: Re: 214df80a9cb3 - main - nvme: new define for size of host memory buffer sizes Message-ID: <CANCZdfrJQXM6U0M_Kxhgt8hdr7-1aQj_o4444SmEhy8QmhkFWA@mail.gmail.com> In-Reply-To: <CANCZdfq9t=3MUA03H3ANE_VU=BpWJqZgC8hgZZh0Q8FtfPm0Lw@mail.gmail.com> References: <202204090506.239567Ag038413@gitrepo.freebsd.org> <3BC728DB-39BB-43BC-BECE-720FECB5B20D@panasas.com> <CANCZdfq9t=3MUA03H3ANE_VU=BpWJqZgC8hgZZh0Q8FtfPm0Lw@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
--000000000000785a2f05dc52a7de Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sat, Apr 9, 2022 at 5:15 PM Warner Losh <imp@bsdimp.com> wrote: > > > On Sat, Apr 9, 2022 at 4:50 PM Ravi Pokala <rpokala@freebsd.org> wrote: > >> -----Original Message----- >> From: <owner-src-committers@freebsd.org> on behalf of Warner Losh >> <imp@FreeBSD.org> >> Date: 2022-04-08, Friday at 22:06 >> To: <src-committers@FreeBSD.org>, <dev-commits-src-all@FreeBSD.org>, >> <dev-commits-src-main@FreeBSD.org> >> Subject: git: 214df80a9cb3 - main - nvme: new define for size of host >> memory buffer sizes >> >> The branch main has been updated by imp: >> >> URL: >> https://cgit.FreeBSD.org/src/commit/?id=3D214df80a9cb3e95a140b13af7d19de= ec2bbfae76 >> >> commit 214df80a9cb3e95a140b13af7d19deec2bbfae76 >> Author: Warner Losh <imp@FreeBSD.org> >> AuthorDate: 2022-04-09 05:01:06 +0000 >> Commit: Warner Losh <imp@FreeBSD.org> >> CommitDate: 2022-04-09 05:05:25 +0000 >> >> nvme: new define for size of host memory buffer sizes >> >> The nvme spec defines the various fields that specify sizes for >> host >> memory buffers in terms of 4096 chunks. So, rather than use a >> bare 4096 >> here, use NVME_HMB_UNITS. This is explicitly not the host page >> size of >> 4096, nor the default memory page size (mps) of the NVMe drive, >> but its >> own thing and needs its own define. >> >> Hi Warner, >> >> Are you sure about that? >> > > NVMe-1.4, Figure 297: Host Memory Buffer =E2=80=93 Host Memory Buffer Des= criptor >> Entry >> >> | Buffer Size (BSIZE): Indicates the number of contiguous >> | memory page size (CC.MPS) units for this descriptor. >> | >> | Buffer Address (BADD): Indicates the host memory address for >> | this descriptor aligned to the memory page size (CC.MPS). >> | The lower bits (n:0) of this field indicate the offset >> | within the memory page is 0h (e.g., if the memory page size >> | is 4 KiB, then bits 11:00 shall be 0h; if the memory page >> | size is 8 KiB, then bits 12:00 shall be 0h). >> >> They both reference mps, not 4096 bytes. >> > > So, some I'm 100% sure about. There's one that was previously incorrectly > hard wired to > 4k. More below. > > From Table 275 of Rev 2.0: > > Host Memory Buffer Preferred Size (HMPRE): This field indicates > the preferred size that the host is requested to allocate for the > Host Memory Buffer feature in 4 KiB units. This value shall be > greater than or equal to the Host Memory Buffer Minimum Size. > If this field is non-zero, then the Host Memory Buffer feature is > supported. If this field is cleared to 0h, then the Host Memory > Buffer feature is not supported. > > Host Memory Buffer Minimum Size (HMMIN): This field indicates > the minimum size that the host is requested to allocate for the > Host Memory Buffer feature in 4 KiB units. If this field is cleared > to 0h, then the host is requested to allocate any amount of host > memory possible up to the HMPRE value. > > Host Memory Buffer Minimum Descriptor Entry Size (HMMINDS): > This field indicates the minimum usable size of a Host Memory > Buffer Descriptor Entry in 4 KiB units. If this field is cleared to 0h, > then the controller does not indicate any limitations on the Host > Memory Buffer Descriptor Entry size. > > These are the hmmin, hmminds and hmpre fields of 'cdata' in the > driver. > > diff --git a/sys/dev/nvme/nvme_ctrlr.c b/sys/dev/nvme/nvme_ctrlr.c >> index 95a2b5c4285d..6996b3151b0d 100644 >> --- a/sys/dev/nvme/nvme_ctrlr.c >> +++ b/sys/dev/nvme/nvme_ctrlr.c >> @@ -936,11 +936,11 @@ nvme_ctrlr_hmb_alloc(struct nvme_controller >> *ctrlr) >> max =3D (uint64_t)physmem * PAGE_SIZE / 20; >> TUNABLE_UINT64_FETCH("hw.nvme.hmb_max", &max); >> > > max is a percent of available memory... > > >> - min =3D (long long unsigned)ctrlr->cdata.hmmin * 4096; >> + min =3D (long long unsigned)ctrlr->cdata.hmmin * NVME_HMB_UNITS; >> if (max =3D=3D 0 || max < min) >> return; >> - pref =3D MIN((long long unsigned)ctrlr->cdata.hmpre * 4096, max)= ; >> - minc =3D MAX(ctrlr->cdata.hmminds * 4096, PAGE_SIZE); >> + pref =3D MIN((long long unsigned)ctrlr->cdata.hmpre * >> NVME_HMB_UNITS, max); >> + minc =3D MAX(ctrlr->cdata.hmminds * NVME_HMB_UNITS, PAGE_SIZE); >> > > So for all of these, we're good. They are in 4kiB chunks. > > >> if (min > 0 && ctrlr->cdata.hmmaxd > 0) >> minc =3D MAX(minc, min / ctrlr->cdata.hmmaxd); >> ctrlr->hmb_chunk =3D pref; >> @@ -1023,7 +1023,7 @@ again: >> for (i =3D 0; i < ctrlr->hmb_nchunks; i++) { >> ctrlr->hmb_desc_vaddr[i].addr =3D >> htole64(ctrlr->hmb_chunks[i].hmbc_paddr); >> - ctrlr->hmb_desc_vaddr[i].size =3D htole32(ctrlr->hmb_chu= nk >> / 4096); >> + ctrlr->hmb_desc_vaddr[i].size =3D htole32(ctrlr->hmb_chu= nk >> / NVME_HMB_UNITS); >> > > This one, you are correct is wrong. I'll fix it when I bring in the > changes to fully support > MPS !=3D 0. For MPS =3D=3D 0, which are the only drives this driver suppo= rts > correctly, this > is a nop. It was wrong before, though. The text you quoted is correct > about this. There > are a couple of PAGE_SIZEs tinbetween these two chunks hat should be > ctrlr->min_page_size instead (since that's the page size we're really > using, not the > host's). But PAGE_SIZE !=3D 4k doesn't currently work with the nvme drive= r > due to > confusion like this. > https://reviews.freebsd.org/D34871 should address that. https://reviews.freebsd.org/D34865 through https://reviews.freebsd.org/D34873 are my current series. With them applied, I'm able to boot with 16k pages on a ZFS-based nvme arm64 system. I believe it would also allow us to use different drive page sizes too, but I haven't tried to figure that out :)... Warner > We also currently set the MPS field in the CC incorrectly when it isn't 0= . > > Good catch. I'll update my pending changes. > > Warner > > >> } >> bus_dmamap_sync(ctrlr->hmb_desc_tag, ctrlr->hmb_desc_map, >> BUS_DMASYNC_PREWRITE); >> >> >> --000000000000785a2f05dc52a7de Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote">= <div dir=3D"ltr" class=3D"gmail_attr">On Sat, Apr 9, 2022 at 5:15 PM Warner= Losh <<a href=3D"mailto:imp@bsdimp.com">imp@bsdimp.com</a>> wrote:<b= r></div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex= ;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir=3D"ltr">= <div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote"><div dir=3D"ltr" = class=3D"gmail_attr">On Sat, Apr 9, 2022 at 4:50 PM Ravi Pokala <<a href= =3D"mailto:rpokala@freebsd.org" target=3D"_blank">rpokala@freebsd.org</a>&g= t; wrote:<br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0p= x 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">-----O= riginal Message-----<br> From: <<a href=3D"mailto:owner-src-committers@freebsd.org" target=3D"_bl= ank">owner-src-committers@freebsd.org</a>> on behalf of Warner Losh <= imp@FreeBSD.org><br> Date: 2022-04-08, Friday at 22:06<br> To: <src-committers@FreeBSD.org>, <dev-commits-src-all@FreeBSD.org= >, <dev-commits-src-main@FreeBSD.org><br> Subject: git: 214df80a9cb3 - main - nvme: new define for size of host memor= y buffer sizes<br> <br> =C2=A0 =C2=A0 The branch main has been updated by imp:<br> <br> =C2=A0 =C2=A0 URL: <a href=3D"https://cgit.FreeBSD.org/src/commit/?id=3D214= df80a9cb3e95a140b13af7d19deec2bbfae76" rel=3D"noreferrer" target=3D"_blank"= >https://cgit.FreeBSD.org/src/commit/?id=3D214df80a9cb3e95a140b13af7d19deec= 2bbfae76</a><br> <br> =C2=A0 =C2=A0 commit 214df80a9cb3e95a140b13af7d19deec2bbfae76<br> =C2=A0 =C2=A0 Author:=C2=A0 =C2=A0 =C2=A0Warner Losh <imp@FreeBSD.org>= ;<br> =C2=A0 =C2=A0 AuthorDate: 2022-04-09 05:01:06 +0000<br> =C2=A0 =C2=A0 Commit:=C2=A0 =C2=A0 =C2=A0Warner Losh <imp@FreeBSD.org>= ;<br> =C2=A0 =C2=A0 CommitDate: 2022-04-09 05:05:25 +0000<br> <br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 nvme: new define for size of host memory buffer= sizes<br> <br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 The nvme spec defines the various fields that s= pecify sizes for host<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 memory buffers in terms of 4096 chunks. So, rat= her than use a bare 4096<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 here, use NVME_HMB_UNITS. This is explicitly no= t the host page size of<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 4096, nor the default memory page size (mps) of= the NVMe drive, but its<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 own thing and needs its own define.<br> <br> Hi Warner,<br> <br> Are you sure about that?<br></blockquote><div><br></div><blockquote class= =3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rg= b(204,204,204);padding-left:1ex">NVMe-1.4, Figure 297: Host Memory Buffer = =E2=80=93 Host Memory Buffer Descriptor Entry<br> <br> | Buffer Size (BSIZE): Indicates the number of contiguous<br> | memory page size (CC.MPS) units for this descriptor.<br> | <br> | Buffer Address (BADD): Indicates the host memory address for<br> | this descriptor aligned to the memory page size (CC.MPS). <br> | The lower bits (n:0) of this field indicate the offset<br> | within the memory page is 0h (e.g., if the memory page size<br> | is 4 KiB, then bits 11:00 shall be 0h; if the memory page<br> | size is 8 KiB, then bits 12:00 shall be 0h).<br> <br> They both reference mps, not 4096 bytes.<br></blockquote><div><br></div><di= v>So, some I'm 100% sure about. There's one that was previously inc= orrectly hard wired to</div><div>4k. More below.</div><div><br></div><div>F= rom Table 275 of Rev 2.0:</div><div><br></div><div>Host Memory Buffer Prefe= rred Size (HMPRE): This field indicates</div><div>the preferred size that t= he host is requested to allocate for the</div><div>Host Memory Buffer featu= re in 4 KiB units. This value shall be</div><div>greater than or equal to t= he Host Memory Buffer Minimum Size.</div><div>If this field is non-zero, th= en the Host Memory Buffer feature is</div><div>supported. If this field is = cleared to 0h, then the Host Memory</div><div>Buffer feature is not support= ed.<br></div><div><br></div><div>Host Memory Buffer Minimum Size (HMMIN): T= his field indicates</div><div>the minimum size that the host is requested t= o allocate for the</div><div>Host Memory Buffer feature in 4 KiB units. If = this field is cleared</div><div>to 0h, then the host is requested to alloca= te any amount of host</div><div>memory possible up to the HMPRE value.<br><= /div><div><br></div><div>Host Memory Buffer Minimum Descriptor Entry Size (= HMMINDS):</div><div>This field indicates the minimum usable size of a Host = Memory</div><div>Buffer Descriptor Entry in 4 KiB units. If this field is c= leared to 0h,</div><div>then the controller does not indicate any limitatio= ns on the Host</div><div>Memory Buffer Descriptor Entry size.<br></div><div= ><br></div><div>These are the hmmin, hmminds and hmpre fields of 'cdata= ' in the</div><div>driver.</div><div><br></div><blockquote class=3D"gma= il_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,2= 04,204);padding-left:1ex">=C2=A0 =C2=A0 diff --git a/sys/dev/nvme/nvme_ctrl= r.c b/sys/dev/nvme/nvme_ctrlr.c<br> =C2=A0 =C2=A0 index 95a2b5c4285d..6996b3151b0d 100644<br> =C2=A0 =C2=A0 --- a/sys/dev/nvme/nvme_ctrlr.c<br> =C2=A0 =C2=A0 +++ b/sys/dev/nvme/nvme_ctrlr.c<br> =C2=A0 =C2=A0 @@ -936,11 +936,11 @@ nvme_ctrlr_hmb_alloc(struct nvme_contro= ller *ctrlr)<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 max =3D (uint64_t)physmem * PAGE_SIZE / 20;<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 TUNABLE_UINT64_FETCH("hw.nvme.hmb_max"= ;, &max);<br></blockquote><div><br></div><div>max is a percent of avail= able memory...</div><div>=C2=A0</div><blockquote class=3D"gmail_quote" styl= e=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);paddin= g-left:1ex"> =C2=A0 =C2=A0 -=C2=A0 =C2=A0min =3D (long long unsigned)ctrlr->cdata.hmm= in * 4096;<br> =C2=A0 =C2=A0 +=C2=A0 =C2=A0min =3D (long long unsigned)ctrlr->cdata.hmm= in * NVME_HMB_UNITS;<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 if (max =3D=3D 0 || max < min)<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return;<br> =C2=A0 =C2=A0 -=C2=A0 =C2=A0pref =3D MIN((long long unsigned)ctrlr->cdat= a.hmpre * 4096, max);<br> =C2=A0 =C2=A0 -=C2=A0 =C2=A0minc =3D MAX(ctrlr->cdata.hmminds * 4096, PA= GE_SIZE);<br> =C2=A0 =C2=A0 +=C2=A0 =C2=A0pref =3D MIN((long long unsigned)ctrlr->cdat= a.hmpre * NVME_HMB_UNITS, max);<br> =C2=A0 =C2=A0 +=C2=A0 =C2=A0minc =3D MAX(ctrlr->cdata.hmminds * NVME_HMB= _UNITS, PAGE_SIZE);<br></blockquote><div><br></div><div>So for all of these= , we're good. They are in 4kiB chunks.</div><div>=C2=A0</div><blockquot= e class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px s= olid rgb(204,204,204);padding-left:1ex"> =C2=A0 =C2=A0 =C2=A0 =C2=A0 if (min > 0 && ctrlr->cdata.hmmax= d > 0)<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 minc =3D MAX(minc, = min / ctrlr->cdata.hmmaxd);<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 ctrlr->hmb_chunk =3D pref;<br> =C2=A0 =C2=A0 @@ -1023,7 +1023,7 @@ again:<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 for (i =3D 0; i < ctrlr->hmb_nchunks; i++= ) {<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ctrlr->hmb_desc_= vaddr[i].addr =3D<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 htole= 64(ctrlr->hmb_chunks[i].hmbc_paddr);<br> =C2=A0 =C2=A0 -=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctrlr->hmb_desc_= vaddr[i].size =3D htole32(ctrlr->hmb_chunk / 4096);<br> =C2=A0 =C2=A0 +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctrlr->hmb_desc_= vaddr[i].size =3D htole32(ctrlr->hmb_chunk / NVME_HMB_UNITS);<br></block= quote><div><br></div><div>This one, you are correct is wrong. I'll fix = it when I bring in the changes to fully support</div><div>MPS !=3D 0. For M= PS =3D=3D 0, which are the only drives this driver supports correctly, this= </div><div>is a nop. It was wrong before, though. The text you quoted=C2=A0= is correct about this. There</div><div>are a couple of PAGE_SIZEs tinbetwee= n=C2=A0these two chunks hat should be</div><div>ctrlr->min_page_size ins= tead (since that's the page size we're really using, not the</div><= div>host's). But PAGE_SIZE !=3D 4k doesn't currently work with the = nvme driver due to</div><div>confusion like this.</div></div></div></blockq= uote><div><br></div><div><a href=3D"https://reviews.freebsd.org/D34871">htt= ps://reviews.freebsd.org/D34871</a> should address that.<br></div><div><br>= </div><div><a href=3D"https://reviews.freebsd.org/D34865">https://reviews.f= reebsd.org/D34865</a> through=C2=A0<a href=3D"https://reviews.freebsd.org/D= 34873">https://reviews.freebsd.org/D34873</a> are my current<br></div><div>= series. With them applied, I'm able to boot with 16k pages on a ZFS-bas= ed nvme arm64 system.</div><div>I believe it would also allow us to use dif= ferent drive page sizes too, but I haven't tried to figure</div><div>th= at out :)...</div><div><br></div><div>Warner</div><div>=C2=A0</div><blockqu= ote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px= solid rgb(204,204,204);padding-left:1ex"><div dir=3D"ltr"><div class=3D"gm= ail_quote"><div>We also currently set the MPS field in the CC incorrectly w= hen it isn't 0.</div><div><br></div><div>Good catch. I'll update my= pending changes.</div><div><br></div><div>Warner</div><div>=C2=A0</div><bl= ockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-lef= t:1px solid rgb(204,204,204);padding-left:1ex"> =C2=A0 =C2=A0 =C2=A0 =C2=A0 }<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 bus_dmamap_sync(ctrlr->hmb_desc_tag, ctrlr-&= gt;hmb_desc_map,<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 BUS_DMASYNC_PREWRITE);<br> <br> <br> </blockquote></div></div> </blockquote></div></div> --000000000000785a2f05dc52a7de--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfrJQXM6U0M_Kxhgt8hdr7-1aQj_o4444SmEhy8QmhkFWA>