Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 27 Dec 2024 15:03:45 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Zhenlei Huang <zlei@freebsd.org>
Cc:        Warner Losh <imp@freebsd.org>, stable@freebsd.org,  Edward Tomasz Napierala <trasz@freebsd.org>
Subject:   Re: MFC fixes for uninitialized kernel stack variables in sys/cam or do direct fix for pvscsi driver
Message-ID:  <CANCZdfpKGMFyPAVqxPcahZXgSL=tFBin3ratjoQTtJrDeM2NUg@mail.gmail.com>
In-Reply-To: <0DDE1B66-B794-472D-A901-54FA2FF1E853@FreeBSD.org>
References:  <0DDE1B66-B794-472D-A901-54FA2FF1E853@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
--000000000000a7bbda062a47a309
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Mon, Dec 2, 2024 at 2:41=E2=80=AFAM Zhenlei Huang <zlei@freebsd.org> wro=
te:

> Hi Warner,
>
> Recently I upgraded some ESXi vms from 13.3 to 13.4 and noticed weird
> report for sas speed.
> The boot console has the following,
>
> ```
> da0 at pvscsi0 bus 0 scbus2 target 0 lun 0
> da0: <VMware Virtual disk 2.0> Fixed Direct Access SPC-4 SCSI device
> da0: 4294967.295MB/s transfers
> ```
> But camcontrol report the correct value,
> ```
> # camcontrol inquiry da0 -R
> pass1: 750.000MB/s transfers, Command Queueing Enabled
> ```
>
> The `4294967.295MB` is actually 0xffff_ffff or -1 but I do not see any
> logic set those values.
>
> Finally I managed to get the stack trace,
> ```
> _scsi_announce_periph
> scsi_announce_periph_sbuf
> xpt_announce_periph_sbuf
> dadone_proberc
> xpt_done_process
> xpt_done_td
> fork_exit
> fork_trampoline
> ```
>
> and noticed that the last param `cts` of `_scsi_announce_periph(struct
> cam_periph *periph, u_int *speed, u_int *freq, struct ccb_trans_settings
> *cts)`
> is from kernel stack and is not properly initialized, latter I found some
> commits related to this,
>
> 076686fe0703 cam: make sure to clear CCBs allocated on the stack
> ec5325dbca62 cam: make sure to clear even more CCBs allocated on the stac=
k
> 0f206cc91279 cam: add missing zeroing of a stack-allocated CCB.
> 616a676a0535 cam: clear stack-allocated CCB in the target layer
>
> I applied them to stable/13, rebuild and reboot, now the speed of da0 is
> reported correctly. I also tried to patch the pvscsi driver with few line=
s
> and
> it also works as intended.
>
> ```
> --- a/sys/dev/vmware/pvscsi/pvscsi.c
> +++ b/sys/dev/vmware/pvscsi/pvscsi.c
> @@ -1444,6 +1444,10 @@ pvscsi_action(struct cam_sim *sim, union ccb *ccb)
>                 cts->proto_specific.scsi.flags =3D CTS_SCSI_FLAGS_TAG_ENB=
;
>                 cts->proto_specific.scsi.valid =3D CTS_SCSI_VALID_TQ;
>
> +               /* Prefer connection speed over sas port speed */
> +               /* cts->xport_specific.sas.bitrate =3D 0; */
> +               cts->xport_specific.sas.valid =3D 0;
> +
>                 ccb_h->status =3D CAM_REQ_CMP;
>                 xpt_done(ccb);
> ```
>
> Things come clear and I know why this weird speed happens, now it is time
> to decide how to fix it.
>
> Fixing the consumer of cam, aka pvscsi driver, is quite simple and
> promising. I did a quick search it appears other consumers set
> `cts->xport_specific.sas.valid` correctly. It does not convince me as I'm
> quite new to cam subsystem.
>

Yes. sas.valid is set when the sas.bitrate and other data has been set
correctly. Setting it to 0 like this ensures it's ignored.  So if you know
the speed, set sas.bitrate to that speed and sas.valid to 1.

I'm not sure I answered the question right, but if not, please clarify or
point out what I missed and I'll try again.


> Which one do you prefer, MFC commits to stable/13, or do direct fix for
> pvscsi driver to stable/13 ?
>

[[ Sorry for the delay, I missed this all month ]]

I generally prefer a MFC, unless the code is no longer in -current. Even if
there's two different fixes for this logical bug, fixing it in current,
then MFCing that (with the current hash) is fine, even if the stable/13
changes are completely different. For stable/13 I guess it matters a bit
less than stable/14 since I'll be merging to it less, but if it's a commit
from -current that doesn't need to be made to -stable because of the new
commit on stable, I tend to include the MFC hash text.

Warner

--000000000000a7bbda062a47a309
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote g=
mail_quote_container"><div dir=3D"ltr" class=3D"gmail_attr">On Mon, Dec 2, =
2024 at 2:41=E2=80=AFAM Zhenlei Huang &lt;<a href=3D"mailto:zlei@freebsd.or=
g">zlei@freebsd.org</a>&gt; wrote:<br></div><blockquote class=3D"gmail_quot=
e" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204)=
;padding-left:1ex">Hi Warner,<br>
<br>
Recently I upgraded some ESXi vms from 13.3 to 13.4 and noticed weird repor=
t for sas speed.<br>
The boot console has the following,<br>
<br>
```<br>
da0 at pvscsi0 bus 0 scbus2 target 0 lun 0<br>
da0: &lt;VMware Virtual disk 2.0&gt; Fixed Direct Access SPC-4 SCSI device<=
br>
da0: 4294967.295MB/s transfers<br>
```<br>
But camcontrol report the correct value,<br>
```<br>
# camcontrol inquiry da0 -R<br>
pass1: 750.000MB/s transfers, Command Queueing Enabled<br>
```<br>
<br>
The `4294967.295MB` is actually 0xffff_ffff or -1 but I do not see any logi=
c set those values.<br>
<br>
Finally I managed to get the stack trace,<br>
```<br>
_scsi_announce_periph<br>
scsi_announce_periph_sbuf<br>
xpt_announce_periph_sbuf<br>
dadone_proberc<br>
xpt_done_process<br>
xpt_done_td<br>
fork_exit<br>
fork_trampoline<br>
```<br>
<br>
and noticed that the last param `cts` of `_scsi_announce_periph(struct cam_=
periph *periph, u_int *speed, u_int *freq, struct ccb_trans_settings *cts)`=
<br>
is from kernel stack and is not properly initialized, latter I found some c=
ommits related to this,<br>
<br>
076686fe0703 cam: make sure to clear CCBs allocated on the stack<br>
ec5325dbca62 cam: make sure to clear even more CCBs allocated on the stack<=
br>
0f206cc91279 cam: add missing zeroing of a stack-allocated CCB.<br>
616a676a0535 cam: clear stack-allocated CCB in the target layer<br>
<br>
I applied them to stable/13, rebuild and reboot, now the speed of da0 is re=
ported correctly. I also tried to patch the pvscsi driver with few lines an=
d<br>
it also works as intended.<br>
<br>
```<br>
--- a/sys/dev/vmware/pvscsi/pvscsi.c<br>
+++ b/sys/dev/vmware/pvscsi/pvscsi.c<br>
@@ -1444,6 +1444,10 @@ pvscsi_action(struct cam_sim *sim, union ccb *ccb)<b=
r>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cts-&gt;proto_speci=
fic.scsi.flags =3D CTS_SCSI_FLAGS_TAG_ENB;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cts-&gt;proto_speci=
fic.scsi.valid =3D CTS_SCSI_VALID_TQ;<br>
<br>
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* Prefer connectio=
n speed over sas port speed */<br>
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* cts-&gt;xport_sp=
ecific.sas.bitrate =3D 0; */<br>
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0cts-&gt;xport_speci=
fic.sas.valid =3D 0;<br>
+<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ccb_h-&gt;status =
=3D CAM_REQ_CMP;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 xpt_done(ccb);<br>
```<br>
<br>
Things come clear and I know why this weird speed happens, now it is time t=
o decide how to fix it.<br>
<br>
Fixing the consumer of cam, aka pvscsi driver, is quite simple and promisin=
g. I did a quick search it appears other consumers set `cts-&gt;xport_speci=
fic.sas.valid` correctly. It does not convince me as I&#39;m quite new to c=
am subsystem.<br></blockquote><div><br></div><div>Yes. sas.valid is set whe=
n the sas.bitrate and other data has been set correctly. Setting it to 0 li=
ke this ensures it&#39;s ignored.=C2=A0 So if you know the speed, set sas.b=
itrate to that speed and sas.valid to 1.</div><div><br></div><div>I&#39;m n=
ot sure I answered the question right, but if not, please clarify or point =
out what I missed and I&#39;ll try again.</div><div>=C2=A0</div><blockquote=
 class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px so=
lid rgb(204,204,204);padding-left:1ex">
Which one do you prefer, MFC commits to stable/13, or do direct fix for pvs=
csi driver to stable/13 ?<br></blockquote><div><br></div><div>[[ Sorry for =
the delay, I missed this all month ]]</div><div><br></div><div>I generally =
prefer a MFC, unless the code is no longer in -current. Even if there&#39;s=
 two different fixes for this logical bug, fixing it in current, then MFCin=
g that (with the current hash) is fine, even if the stable/13 changes are c=
ompletely different. For stable/13 I guess it matters a bit less than stabl=
e/14 since I&#39;ll be merging to it less, but if it&#39;s a commit from -c=
urrent that doesn&#39;t need to be made to -stable because of the new commi=
t on stable, I tend to include the MFC hash text.</div><div><br></div><div>=
Warner</div><div><br></div></div></div>

--000000000000a7bbda062a47a309--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfpKGMFyPAVqxPcahZXgSL=tFBin3ratjoQTtJrDeM2NUg>