Date: Fri, 27 Dec 2024 15:03:45 -0700 From: Warner Losh <imp@bsdimp.com> To: Zhenlei Huang <zlei@freebsd.org> Cc: Warner Losh <imp@freebsd.org>, stable@freebsd.org, Edward Tomasz Napierala <trasz@freebsd.org> Subject: Re: MFC fixes for uninitialized kernel stack variables in sys/cam or do direct fix for pvscsi driver Message-ID: <CANCZdfpKGMFyPAVqxPcahZXgSL=tFBin3ratjoQTtJrDeM2NUg@mail.gmail.com> In-Reply-To: <0DDE1B66-B794-472D-A901-54FA2FF1E853@FreeBSD.org> References: <0DDE1B66-B794-472D-A901-54FA2FF1E853@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
--000000000000a7bbda062a47a309 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Dec 2, 2024 at 2:41=E2=80=AFAM Zhenlei Huang <zlei@freebsd.org> wro= te: > Hi Warner, > > Recently I upgraded some ESXi vms from 13.3 to 13.4 and noticed weird > report for sas speed. > The boot console has the following, > > ``` > da0 at pvscsi0 bus 0 scbus2 target 0 lun 0 > da0: <VMware Virtual disk 2.0> Fixed Direct Access SPC-4 SCSI device > da0: 4294967.295MB/s transfers > ``` > But camcontrol report the correct value, > ``` > # camcontrol inquiry da0 -R > pass1: 750.000MB/s transfers, Command Queueing Enabled > ``` > > The `4294967.295MB` is actually 0xffff_ffff or -1 but I do not see any > logic set those values. > > Finally I managed to get the stack trace, > ``` > _scsi_announce_periph > scsi_announce_periph_sbuf > xpt_announce_periph_sbuf > dadone_proberc > xpt_done_process > xpt_done_td > fork_exit > fork_trampoline > ``` > > and noticed that the last param `cts` of `_scsi_announce_periph(struct > cam_periph *periph, u_int *speed, u_int *freq, struct ccb_trans_settings > *cts)` > is from kernel stack and is not properly initialized, latter I found some > commits related to this, > > 076686fe0703 cam: make sure to clear CCBs allocated on the stack > ec5325dbca62 cam: make sure to clear even more CCBs allocated on the stac= k > 0f206cc91279 cam: add missing zeroing of a stack-allocated CCB. > 616a676a0535 cam: clear stack-allocated CCB in the target layer > > I applied them to stable/13, rebuild and reboot, now the speed of da0 is > reported correctly. I also tried to patch the pvscsi driver with few line= s > and > it also works as intended. > > ``` > --- a/sys/dev/vmware/pvscsi/pvscsi.c > +++ b/sys/dev/vmware/pvscsi/pvscsi.c > @@ -1444,6 +1444,10 @@ pvscsi_action(struct cam_sim *sim, union ccb *ccb) > cts->proto_specific.scsi.flags =3D CTS_SCSI_FLAGS_TAG_ENB= ; > cts->proto_specific.scsi.valid =3D CTS_SCSI_VALID_TQ; > > + /* Prefer connection speed over sas port speed */ > + /* cts->xport_specific.sas.bitrate =3D 0; */ > + cts->xport_specific.sas.valid =3D 0; > + > ccb_h->status =3D CAM_REQ_CMP; > xpt_done(ccb); > ``` > > Things come clear and I know why this weird speed happens, now it is time > to decide how to fix it. > > Fixing the consumer of cam, aka pvscsi driver, is quite simple and > promising. I did a quick search it appears other consumers set > `cts->xport_specific.sas.valid` correctly. It does not convince me as I'm > quite new to cam subsystem. > Yes. sas.valid is set when the sas.bitrate and other data has been set correctly. Setting it to 0 like this ensures it's ignored. So if you know the speed, set sas.bitrate to that speed and sas.valid to 1. I'm not sure I answered the question right, but if not, please clarify or point out what I missed and I'll try again. > Which one do you prefer, MFC commits to stable/13, or do direct fix for > pvscsi driver to stable/13 ? > [[ Sorry for the delay, I missed this all month ]] I generally prefer a MFC, unless the code is no longer in -current. Even if there's two different fixes for this logical bug, fixing it in current, then MFCing that (with the current hash) is fine, even if the stable/13 changes are completely different. For stable/13 I guess it matters a bit less than stable/14 since I'll be merging to it less, but if it's a commit from -current that doesn't need to be made to -stable because of the new commit on stable, I tend to include the MFC hash text. Warner --000000000000a7bbda062a47a309 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote g= mail_quote_container"><div dir=3D"ltr" class=3D"gmail_attr">On Mon, Dec 2, = 2024 at 2:41=E2=80=AFAM Zhenlei Huang <<a href=3D"mailto:zlei@freebsd.or= g">zlei@freebsd.org</a>> wrote:<br></div><blockquote class=3D"gmail_quot= e" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204)= ;padding-left:1ex">Hi Warner,<br> <br> Recently I upgraded some ESXi vms from 13.3 to 13.4 and noticed weird repor= t for sas speed.<br> The boot console has the following,<br> <br> ```<br> da0 at pvscsi0 bus 0 scbus2 target 0 lun 0<br> da0: <VMware Virtual disk 2.0> Fixed Direct Access SPC-4 SCSI device<= br> da0: 4294967.295MB/s transfers<br> ```<br> But camcontrol report the correct value,<br> ```<br> # camcontrol inquiry da0 -R<br> pass1: 750.000MB/s transfers, Command Queueing Enabled<br> ```<br> <br> The `4294967.295MB` is actually 0xffff_ffff or -1 but I do not see any logi= c set those values.<br> <br> Finally I managed to get the stack trace,<br> ```<br> _scsi_announce_periph<br> scsi_announce_periph_sbuf<br> xpt_announce_periph_sbuf<br> dadone_proberc<br> xpt_done_process<br> xpt_done_td<br> fork_exit<br> fork_trampoline<br> ```<br> <br> and noticed that the last param `cts` of `_scsi_announce_periph(struct cam_= periph *periph, u_int *speed, u_int *freq, struct ccb_trans_settings *cts)`= <br> is from kernel stack and is not properly initialized, latter I found some c= ommits related to this,<br> <br> 076686fe0703 cam: make sure to clear CCBs allocated on the stack<br> ec5325dbca62 cam: make sure to clear even more CCBs allocated on the stack<= br> 0f206cc91279 cam: add missing zeroing of a stack-allocated CCB.<br> 616a676a0535 cam: clear stack-allocated CCB in the target layer<br> <br> I applied them to stable/13, rebuild and reboot, now the speed of da0 is re= ported correctly. I also tried to patch the pvscsi driver with few lines an= d<br> it also works as intended.<br> <br> ```<br> --- a/sys/dev/vmware/pvscsi/pvscsi.c<br> +++ b/sys/dev/vmware/pvscsi/pvscsi.c<br> @@ -1444,6 +1444,10 @@ pvscsi_action(struct cam_sim *sim, union ccb *ccb)<b= r> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cts->proto_speci= fic.scsi.flags =3D CTS_SCSI_FLAGS_TAG_ENB;<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cts->proto_speci= fic.scsi.valid =3D CTS_SCSI_VALID_TQ;<br> <br> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* Prefer connectio= n speed over sas port speed */<br> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* cts->xport_sp= ecific.sas.bitrate =3D 0; */<br> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0cts->xport_speci= fic.sas.valid =3D 0;<br> +<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ccb_h->status = =3D CAM_REQ_CMP;<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 xpt_done(ccb);<br> ```<br> <br> Things come clear and I know why this weird speed happens, now it is time t= o decide how to fix it.<br> <br> Fixing the consumer of cam, aka pvscsi driver, is quite simple and promisin= g. I did a quick search it appears other consumers set `cts->xport_speci= fic.sas.valid` correctly. It does not convince me as I'm quite new to c= am subsystem.<br></blockquote><div><br></div><div>Yes. sas.valid is set whe= n the sas.bitrate and other data has been set correctly. Setting it to 0 li= ke this ensures it's ignored.=C2=A0 So if you know the speed, set sas.b= itrate to that speed and sas.valid to 1.</div><div><br></div><div>I'm n= ot sure I answered the question right, but if not, please clarify or point = out what I missed and I'll try again.</div><div>=C2=A0</div><blockquote= class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px so= lid rgb(204,204,204);padding-left:1ex"> Which one do you prefer, MFC commits to stable/13, or do direct fix for pvs= csi driver to stable/13 ?<br></blockquote><div><br></div><div>[[ Sorry for = the delay, I missed this all month ]]</div><div><br></div><div>I generally = prefer a MFC, unless the code is no longer in -current. Even if there's= two different fixes for this logical bug, fixing it in current, then MFCin= g that (with the current hash) is fine, even if the stable/13 changes are c= ompletely different. For stable/13 I guess it matters a bit less than stabl= e/14 since I'll be merging to it less, but if it's a commit from -c= urrent that doesn't need to be made to -stable because of the new commi= t on stable, I tend to include the MFC hash text.</div><div><br></div><div>= Warner</div><div><br></div></div></div> --000000000000a7bbda062a47a309--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfpKGMFyPAVqxPcahZXgSL=tFBin3ratjoQTtJrDeM2NUg>