Date: Mon, 14 Dec 2009 17:09:08 -0500
From: Alexander Sack <pisymbol@gmail.com>
To: freebsd-current@freebsd.org
Cc: freebsd-scsi@freebsd.org
Subject: Re: aac(4) handling of probe when no devices are there
Message-ID: <3c0b01820912141409t74a3554ctd224db485ceeb80c@mail.gmail.com>
In-Reply-To: <3c0b01820912141347y366a7252y5d9711b1141b9b70@mail.gmail.com>
References: <3c0b01820912141347y366a7252y5d9711b1141b9b70@mail.gmail.com>

On Mon, Dec 14, 2009 at 4:47 PM, Alexander Sack <pisymbol@gmail.com> wrote:
> Hello Again:
>
> I guess I have a technical question/concern that I was looking for
> feedback on.  During the probe sequence, aac(4) conditionally responds
> to INQUIRY commands depending on target LUN:
>
> aac_cam.c/aac_cam_complete():
>         if (command == INQUIRY) {
>                 if (ccb->ccb_h.status == CAM_REQ_CMP) {
>                         device = ccb->csio.data_ptr[0] & 0x1f;
>                         /*
>                          * We want DASD and PROC devices to only be
>                          * visible through the pass device.
>                          */
>                         if ((device == T_DIRECT) ||
>                             (device == T_PROCESSOR) ||
>                             (sc->flags & AAC_FLAGS_CAM_PASSONLY))
>                                 ccb->csio.data_ptr[0] =
>                                     ((device & 0xe0) | T_NODEVICE);
>                 } else if (ccb->ccb_h.status == CAM_SEL_TIMEOUT &&
>                     ccb->ccb_h.target_lun != 0) {
>                         /* fix for INQUIRYs on Lun>0 */
>                         ccb->ccb_h.status = CAM_DEV_NOT_THERE;
>                 }
>         }
>
> Why is CAM_DEV_NOT_THERE skipped on LUN 0?  This is true on my target
> 6.1-amd64 machine as well as CURRENT.  The reason I ask is that now
> that aac(4) is sequentially scanned, a lot of CAM interrupts come in
> on my 6.x machine, where the threshold is only 500, and I get the
> interrupt storm threshold warning for swi2 pretty quickly:
>
> Interrupt storm detected on "swi2:"; throttling interrupt source
>
> Obviously it's contingent on the number of adapters you have on your
> system.  On CURRENT I didn't see this because the threshold is double
> (I think it's 1000 by default).
>
> The issue is the number of xpt_async(AC_LOST_DEVICE, ..) calls during
> the scan.  The probe sequence in CURRENT as well as 6.1 handles
> CAM_SEL_TIMEOUT a little differently depending on context.
>
> scsi_xpt.c/probedone():
>         } else if (cam_periph_error(done_ccb, 0,
>                                     done_ccb->ccb_h.target_lun > 0
>                                     ? SF_RETRY_UA|SF_QUIET_IR
>                                     : SF_RETRY_UA,
>                                     &softc->saved_ccb) == ERESTART) {
>                 return;
>         } else if ((done_ccb->ccb_h.status & CAM_DEV_QFRZN) != 0) {
>                 /* Don't wedge the queue */
>                 xpt_release_devq(done_ccb->ccb_h.path, /*count*/1,
>                                  /*run_queue*/TRUE);
>         }
>         /*
>          * If we get to this point, we got an error status back
>          * from the inquiry and the error status doesn't require
>          * automatically retrying the command.  Therefore, the
>          * inquiry failed.  If we had inquiry information before
>          * for this device, but this latest inquiry command failed,
>          * the device has probably gone away.  If this device isn't
>          * already marked unconfigured, notify the peripheral
>          * drivers that this device is no more.
>          */
>         if ((path->device->flags & CAM_DEV_UNCONFIGURED) == 0)
>                 /* Send the async notification. */
>                 xpt_async(AC_LOST_DEVICE, path, NULL);
>
>         xpt_release_ccb(done_ccb);
>         break;
> }
>
> But cam_periph_error() will issue an xpt_async(AC_LOST_DEVICE, path,
> NULL) regardless of whether or not the device has been seen already
> (as per the comment above), i.e. on every initial bus scan, you will
> get into (on an aac(4) card with LUN > 0):
>
> cam_periph.c/cam_periph_error():
>         case CAM_SEL_TIMEOUT:
>         {
> .
> .
>                 /*
>                  * Let peripheral drivers know that this device has gone
>                  * away.
>                  */
>                 xpt_async(AC_LOST_DEVICE, newpath, NULL);
>                 xpt_free_path(newpath);
>                 break;
>
> Is this really right?  This generates A LOT of interrupt noise when no
> devices are attached during the initial scan, i.e. we are treating
> failed INQUIRY commands on the SCSI bus during the initial scan as if
> we really lost a device during a selection timeout (we even generate a
> path to issue the async event).

I should have titled the thread better, but basically we always
generate a ton of CAM software interrupts during a LUN scan for targets
on aac(4) that do not exist (i.e. nothing is truly there).  We do this
because a failed initial INQUIRY is treated as a selection timeout
rather than as "device not there".  There seems to be a historical
workaround for part of this issue, but I am trying to delve deeper in
order to do the *right thing* for our 6.1 deployments (as well as 7.x
and CURRENT).

-aps
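
For illustration only, the LUN-dependent status rewrite from the quoted
aac_cam_complete() snippet can be modeled as a standalone C function.
This is a sketch, not the driver code: the enum, its values, and the
name sketch_fixup_inquiry_status() are hypothetical stand-ins, and only
the branch structure follows the snippet above.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the CAM status codes; values are arbitrary. */
enum sketch_status {
    SKETCH_REQ_CMP,
    SKETCH_SEL_TIMEOUT,
    SKETCH_DEV_NOT_THERE
};

#define SKETCH_INQUIRY 0x12 /* SCSI INQUIRY opcode */

/*
 * Model only the else-branch of the quoted snippet: a selection timeout
 * on an INQUIRY is rewritten to "device not there" when the LUN is
 * non-zero, and is left untouched for LUN 0 and for other commands.
 */
static enum sketch_status
sketch_fixup_inquiry_status(uint8_t command, enum sketch_status status,
    uint32_t target_lun)
{
    if (command == SKETCH_INQUIRY && status == SKETCH_SEL_TIMEOUT &&
        target_lun != 0)
        return SKETCH_DEV_NOT_THERE;
    return status;
}
```

In this model a timed-out INQUIRY on LUN 0 still comes back as a
selection timeout, which is the case that reaches the CAM_SEL_TIMEOUT
branch of cam_periph_error() quoted above.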