Date: Mon, 26 Oct 2015 21:08:42 -0600 From: Scott Long <scott4long@yahoo.com> To: "Pokala, Ravi" <rpokala@panasas.com> Cc: "freebsd-geom@freebsd.org" <freebsd-geom@freebsd.org>, "freebsd-scsi@freebsd.org" <freebsd-scsi@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, "ken@freebsd.org" <ken@freebsd.org>, "imp@freebsd.org" <imp@freebsd.org>, "scottl@freebsd.org" <scottl@freebsd.org> Subject: Re: Low-level trace-buffers in CAM Message-ID: <2C9653BA-CC10-497B-ACEF-0C76AAE2FF44@yahoo.com> In-Reply-To: <A760883F-317D-46C9-AD7C-B8F5D96A49DA@panasas.com> References: <A760883F-317D-46C9-AD7C-B8F5D96A49DA@panasas.com>
next in thread | previous in thread | raw e-mail | index | archive | help
Hi Ravi,
Yes I was one of the people who never followed up. I'll be at the vendor su=
mmit next week, let's chat some more.
Scott
Sent from my iPhone
> On Oct 26, 2015, at 8:22 PM, Pokala, Ravi <rpokala@panasas.com> wrote:
>=20
> Hi folks,
>=20
> ----------------------------------------------------------------
> This is an updated re-send of a message I originally sent about a year ago=
, during MeetBSD 2014. A few people expressed interest in person, but no one=
ever followed up on the lists. I'm bringing this up again, in the hopes tha=
t I can get some feedback before / during next week's Dev/Vendor Summit. I'm=
CCing some folks who expressed interest at that time, and some folks that I=
was told would be interested.
> ----------------------------------------------------------------
>=20
> At Panasas, we found it useful to have a trace buffer for each ATA drive i=
n the system, for gathering information about command sequences and latency.=
I implemented that functionality for our old 7-ish ATA driver, and it works=
quite well, with fairly low overhead. (I ran our usual suite of performance=
tests, and any differences were lost in the noise.) These traces allowed us=
to capture and replay access patterns (though obviously not with the same d=
ata), analyze and compare latency, etc.
>=20
> As Panasas moves toward a more modern base, we want to re-implement this f=
unctionality for ATA, SCSI, NVMe, etc. And push it upstream, of course! :-)
>=20
> First, some example output; I wrote code for three different formats:
> - The binary trace buffer dump
> Useful for manipulating in C; the other two formats are post-processed o=
ut of this, and I also wrote a replay utility and a basic stats analyzer whi=
ch operated on the binary dump.
> - Tabular hex
> Useful for parsing with scripts; I never actually used this format, but=
it was trivial so I just went ahead and wrote it.
> - Human-readable
> Useful for eyeballing; on more than one occasion, I was able to see tha=
t I had mis-assembled a command in userspace, etc.
>=20
> These examples are actually really long lines; I've broken them up and ind=
ented for the sake of email:
>=20
> # Tabular hex format, for use w/ scripts
> # Request time ATA request information Response time =20
> ATA response information sts/err flags
> 1445839793003737 c8 0000 0004 000000002fa8 1445839793003806 =20
> c8 0000 0004 000000002fa8 50 00 0000007d
> 1445839793003874 c8 0000 0020 000000113ac8 1445839793011541 =20
> c8 0000 0020 000000113ac8 50 00 0000007d
> 1445839793011795 c8 0000 0008 000000118ca8 1445839793017702 =20
> c8 0000 0008 000000118ca8 50 00 0000007d
> 1445839793017808 c8 0000 0080 000000117628 1445839793018975 =20
> c8 0000 0080 000000117628 50 00 0000007d
> 1445839793019079 c8 0000 0020 0000001177a8 1445839793019686 =20
> c8 0000 0020 0000001177a8 50 00 0000007d
>=20
> # Human-readable
> FLAGS REQUEST TIME RESPONSE TIME ELAPSED TIME =
=20
> REQUEST COMMAND STATUS ERROR =20
> RESPONSE COMMAND =20
> ----------- -------------------- -------------------- -----------------=
---=20
> ---------------------------------------- -------- --------=20
> ----------------------------------------
> ____C...._V 1445839793003737 1445839793003806 69 =
=20
> READ DMA (LBA 12200 + 4 sectors) _R_S____ ________=20
> ---- =20
> ____C...._V 1445839793003874 1445839793011541 7667 =
=20
> READ DMA (LBA 1129160 + 32 sectors) _R_S____ ________=20
> ---- =20
> ____C...._V 1445839793011795 1445839793017702 5907 =
=20
> READ DMA (LBA 1150120 + 8 sectors) _R_S____ ________=20
> ---- =20
> ____C...._V 1445839793017808 1445839793018975 1167 =
=20
> READ DMA (LBA 1144360 + 128 sectors) _R_S____ ________=20
> ---- =20
> ____C...._V 1445839793019079 1445839793019686 607 =
=20
> READ DMA (LBA 1144744 + 32 sectors) _R_S____ ________=20
> ---- =20
>=20
> The obvious place to put this sort of tracing would be in GEOM, but that a=
ctually isn't what I want, for a few reasons. First of all, I'm primarily in=
terested in the exact register / taskfile / CDB values sent to the drive - o=
pcode, feature codes, LBAs, etc. Second, I'm interested in hardware-level la=
tency - how long does a request stay in the controller and device. Both of t=
hose pretty much require working below GEOM.
>=20
> Therefore, I think I want to do it in the CAM stack; ideally, I'd put the h=
ooks in the SIM drivers, to get as close to the metal as possible, to get th=
e most accurate latency. However, even a light touch to every SIM driver is g=
oing to be painful, and since I don't have access to every controller in the=
world, testing would be difficult. Adding the hooks to da(4) and ada(4) wou=
ld still be pretty close to the metal, and would require much smaller change=
s. Or maybe they belong in the XPT layer? (I don't know nearly as much about=
CAM as I should. :-/)
>=20
> --------
> NOTE: When I mentioned this idea at a (San Francisco) Bay Area FreeBSD Use=
r Group (BAFUG) meeting, there were suggestions about integrating this with D=
Trace or KTrace instead. I'm happy to consider either of those, but I know e=
ven less about those than I do about CAM. If that's really the best way to d=
o this sort of thing, I'm happy to learn, if someone is willing to teach.
> --------
>=20
> I took the Panasas / 7.x implementation, and pseudo-coded what it might lo=
ok like in CAM. (I could probably make diffs against the old ATA stack avail=
able for comparison, but I figured no one would care anymore.) A huge caveat=
is that I've never really poked in CAM before, so I fully expect I'm doing s=
everal things wrong. Please educate me. :-)
>=20
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D
>=20
> Create a circular buffer of trace records for each CAM PERIPH. Each record=
will contain the full CDB for the request as sent to the device, the timest=
amp (in usec) of the request, the CDB as returned by the device, the timesta=
mp of the response, and the status and error code. Additionally, flags are p=
rovided to track the state of the trace record, and to indicate which type o=
f periph the record is associated with.
>=20
> --------
>=20
> typedef struct {
> u_int32_t flags;
> #define CAM_TRACE_F_VALID 0x00000001 /* valid record */
> #define CAM_TRACE_F_INPROGRESS 0x00000002 /* request in progress */
> #define CAM_TRACE_F_REQ_VALID 0x00000004 /* request data valid */
> #define CAM_TRACE_F_RESP_VALID 0x00000008 /* response data valid */
> #define CAM_TRACE_F_COMPLETE 0x00000010 /* response gathered */
> #define CAM_TRACE_F_TIMEDOUT 0x00000020 /* request timed out */
> #define CAM_TRACE_F_ABANDONED 0x00000040 /* cancelled, etc. */
> #define CAM_TRACE_F_RETRIED 0x00000080 /* retry in later record */
> #define CAM_TRACE_F_IS_RETRY 0x00000100 /* retry of earlier req. */
> #define CAM_TRACE_F_PERIPH_MASK 0xf0000000 /* up to 16 PERIPH types *=
/
> #define CAM_TRACE_F_PERIPH_DA 0x0
> #define CAM_TRACE_F_PERIPH_ADA 0x1
> #define CAM_TRACE_F_PERIPH_NVD 0x2
> u_int64_t req_usec;
> cdb_t req_cdb;
> u_int64_t resp_usec;
> cdb_t resp_cdb;
> u_int8_t resp_status;
> u_int8_t resp_error;
> u_int8_t resp_asc; /* Needed for SCSI? */
> u_int8_t resp_asq; /* Needed for SCSI? */
> u_int8_t padding[8]; /* Pad to 64 bytes */
> /* There's going to be a large-ish array of these in the kernel; make sure=
> * they're packed to minimize space usage, and keep things aligned. It may
> * make sense to re-order the fields for alignment as well.
> */
> } cam_trace_record_t __attribute__((packed, aligned(8)));
>=20
> typedef struct {
> u_int32_t trace_buf_record_count;
> cam_trace_record_t *trace_buf;
> } cam_trace_buffer_t;
>=20
> /* This is settable via a tunable; for Panasas' purposes, 100K entries was=
> * enough to provide a good amount of history even on a busy system, while
> * not taking up much space (< 5MB/drive).
> */
> #define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000
>=20
> ----------------
>=20
> ioctl interfaces are used to get the trace buffer size, to retrieve the bu=
ffer contents, or to clear the buffer. The buffer can then be analyzed from u=
serspace, or written to a file for post-processing.
>=20
> --------
>=20
> #define CAM_GET_TRACE_RECORD_COUNT _IOR(???, ???, u_int32_t)
> #define CAM_GET_TRACE_BUFFER _IOWR(???, ???, cam_trace_buffer_t)
> #define CAM_CLEAR_TRACE_BUFFER _IOW(???, ???, u_int32_t)
>=20
> ----------------
>=20
> To get the buffer:
>=20
> --------
>=20
> cam_trace_buffer_t trace =3D NULL;
> fd =3D open(periph_name, O_RDONLY);
> error =3D ioctl(fd, CAM_GET_TRACE_RECORD_COUNT,
> &trace.trace_buf_record_count);
> trace.trace_buf =3D malloc(trace.trace_buf_record_count *
> sizeof(cam_trace_record_t));
> error =3D ioctl(fd, CAM_GET_TRACE_BUFFER, &trace);
>=20
> ----------------
>=20
> By including the full CDB, the entire command can be decoded by analysis t=
ools. Tools can be written to parse the buffer and translate command data in=
to human-readable descriptions, calculate elapsed times for requests, perfor=
m statistical analysis based on opcode, command type, transfer size, etc. Co=
mmand decoders would need to be written for each PERIPH type (i.e. one for A=
TA, one for SCSI, one for NVMe, etc.).
>=20
> ----------------
>=20
> Each PERIPH's softc needs to include some information about the per-device=
> buffer.
>=20
> --------
>=20
> u_int32_t trace_buf_record_count;
> u_int64_t trace_buf_request_count;
> cam_trace_record_t *trace_buf;
>=20
> ----------------
>=20
> In each SIM, there should already be some sort of request structure, which=
associates the softc and the CDB of the request. To that structure, add a p=
ointer to the trace record associated with the request.
>=20
> --------
>=20
> request structure:
> cam_trace_record_t *trace_rec;
>=20
> ----------------
>=20
> When initializing the per-device softc, allocate a trace buffer. The numbe=
r of records can be hard-coded, or settable via a tunable.
>=20
> The example below has a global size for everything, but it should be possi=
ble to do it on a per-device or per-PERIPH basis.
>=20
> --------
>=20
> /** Initialize per-drive command tracing buffer. **/
> /* Get the buffer size from the tunable. Sanity check it. */
> sc->trace_buf_record_count =3D CAM_TRACE_BUFFER_DEFAULT_COUNT;
> TUNABLE_ULONG_FETCH("cam.trace_buf_record_count",
> &trace_buf_record_count_tunable);
> if (trace_buf_record_count_tunable) {
> /* Less than 1K records is unusably few; more than 1M
> * records is unusably many.
> */
> if ((trace_buf_record_count_tunable < (1 << 10)) ||
> (trace_buf_record_count_tunable > (1 << 20))) {
> device_printf(dev, "cam.trace_buf_record_count=3D%lu"
> " out of range; should be between %u and %u\n",
> trace_buf_record_count_tunable, (1 << 10),
> (1 << 20));
> } else {
> sc->trace_buf_record_count =3D (u_int32_t)
> trace_buf_record_count_tunable;
> }
> }
> /* Allocate the buffer. */
> trace_buf_bytes =3D sc->trace_buf_record_count *
> sizeof(cam_trace_record_t);
> sc->trace_buf =3D malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO);
> if (sc->trace_buf =3D=3D NULL) {
> device_printf(dev, "unable to allocate %zd for trace_buf\n",
> trace_buf_bytes);
> sc->trace_buf_record_count =3D 0;
> } else {
> device_printf(dev, "allocated %zd @ %p for trace_buf\n",
> trace_buf_bytes, sc->trace_buf);
> }
> /* Request counter, used for indexing */
> sc->trace_buf_request_count =3D 0;
>=20
>=20
> ----------------
>=20
> /* Initial setup of the trace record for a new request.
> *
> * Locate the circular per-device trace buffer in the drive's softc.
> * Populate the pointer to the next record in the trace buffer; this is
> * done atomically to prevent concurrent requests from getting the same
> * index. That could get a bit weird if/when 'trace_buf_request_count'
> * rolls over, if the maximum counter value is not a multiple of
> * 'trace_buf_record_count' (i.e. the request counter recycles in the
> * middle of the buffer). By using a 64-bit request counter, we're pretty
> * safe - even at one transaction per nanosecond, that will still last
> * about 585 years.
> *
> * Initialize the record by zeroing it and then setting the VALID flag.
> *
> * This could be done at any point before sending the request to the
> * hardware; neither the CDB data nor the timestamps are involved yet.
> */
> void
> trace_rec_init(cam_request_t req) {
> ada_softc *sc =3D (request->...->softc)
> u_int64_t tmp =3D 0;
> if (sc && sc->trace_buf) {
> tmp =3D atomic_fetchadd_log(&sc->trace_buf_request_count, 1);
> req->trace_rec =3D &sc->trace_buf[tmp %
> sc->trace_buf_record_count];
> bzero(req->trace_rec, sizeof(*req->trace_rec));
> req->trace_rec->flags =3D (CAM_TRACE_F_PERIPH_XXX << 28);
> req->trace_rec->flags |=3D CAM_TRACE_F_VALID;
> } else {
> req->trace_rec =3D NULL;
> }
> }
>=20
> ----------------
>=20
> /* Stuff the request data into the trace record.
> *
> * If the trace record is present and valid, copy the CDB into the
> * record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is
> * valid, and INPROGRESS to show that the request is in-progress.
> *
> * This should be done just before sending the request to the hardware.
> * This should *not* be fused into trace_rec_init(), because
> * trace_rec_init() is also used in the retry case, where it is not
> * correct to immediately start the timer.
> */
> void
> trace_rec_req(cam_request_t req) {
> struct timeval tv;
> if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
> bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb));
> microuptime(&tv);
> req->trace_rec->req_usec =3D ((uint64_t) tv->tv_sec * 1000000)
> + tv->tv_usec;
> req->trace_rec->flags |=3D
> (CAM_TRACE_F_REQ_VALID | CAM_TRACE_F_INPROGRESS);
> }
> }
>=20
> ----------------
>=20
> /* Stuff the response data into the trace record.
> *
> * If the trace record is present and valid, copy the CDB **as returned by=
> * the drive** into the record. At this point, we also have the result fro=
m
> * the command (status, error, (asc? ascq?)), so save them too. Set
> * CAM_TRACE_F_RESP_VALID to indicate that the response data is valid.
> * Clear INPROGRESS and set COMPLETE instead.
> *
> * This should be done just after getting the response information from th=
e
> * hardware.
> */
> void
> trace_rec_resp(cam_request_t req) {
> struct timeval tv;
> if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
> microuptime(&tv);
> bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb));
> req->trace_rec->resp_usec =3D ((uint64_t) tv->tv_sec *
> 1000000) + tv->tv_usec;
> req->trace_rec->flags |=3D
> (CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE);
> req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS;
> }
> }
>=20
> ----------------
>=20
> /* Update the trace record if there was a timeout.
> *
> * If the trace record is present and valid, set TIMEDOUT to indicate
> * that the request associated with the record exceeded its timeout.
> * Since this attempt is complete, clear INPROGRESS and set COMPLETE.
> *
> * This should be done along with other timeout handling.
> */
> void
> trace_rec_timeout(cam_request_t req) {
> if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
> req->trace_rec->flags |=3D
> (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE);
> req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS;
> }
> }
>=20
> ----------------
>=20
> /* Update the trace record if the request is being retried.
> * =20
> * If the trace record is present and valid, set RETRIED to indicate
> * that the request associated with the record is going to be retried.
> * Since this attempt is complete, clear INPROGRESS and set COMPLETE.
> * Then get a new trace record for the new attempt, copy the request
> * command data to the new record, and set IS_RETRY to show that this
> * is a retry of an earlier request.
> *
> * This should be done at the time the retry is queued.
> */
> void
> trace_rec_retry(cam_request_t req) {
> if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
> /* First, mark the original request as being requeued. */
> req->trace_rec->flags |=3D
> (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE);
> req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS;
> /* Get a new record for the retry. */
> trace_rec_init(req);
> /* Mark the new record as being a retry. */
> req->trace_rec->flags |=3D CAM_TRACE_F_IS_RETRY;
> }
> }
>=20
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D
>=20
> I'd appreciate feedback, both on the implementation details and on the ide=
a as a whole. Some open questions are:
>=20
> - My assumption is that each SIM has a structure that associates a PERIPH'=
s softc with the CDB that's being processed. It might require dereferencing a=
chain of pointers, but it should be there in some fashion.
>=20
> - The current implementation adds a few ioctls to the old ad(4) disk-devic=
e driver. I would expect to add the new ioctls to da(4) and ada(4); is that t=
he right thing to do?
>=20
> Again, any feedback would be greatly appreciated. I'll be at the Dev/Vendo=
r Summit next week, if anyone wants to discuss this with me in person.
>=20
> Thanks,
>=20
> Ravi
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2C9653BA-CC10-497B-ACEF-0C76AAE2FF44>
