Date: Mon, 26 Oct 2015 21:08:42 -0600 From: Scott Long <scott4long@yahoo.com> To: "Pokala, Ravi" <rpokala@panasas.com> Cc: "freebsd-geom@freebsd.org" <freebsd-geom@freebsd.org>, "freebsd-scsi@freebsd.org" <freebsd-scsi@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, "ken@freebsd.org" <ken@freebsd.org>, "imp@freebsd.org" <imp@freebsd.org>, "scottl@freebsd.org" <scottl@freebsd.org> Subject: Re: Low-level trace-buffers in CAM Message-ID: <2C9653BA-CC10-497B-ACEF-0C76AAE2FF44@yahoo.com> In-Reply-To: <A760883F-317D-46C9-AD7C-B8F5D96A49DA@panasas.com> References: <A760883F-317D-46C9-AD7C-B8F5D96A49DA@panasas.com>
next in thread | previous in thread | raw e-mail | index | archive | help
Hi Ravi, Yes I was one of the people who never followed up. I'll be at the vendor su= mmit next week, let's chat some more. Scott Sent from my iPhone > On Oct 26, 2015, at 8:22 PM, Pokala, Ravi <rpokala@panasas.com> wrote: >=20 > Hi folks, >=20 > ---------------------------------------------------------------- > This is an updated re-send of a message I originally sent about a year ago= , during MeetBSD 2014. A few people expressed interest in person, but no one= ever followed up on the lists. I'm bringing this up again, in the hopes tha= t I can get some feedback before / during next week's Dev/Vendor Summit. I'm= CCing some folks who expressed interest at that time, and some folks that I= was told would be interested. > ---------------------------------------------------------------- >=20 > At Panasas, we found it useful to have a trace buffer for each ATA drive i= n the system, for gathering information about command sequences and latency.= I implemented that functionality for our old 7-ish ATA driver, and it works= quite well, with fairly low overhead. (I ran our usual suite of performance= tests, and any differences were lost in the noise.) These traces allowed us= to capture and replay access patterns (though obviously not with the same d= ata), analyze and compare latency, etc. >=20 > As Panasas moves toward a more modern base, we want to re-implement this f= unctionality for ATA, SCSI, NVMe, etc. And push it upstream, of course! :-) >=20 > First, some example output; I wrote code for three different formats: > - The binary trace buffer dump > Useful for manipulating in C; the other two formats are post-processed o= ut of this, and I also wrote a replay utility and a basic stats analyzer whi= ch operated on the binary dump. > - Tabular hex > Useful for parsing with scripts; I never actually used this format, but= it was trivial so I just went ahead and wrote it. > - Human-readable > Useful for eyeballing; on more than one occasion, I was able to see tha= t I had mis-assembled a command in userspace, etc. >=20 > These examples are actually really long lines; I've broken them up and ind= ented for the sake of email: >=20 > # Tabular hex format, for use w/ scripts > # Request time ATA request information Response time =20 > ATA response information sts/err flags > 1445839793003737 c8 0000 0004 000000002fa8 1445839793003806 =20 > c8 0000 0004 000000002fa8 50 00 0000007d > 1445839793003874 c8 0000 0020 000000113ac8 1445839793011541 =20 > c8 0000 0020 000000113ac8 50 00 0000007d > 1445839793011795 c8 0000 0008 000000118ca8 1445839793017702 =20 > c8 0000 0008 000000118ca8 50 00 0000007d > 1445839793017808 c8 0000 0080 000000117628 1445839793018975 =20 > c8 0000 0080 000000117628 50 00 0000007d > 1445839793019079 c8 0000 0020 0000001177a8 1445839793019686 =20 > c8 0000 0020 0000001177a8 50 00 0000007d >=20 > # Human-readable > FLAGS REQUEST TIME RESPONSE TIME ELAPSED TIME = =20 > REQUEST COMMAND STATUS ERROR =20 > RESPONSE COMMAND =20 > ----------- -------------------- -------------------- -----------------= ---=20 > ---------------------------------------- -------- --------=20 > ---------------------------------------- > ____C...._V 1445839793003737 1445839793003806 69 = =20 > READ DMA (LBA 12200 + 4 sectors) _R_S____ ________=20 > ---- =20 > ____C...._V 1445839793003874 1445839793011541 7667 = =20 > READ DMA (LBA 1129160 + 32 sectors) _R_S____ ________=20 > ---- =20 > ____C...._V 1445839793011795 1445839793017702 5907 = =20 > READ DMA (LBA 1150120 + 8 sectors) _R_S____ ________=20 > ---- =20 > ____C...._V 1445839793017808 1445839793018975 1167 = =20 > READ DMA (LBA 1144360 + 128 sectors) _R_S____ ________=20 > ---- =20 > ____C...._V 1445839793019079 1445839793019686 607 = =20 > READ DMA (LBA 1144744 + 32 sectors) _R_S____ ________=20 > ---- =20 >=20 > The obvious place to put this sort of tracing would be in GEOM, but that a= ctually isn't what I want, for a few reasons. First of all, I'm primarily in= terested in the exact register / taskfile / CDB values sent to the drive - o= pcode, feature codes, LBAs, etc. Second, I'm interested in hardware-level la= tency - how long does a request stay in the controller and device. Both of t= hose pretty much require working below GEOM. >=20 > Therefore, I think I want to do it in the CAM stack; ideally, I'd put the h= ooks in the SIM drivers, to get as close to the metal as possible, to get th= e most accurate latency. However, even a light touch to every SIM driver is g= oing to be painful, and since I don't have access to every controller in the= world, testing would be difficult. Adding the hooks to da(4) and ada(4) wou= ld still be pretty close to the metal, and would require much smaller change= s. Or maybe they belong in the XPT layer? (I don't know nearly as much about= CAM as I should. :-/) >=20 > -------- > NOTE: When I mentioned this idea at a (San Francisco) Bay Area FreeBSD Use= r Group (BAFUG) meeting, there were suggestions about integrating this with D= Trace or KTrace instead. I'm happy to consider either of those, but I know e= ven less about those than I do about CAM. If that's really the best way to d= o this sort of thing, I'm happy to learn, if someone is willing to teach. > -------- >=20 > I took the Panasas / 7.x implementation, and pseudo-coded what it might lo= ok like in CAM. (I could probably make diffs against the old ATA stack avail= able for comparison, but I figured no one would care anymore.) A huge caveat= is that I've never really poked in CAM before, so I fully expect I'm doing s= everal things wrong. Please educate me. :-) >=20 > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D >=20 > Create a circular buffer of trace records for each CAM PERIPH. Each record= will contain the full CDB for the request as sent to the device, the timest= amp (in usec) of the request, the CDB as returned by the device, the timesta= mp of the response, and the status and error code. Additionally, flags are p= rovided to track the state of the trace record, and to indicate which type o= f periph the record is associated with. >=20 > -------- >=20 > typedef struct { > u_int32_t flags; > #define CAM_TRACE_F_VALID 0x00000001 /* valid record */ > #define CAM_TRACE_F_INPROGRESS 0x00000002 /* request in progress */ > #define CAM_TRACE_F_REQ_VALID 0x00000004 /* request data valid */ > #define CAM_TRACE_F_RESP_VALID 0x00000008 /* response data valid */ > #define CAM_TRACE_F_COMPLETE 0x00000010 /* response gathered */ > #define CAM_TRACE_F_TIMEDOUT 0x00000020 /* request timed out */ > #define CAM_TRACE_F_ABANDONED 0x00000040 /* cancelled, etc. */ > #define CAM_TRACE_F_RETRIED 0x00000080 /* retry in later record */ > #define CAM_TRACE_F_IS_RETRY 0x00000100 /* retry of earlier req. */ > #define CAM_TRACE_F_PERIPH_MASK 0xf0000000 /* up to 16 PERIPH types *= / > #define CAM_TRACE_F_PERIPH_DA 0x0 > #define CAM_TRACE_F_PERIPH_ADA 0x1 > #define CAM_TRACE_F_PERIPH_NVD 0x2 > u_int64_t req_usec; > cdb_t req_cdb; > u_int64_t resp_usec; > cdb_t resp_cdb; > u_int8_t resp_status; > u_int8_t resp_error; > u_int8_t resp_asc; /* Needed for SCSI? */ > u_int8_t resp_asq; /* Needed for SCSI? */ > u_int8_t padding[8]; /* Pad to 64 bytes */ > /* There's going to be a large-ish array of these in the kernel; make sure= > * they're packed to minimize space usage, and keep things aligned. It may > * make sense to re-order the fields for alignment as well. > */ > } cam_trace_record_t __attribute__((packed, aligned(8))); >=20 > typedef struct { > u_int32_t trace_buf_record_count; > cam_trace_record_t *trace_buf; > } cam_trace_buffer_t; >=20 > /* This is settable via a tunable; for Panasas' purposes, 100K entries was= > * enough to provide a good amount of history even on a busy system, while > * not taking up much space (< 5MB/drive). > */ > #define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000 >=20 > ---------------- >=20 > ioctl interfaces are used to get the trace buffer size, to retrieve the bu= ffer contents, or to clear the buffer. The buffer can then be analyzed from u= serspace, or written to a file for post-processing. >=20 > -------- >=20 > #define CAM_GET_TRACE_RECORD_COUNT _IOR(???, ???, u_int32_t) > #define CAM_GET_TRACE_BUFFER _IOWR(???, ???, cam_trace_buffer_t) > #define CAM_CLEAR_TRACE_BUFFER _IOW(???, ???, u_int32_t) >=20 > ---------------- >=20 > To get the buffer: >=20 > -------- >=20 > cam_trace_buffer_t trace =3D NULL; > fd =3D open(periph_name, O_RDONLY); > error =3D ioctl(fd, CAM_GET_TRACE_RECORD_COUNT, > &trace.trace_buf_record_count); > trace.trace_buf =3D malloc(trace.trace_buf_record_count * > sizeof(cam_trace_record_t)); > error =3D ioctl(fd, CAM_GET_TRACE_BUFFER, &trace); >=20 > ---------------- >=20 > By including the full CDB, the entire command can be decoded by analysis t= ools. Tools can be written to parse the buffer and translate command data in= to human-readable descriptions, calculate elapsed times for requests, perfor= m statistical analysis based on opcode, command type, transfer size, etc. Co= mmand decoders would need to be written for each PERIPH type (i.e. one for A= TA, one for SCSI, one for NVMe, etc.). >=20 > ---------------- >=20 > Each PERIPH's softc needs to include some information about the per-device= > buffer. >=20 > -------- >=20 > u_int32_t trace_buf_record_count; > u_int64_t trace_buf_request_count; > cam_trace_record_t *trace_buf; >=20 > ---------------- >=20 > In each SIM, there should already be some sort of request structure, which= associates the softc and the CDB of the request. To that structure, add a p= ointer to the trace record associated with the request. >=20 > -------- >=20 > request structure: > cam_trace_record_t *trace_rec; >=20 > ---------------- >=20 > When initializing the per-device softc, allocate a trace buffer. The numbe= r of records can be hard-coded, or settable via a tunable. >=20 > The example below has a global size for everything, but it should be possi= ble to do it on a per-device or per-PERIPH basis. >=20 > -------- >=20 > /** Initialize per-drive command tracing buffer. **/ > /* Get the buffer size from the tunable. Sanity check it. */ > sc->trace_buf_record_count =3D CAM_TRACE_BUFFER_DEFAULT_COUNT; > TUNABLE_ULONG_FETCH("cam.trace_buf_record_count", > &trace_buf_record_count_tunable); > if (trace_buf_record_count_tunable) { > /* Less than 1K records is unusably few; more than 1M > * records is unusably many. > */ > if ((trace_buf_record_count_tunable < (1 << 10)) || > (trace_buf_record_count_tunable > (1 << 20))) { > device_printf(dev, "cam.trace_buf_record_count=3D%lu" > " out of range; should be between %u and %u\n", > trace_buf_record_count_tunable, (1 << 10), > (1 << 20)); > } else { > sc->trace_buf_record_count =3D (u_int32_t) > trace_buf_record_count_tunable; > } > } > /* Allocate the buffer. */ > trace_buf_bytes =3D sc->trace_buf_record_count * > sizeof(cam_trace_record_t); > sc->trace_buf =3D malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO); > if (sc->trace_buf =3D=3D NULL) { > device_printf(dev, "unable to allocate %zd for trace_buf\n", > trace_buf_bytes); > sc->trace_buf_record_count =3D 0; > } else { > device_printf(dev, "allocated %zd @ %p for trace_buf\n", > trace_buf_bytes, sc->trace_buf); > } > /* Request counter, used for indexing */ > sc->trace_buf_request_count =3D 0; >=20 >=20 > ---------------- >=20 > /* Initial setup of the trace record for a new request. > * > * Locate the circular per-device trace buffer in the drive's softc. > * Populate the pointer to the next record in the trace buffer; this is > * done atomically to prevent concurrent requests from getting the same > * index. That could get a bit weird if/when 'trace_buf_request_count' > * rolls over, if the maximum counter value is not a multiple of > * 'trace_buf_record_count' (i.e. the request counter recycles in the > * middle of the buffer). By using a 64-bit request counter, we're pretty > * safe - even at one transaction per nanosecond, that will still last > * about 585 years. > * > * Initialize the record by zeroing it and then setting the VALID flag. > * > * This could be done at any point before sending the request to the > * hardware; neither the CDB data nor the timestamps are involved yet. > */ > void > trace_rec_init(cam_request_t req) { > ada_softc *sc =3D (request->...->softc) > u_int64_t tmp =3D 0; > if (sc && sc->trace_buf) { > tmp =3D atomic_fetchadd_log(&sc->trace_buf_request_count, 1); > req->trace_rec =3D &sc->trace_buf[tmp % > sc->trace_buf_record_count]; > bzero(req->trace_rec, sizeof(*req->trace_rec)); > req->trace_rec->flags =3D (CAM_TRACE_F_PERIPH_XXX << 28); > req->trace_rec->flags |=3D CAM_TRACE_F_VALID; > } else { > req->trace_rec =3D NULL; > } > } >=20 > ---------------- >=20 > /* Stuff the request data into the trace record. > * > * If the trace record is present and valid, copy the CDB into the > * record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is > * valid, and INPROGRESS to show that the request is in-progress. > * > * This should be done just before sending the request to the hardware. > * This should *not* be fused into trace_rec_init(), because > * trace_rec_init() is also used in the retry case, where it is not > * correct to immediately start the timer. > */ > void > trace_rec_req(cam_request_t req) { > struct timeval tv; > if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { > bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb)); > microuptime(&tv); > req->trace_rec->req_usec =3D ((uint64_t) tv->tv_sec * 1000000) > + tv->tv_usec; > req->trace_rec->flags |=3D > (CAM_TRACE_F_REQ_VALID | CAM_TRACE_F_INPROGRESS); > } > } >=20 > ---------------- >=20 > /* Stuff the response data into the trace record. > * > * If the trace record is present and valid, copy the CDB **as returned by= > * the drive** into the record. At this point, we also have the result fro= m > * the command (status, error, (asc? ascq?)), so save them too. Set > * CAM_TRACE_F_RESP_VALID to indicate that the response data is valid. > * Clear INPROGRESS and set COMPLETE instead. > * > * This should be done just after getting the response information from th= e > * hardware. > */ > void > trace_rec_resp(cam_request_t req) { > struct timeval tv; > if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { > microuptime(&tv); > bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb)); > req->trace_rec->resp_usec =3D ((uint64_t) tv->tv_sec * > 1000000) + tv->tv_usec; > req->trace_rec->flags |=3D > (CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE); > req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS; > } > } >=20 > ---------------- >=20 > /* Update the trace record if there was a timeout. > * > * If the trace record is present and valid, set TIMEDOUT to indicate > * that the request associated with the record exceeded its timeout. > * Since this attempt is complete, clear INPROGRESS and set COMPLETE. > * > * This should be done along with other timeout handling. > */ > void > trace_rec_timeout(cam_request_t req) { > if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { > req->trace_rec->flags |=3D > (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE); > req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS; > } > } >=20 > ---------------- >=20 > /* Update the trace record if the request is being retried. > * =20 > * If the trace record is present and valid, set RETRIED to indicate > * that the request associated with the record is going to be retried. > * Since this attempt is complete, clear INPROGRESS and set COMPLETE. > * Then get a new trace record for the new attempt, copy the request > * command data to the new record, and set IS_RETRY to show that this > * is a retry of an earlier request. > * > * This should be done at the time the retry is queued. > */ > void > trace_rec_retry(cam_request_t req) { > if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { > /* First, mark the original request as being requeued. */ > req->trace_rec->flags |=3D > (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE); > req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS; > /* Get a new record for the retry. */ > trace_rec_init(req); > /* Mark the new record as being a retry. */ > req->trace_rec->flags |=3D CAM_TRACE_F_IS_RETRY; > } > } >=20 > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D >=20 > I'd appreciate feedback, both on the implementation details and on the ide= a as a whole. Some open questions are: >=20 > - My assumption is that each SIM has a structure that associates a PERIPH'= s softc with the CDB that's being processed. It might require dereferencing a= chain of pointers, but it should be there in some fashion. >=20 > - The current implementation adds a few ioctls to the old ad(4) disk-devic= e driver. I would expect to add the new ioctls to da(4) and ada(4); is that t= he right thing to do? >=20 > Again, any feedback would be greatly appreciated. I'll be at the Dev/Vendo= r Summit next week, if anyone wants to discuss this with me in person. >=20 > Thanks, >=20 > Ravi > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2C9653BA-CC10-497B-ACEF-0C76AAE2FF44>