Date: Tue, 4 Nov 2014 00:21:14 +0000 From: "Pokala, Ravi" <rpokala@panasas.com> To: "freebsd-geom@freebsd.org" <freebsd-geom@freebsd.org> Subject: Low-level trace-buffers in CAM Message-ID: <D07D5BA2.122AD1%rpokala@panasas.com>
next in thread | raw e-mail | index | archive | help
Hi folks, At Panasas, we found it useful to have a trace buffer for each ATA drive in the system, for gathering information about command sequences and latency. I implemented that functionality for our old 7-ish ATA driver, and it works quite well, with fairly low overhead. For example (sorry for the really long lines...): # Tabular format for use w/ other tools # Request time ATA request information Response time ATA response information sts/err flags 1415043680586349 25 0000 0200 000001b8c200 1415043680587357 25 0000 0200 000001b8c200 50 00 0000007d 1415043680587389 25 0000 0200 000001b8c400 1415043680588396 25 0000 0200 000001b8c400 50 00 0000007d 1415043680588430 25 0000 0200 000001b8c600 1415043680589406 25 0000 0200 000001b8c600 50 00 0000007d 1415043681203051 ca 0000 0008 000000000480 1415043681203279 ca 0000 0008 000000000480 50 00 0000007d 1415043681203284 ca 0000 0008 000000002f00 1415043681204022 ca 0000 0008 000000002f00 50 00 0000007d 1415043681596331 25 0000 0200 000001b8c800 1415043681597389 25 0000 0200 000001b8c800 50 00 0000007d 1415043681597423 25 0000 0200 000001b8ca00 1415043681598434 25 0000 0200 000001b8ca00 50 00 0000007d 1415043681598466 25 0000 0200 000001b8cc00 1415043681599484 25 0000 0200 000001b8cc00 50 00 0000007d 1415043681599517 25 0000 0200 000001b8ce00 1415043681600515 25 0000 0200 000001b8ce00 50 00 0000007d 1415043681934541 25 0000 0200 000000586600 1415043681935579 25 0000 0200 000000586600 50 00 0000007d # Human-readable FLAGS REQUEST TIME RESPONSE TIME ELAPSED TIME REQUEST COMMAND STATUS ERROR RESPONSE COMMAND =20 -------------------------------- -------------------- -------------------- -------------------- --------------------------------------------------------------------------- ----- -------- -------- --------------------------------------------------------------------------- ----- ____C...._V 1415043680586349 1415043680587357 1008 READ DMA EXT (LBA 28885504 + 512 sectors) _R_S____ ________ ---- =20 ____C...._V 1415043680587389 1415043680588396 1007 READ DMA EXT (LBA 28886016 + 512 sectors) _R_S____ ________ ---- =20 ____C...._V 1415043680588430 1415043680589406 976 READ DMA EXT (LBA 28886528 + 512 sectors) _R_S____ ________ ---- =20 ____C...._V 1415043681203051 1415043681203279 228 WRITE DMA (LBA 1152 + 8 sectors) _R_S____ ________ ---- =20 ____C...._V 1415043681203284 1415043681204022 738 WRITE DMA (LBA 12032 + 8 sectors) _R_S____ ________ ---- =20 ____C...._V 1415043681596331 1415043681597389 1058 READ DMA EXT (LBA 28887040 + 512 sectors) _R_S____ ________ ---- =20 ____C...._V 1415043681597423 1415043681598434 1011 READ DMA EXT (LBA 28887552 + 512 sectors) _R_S____ ________ ---- =20 ____C...._V 1415043681598466 1415043681599484 1018 READ DMA EXT (LBA 28888064 + 512 sectors) _R_S____ ________ ---- =20 As we start moving toward a more modern base release, we want to re-implement that functionality for ATA, SCSI, NVMe, etc. And push it upsream. The obvious place to put this sort of tracing would be in GEOM, but that actually isn't what we want, for a few reasons. First of all, we're primarily interested in the exact register / taskfile / CDB values sent to the drive - opcode, feature codes, LBAs, etc. Second, we're interested in hardware-level latency - how long does a request stay in the SIM and PERIPH. Both of those pretty much require working below GEOM. Therefore, I think we want to do it in the CAM stack; specifically, the hooks would need to go into some of the SIMs. (Yes, I know, this requires touching every SIM which we want to use. It's not optimal, but the hooks should be pretty lightweight and easy to add, and nothing will break if a given SIM isn't updated.) I took our implementation, and pseudo-coded what it might look like in CAM. (I could probably make diffs against the old ATA stack available for comparison, but I figured no one would care anymore.) A huge caveat is that I've never really poked in CAM before, so I fully expect I'm doing several things wrong. Please educate me. :-) =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D Create a circular buffer of trace records for each CAM PERIPH, getting information as close to the device as possible. Each record will contain the full CDB for the request as sent to the device, the timestamp (in usec) of the request, the CDB as returned by the device, the timestamp of the response, and the status and error code. Additionally, flags are provided to track the state of the trace record, and to indicate which type of periph the record is associated with. typedef struct { u_int32_t flags; #define CAM_TRACE_F_VALID 0x00000001 /* valid record */ #define CAM_TRACE_F_INPROGRESS 0x00000002 /* request in progress */ #define CAM_TRACE_F_REQ_VALID 0x00000004 /* request data valid */ #define CAM_TRACE_F_RESP_VALID 0x00000008 /* response data valid */ #define CAM_TRACE_F_COMPLETE 0x00000010 /* response gathered */ #define CAM_TRACE_F_TIMEDOUT 0x00000020 /* request timed out */ #define CAM_TRACE_F_ABANDONED 0x00000040 /* cancelled, etc. */ #define CAM_TRACE_F_RETRIED 0x00000080 /* retry in later record */ #define CAM_TRACE_F_IS_RETRY 0x00000100 /* retry of earlier req. */ #define CAM_TRACE_F_PERIPH_MASK 0xf0000000 /* up to 16 types of PERIPHs */ #define CAM_TRACE_F_PERIPH_DA 0x0 #define CAM_TRACE_F_PERIPH_ADA 0x1 #define CAM_TRACE_F_PERIPH_NVD 0x2 u_int64_t req_usec; cdb_t req_cdb; u_int64_t resp_usec; cdb_t resp_cdb; u_int8_t resp_status; u_int8_t resp_error; u_int8_t resp_asc; /* Needed for SCSI? */ u_int8_t resp_asq; /* Needed for SCSI? */ u_int8_t padding[???]; /* Pad for alignment? */ /* There's going to be a large-ish array of these in the kernel; make sure they're packed and aligned to minimize space usage and keep things aligned. The original implementation was 48 bytes, which was already aligned. */ } cam_trace_record_t __attribute__((packed, aligned(8))); typedef struct { u_int32_t trace_buf_record_count; cam_trace_record_t *trace_buf; } cam_trace_buffer_t; #define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000 ---------------- ioctl interfaces are used to get the trace buffer size, to retrieve the buffer contents, or to clear the buffer. The buffer can then be analyzed from userspace, or written to a file for post-processing. #define CAM_GET_TRACE_RECORD_COUNT _IOR(???, ???, u_int32_t) #define CAM_GET_TRACE_BUFFER _IOWR(???, ???, cam_trace_buffer_t) #define CAM_CLEAR_TRACE_BUFFER _IOW(???, ???, u_int32_t) ---------------- To get the buffer: cam_trace_buffer_t trace =3D NULL; fd =3D open(periph_name, O_RDONLY); error =3D ioctl(fd, CAM_GET_TRACE_RECORD_COUNT, &trace.trace_buf_record_count); trace.trace_buf =3D malloc(record_count * sizeof (cam_trace_record_t)); error =3D ioctl(fd, CAM_GET_TRACE_BUFFER, &trace); By including the full CDB, the entire command can be decoded by analysis tools. Tools can be written to parse the buffer and translate command data into human-readable descriptions, calculate elapsed times for requests, perform statistical analysis based on opcode, command type, transfer size, etc. Command decoders would need to be written for each PERIPH type (i.e. one for ATA, one for SCSI, one for NVMe, etc.). ---------------- Each PERIPH's softc needs to include some information about the per-device buffer. u_int32_t trace_buf_record_count; u_int64_t trace_buf_request_count; cam_trace_record_t *trace_buf; In each SIM, there should already be some sort of request structure, which associates the softc and the CDB of the request. To that structure, add a pointer to the trace record associated with the request. request structure: cam_trace_record_t *trace_rec; ---------------- When initializing the per-device softc, allocate a trace buffer. The number of records can be hard-coded, or settable via a tunable. /** Initialize per-drive command tracing buffer. **/ /* Get the buffer size from the tunable, if present. Sanity check it. */ sc->trace_buf_request_count =3D 0; sc->trace_buf_record_count =3D CAM_TRACE_BUFFER_DEFAULT_COUNT; TUNABLE_ULONG_FETCH("cam.trace_buf_record_count", &trace_buf_record_count_tunable); if (trace_buf_record_count_tunable) { /* Less than 1K records is unusably few; more than 1M records is unusably many. */ if ((trace_buf_record_count_tunable < (1 << 10)) || (trace_buf_record_count_tunable > (1 << 20))) { device_printf(dev, "cam.trace_buf_record_count=3D%lu" " out of range; should be between %u and %u\n", trace_buf_record_count_tunable, (1 << 10), (1 << 20)); } else { sc->trace_buf_record_count =3D (u_int32_t) trace_buf_record_count_tunabl= e; } } /* Allocate the buffer. */ trace_buf_bytes =3D sc->trace_buf_record_count * sizeof(cam_trace_record_t= ); sc->trace_buf =3D malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO); if (sc->trace_buf =3D=3D NULL) { device_printf(dev, "unable to allocate %zd for trace_buf\n", trace_buf_bytes); sc->trace_buf_record_count =3D 0; } else { device_printf(dev, "allocated %zd @ %p for trace_buf\n", trace_buf_bytes, sc->trace_buf); } ---------------- /* Initial setup of the trace record for a new request. * * Locate the circular per-device trace buffer in the drive's softc. * Populate the pointer to the next record in the trace buffer; this is * done atomically to prevent concurrent requests from getting the same * index. That could get a bit weird if/when 'trace_buf_request_count' * rolls over, if the maximum counter value is not a multiple of * 'trace_buf_record_count' (i.e. the request counter recycles in the * middle of the buffer). By using a 64-bit request counter, we're pretty * safe - even at one transaction per nanosecond, that will still last * about 585 years. * Initialize the record by zeroing it and then setting the VALID flag. * * This could be done at any point before sending the request to the * hardware. */ void trace_rec_init(cam_request_t req) { ada_softc *sc =3D (request->...->softc) u_int64_t tmp =3D 0; if (sc && sc->trace_buf) { tmp =3D atomic_fetchadd_log(&sc->trace_buf_request_count, 1); req->trace_rec =3D &sc->trace_buf[tmp % sc->trace_buf_record_count]; bzero(req->trace_rec, sizeof(*req->trace_rec)); req->trace_rec->flags =3D (CAM_TRACE_F_PERIPH_XXX << 28); req->trace_rec->flags |=3D CAM_TRACE_F_VALID; } else { req->trace_rec =3D NULL; } } ---------------- /* Stuff the request data into the trace record. * * If the trace record is present and valid, copy the CDB into the * record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is * valid, and INPROGRESS to show that the request is in-progress. * * This should be done just before sending the request to the hardware. * This should *not* be fused into trace_rec_init(), because * trace_rec_init() is also used in the retry case, where it is not correct * to immediately start the timer. */ void trace_rec_req(cam_request_t req) { struct timeval tv; if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb)); microuptime(&tv); req->trace_rec->req_usec =3D ((uint64_t) tv->tv_sec * 1000000) + tv->tv_usec; req->trace_rec->flags |=3D (CAM_TRACE_F_REQ_VALID | CAM_TRACE_F_INPROGRESS); } } ---------------- /* Stuff the response data into the trace record. * * If the trace record is present and valid, copy the CDB **as returned by * the drive** into the record. At this point, we also have the result from * the command (status, error, (asc? ascq?)), so save them too. Set * CAM_TRACE_F_RESP_VALID to indicate that the response data is valid. * Clear INPROGRESS and set COMPLETE instead. * * This should be done just after getting the response information from the * hardware */ void trace_rec_resp(cam_request_t req) { struct timeval tv; if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb)); microuptime(&tv); req->trace_rec->resp_usec =3D ((uint64_t) tv->tv_sec * 1000000) + tv->tv_usec; req->trace_rec->flags |=3D CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE)= ; req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS; } } ---------------- /* Update the trace record if there was a timeout. * * If the trace record is present and valid, set TIMEDOUT to indicate * that the request associated with the record exceeded its timeout. * Since this attempt is complete, clear INPROGRESS and set COMPLETE. * * This should be done along with other timeout handling. */ void trace_rec_timeout(cam_request_t req) { if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS; req->trace_rec->flags |=3D (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE); } } ---------------- /* Update the trace record if the request is being retried. * =20 * If the trace record is present and valid, set RETRIED to indicate * that the request associated with the record is going to be retried. * Since this attempt is complete, clear INPROGRESS and set COMPLETE. * Then get a new trace record for the new attempt, copy the request * command data to the new record, and set IS_RETRY to show that this * is a retry of an earlier request. * * This should be done at the time the retry is queued. */ void trace_rec_retry(cam_request_t req) { if (req->trace_rec && (req->trace_rec->flags & PAN_ATA_TRACE_F_VALID)) { /* First, mark the original request as being requeued. */ req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS; req->trace_rec->flags |=3D (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE); /* Get a new record for the retry. */ trace_rec_init(req); /* Mark the new record as being a retry. */ req->trace_rec->flags |=3D CAM_TRACE_F_IS_RETRY; } } =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D I'd appreciate feedback, both on the implementation details and on the idea as a whole. Some open questions are: - My assumption is that each SIM has a structure that associates a PERIPH's softc with the CDB that's being processed. It might require dereferencing a chain of pointers, but it should be there in some fashion. - The current implementation adds a few ioctls to the old ad(4) disk-device driver. I would expect to add the new ioctls to da(4) and ada(4); is that the right thing to do? Again, any feedback would be greatly appreciated. I'm at the Dev/Vendor Summit today and tomorrow if anyone wants to discuss this with me in person. Thanks, Ravi
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?D07D5BA2.122AD1%rpokala>