FreeBSD Mail Archives

Date:      Mon, 26 Oct 2015 21:08:42 -0600
From:      Scott Long <scott4long@yahoo.com>
To:        "Pokala, Ravi" <rpokala@panasas.com>
Cc:        "freebsd-geom@freebsd.org" <freebsd-geom@freebsd.org>, "freebsd-scsi@freebsd.org" <freebsd-scsi@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, "ken@freebsd.org" <ken@freebsd.org>, "imp@freebsd.org" <imp@freebsd.org>, "scottl@freebsd.org" <scottl@freebsd.org>
Subject:   Re: Low-level trace-buffers in CAM
Message-ID:  <2C9653BA-CC10-497B-ACEF-0C76AAE2FF44@yahoo.com>
In-Reply-To: <A760883F-317D-46C9-AD7C-B8F5D96A49DA@panasas.com>
References:  <A760883F-317D-46C9-AD7C-B8F5D96A49DA@panasas.com>

Hi Ravi,

Yes I was one of the people who never followed up.  I'll be at the vendor su=
mmit next week, let's chat some more.

Scott

Sent from my iPhone

> On Oct 26, 2015, at 8:22 PM, Pokala, Ravi <rpokala@panasas.com> wrote:
>=20
> Hi folks,
>=20
> ----------------------------------------------------------------
> This is an updated re-send of a message I originally sent about a year ago=
, during MeetBSD 2014. A few people expressed interest in person, but no one=
 ever followed up on the lists. I'm bringing this up again, in the hopes tha=
t I can get some feedback before / during next week's Dev/Vendor Summit. I'm=
 CCing some folks who expressed interest at that time, and some folks that I=
 was told would be interested.
> ----------------------------------------------------------------
>=20
> At Panasas, we found it useful to have a trace buffer for each ATA drive i=
n the system, for gathering information about command sequences and latency.=
 I implemented that functionality for our old 7-ish ATA driver, and it works=
 quite well, with fairly low overhead. (I ran our usual suite of performance=
 tests, and any differences were lost in the noise.) These traces allowed us=
 to capture and replay access patterns (though obviously not with the same d=
ata), analyze and compare latency, etc.
>=20
> As Panasas moves toward a more modern base, we want to re-implement this f=
unctionality for ATA, SCSI, NVMe, etc. And push it upstream, of course! :-)
>=20
> First, some example output; I wrote code for three different formats:
>  - The binary trace buffer dump
>    Useful for manipulating in C; the other two formats are post-processed o=
ut of this, and I also wrote a replay utility and a basic stats analyzer whi=
ch operated on the binary dump.
>  - Tabular hex
>    Useful for parsing with scripts; I never actually used this format, but=
 it was trivial so I just went ahead and wrote it.
>  - Human-readable
>    Useful for eyeballing; on more than one occasion, I was able to see tha=
t I had mis-assembled a command in userspace, etc.
>=20
> These examples are actually really long lines; I've broken them up and ind=
ented for the sake of email:
>=20
>    # Tabular hex format, for use w/ scripts
>    # Request time       ATA request information   Response time       =20
>        ATA response information  sts/err  flags
>    1445839793003737     c8 0000 0004 000000002fa8 1445839793003806    =20
>        c8 0000 0004 000000002fa8 50 00 0000007d
>    1445839793003874     c8 0000 0020 000000113ac8 1445839793011541    =20
>        c8 0000 0020 000000113ac8 50 00 0000007d
>    1445839793011795     c8 0000 0008 000000118ca8 1445839793017702    =20
>        c8 0000 0008 000000118ca8 50 00 0000007d
>    1445839793017808     c8 0000 0080 000000117628 1445839793018975    =20
>        c8 0000 0080 000000117628 50 00 0000007d
>    1445839793019079     c8 0000 0020 0000001177a8 1445839793019686    =20
>        c8 0000 0020 0000001177a8 50 00 0000007d
>=20
>    # Human-readable
>    FLAGS       REQUEST TIME         RESPONSE TIME        ELAPSED TIME     =
   =20
>        REQUEST COMMAND                          STATUS   ERROR   =20
>        RESPONSE COMMAND                       =20
>    ----------- -------------------- -------------------- -----------------=
---=20
>        ---------------------------------------- -------- --------=20
>        ----------------------------------------
>    ____C...._V 1445839793003737     1445839793003806     69               =
   =20
>        READ DMA (LBA 12200 + 4 sectors)         _R_S____ ________=20
>        ----                                   =20
>    ____C...._V 1445839793003874     1445839793011541     7667             =
   =20
>        READ DMA (LBA 1129160 + 32 sectors)      _R_S____ ________=20
>        ----                                   =20
>    ____C...._V 1445839793011795     1445839793017702     5907             =
   =20
>        READ DMA (LBA 1150120 + 8 sectors)       _R_S____ ________=20
>        ----                                   =20
>    ____C...._V 1445839793017808     1445839793018975     1167             =
   =20
>        READ DMA (LBA 1144360 + 128 sectors)     _R_S____ ________=20
>        ----                                   =20
>    ____C...._V 1445839793019079     1445839793019686     607              =
   =20
>        READ DMA (LBA 1144744 + 32 sectors)      _R_S____ ________=20
>        ----                                   =20
>=20
> The obvious place to put this sort of tracing would be in GEOM, but that a=
ctually isn't what I want, for a few reasons. First of all, I'm primarily in=
terested in the exact register / taskfile / CDB values sent to the drive - o=
pcode, feature codes, LBAs, etc. Second, I'm interested in hardware-level la=
tency - how long does a request stay in the controller and device. Both of t=
hose pretty much require working below GEOM.
>=20
> Therefore, I think I want to do it in the CAM stack; ideally, I'd put the h=
ooks in the SIM drivers, to get as close to the metal as possible, to get th=
e most accurate latency. However, even a light touch to every SIM driver is g=
oing to be painful, and since I don't have access to every controller in the=
 world, testing would be difficult. Adding the hooks to da(4) and ada(4) wou=
ld still be pretty close to the metal, and would require much smaller change=
s. Or maybe they belong in the XPT layer? (I don't know nearly as much about=
 CAM as I should. :-/)
>=20
> --------
> NOTE: When I mentioned this idea at a (San Francisco) Bay Area FreeBSD Use=
r Group (BAFUG) meeting, there were suggestions about integrating this with D=
Trace or KTrace instead. I'm happy to consider either of those, but I know e=
ven less about those than I do about CAM. If that's really the best way to d=
o this sort of thing, I'm happy to learn, if someone is willing to teach.
> --------
>=20
> I took the Panasas / 7.x implementation, and pseudo-coded what it might lo=
ok like in CAM. (I could probably make diffs against the old ATA stack avail=
able for comparison, but I figured no one would care anymore.) A huge caveat=
 is that I've never really poked in CAM before, so I fully expect I'm doing s=
everal things wrong. Please educate me. :-)
>=20
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D
>=20
> Create a circular buffer of trace records for each CAM PERIPH. Each record=
 will contain the full CDB for the request as sent to the device, the timest=
amp (in usec) of the request, the CDB as returned by the device, the timesta=
mp of the response, and the status and error code. Additionally, flags are p=
rovided to track the state of the trace record, and to indicate which type o=
f periph the record is associated with.
>=20
> --------
>=20
> typedef struct {
>    u_int32_t    flags;
> #define CAM_TRACE_F_VALID    0x00000001    /* valid record */
> #define CAM_TRACE_F_INPROGRESS    0x00000002    /* request in progress */
> #define CAM_TRACE_F_REQ_VALID    0x00000004    /* request data valid */
> #define CAM_TRACE_F_RESP_VALID    0x00000008    /* response data valid */
> #define CAM_TRACE_F_COMPLETE    0x00000010    /* response gathered */
> #define CAM_TRACE_F_TIMEDOUT    0x00000020    /* request timed out */
> #define CAM_TRACE_F_ABANDONED    0x00000040    /* cancelled, etc. */
> #define CAM_TRACE_F_RETRIED    0x00000080    /* retry in later record */
> #define CAM_TRACE_F_IS_RETRY    0x00000100    /* retry of earlier req. */
> #define CAM_TRACE_F_PERIPH_MASK    0xf0000000    /* up to 16 PERIPH types *=
/
> #define CAM_TRACE_F_PERIPH_DA    0x0
> #define CAM_TRACE_F_PERIPH_ADA    0x1
> #define CAM_TRACE_F_PERIPH_NVD    0x2
>    u_int64_t    req_usec;
>    cdb_t        req_cdb;
>    u_int64_t    resp_usec;
>    cdb_t        resp_cdb;
>    u_int8_t    resp_status;
>    u_int8_t    resp_error;
>    u_int8_t    resp_asc;    /* Needed for SCSI? */
>    u_int8_t    resp_asq;    /* Needed for SCSI? */
>    u_int8_t    padding[8];    /* Pad to 64 bytes */
> /* There's going to be a large-ish array of these in the kernel; make sure=

> * they're packed to minimize space usage, and keep things aligned. It may
> * make sense to re-order the fields for alignment as well.
> */
> } cam_trace_record_t __attribute__((packed, aligned(8)));
>=20
> typedef struct {
>    u_int32_t        trace_buf_record_count;
>    cam_trace_record_t    *trace_buf;
> } cam_trace_buffer_t;
>=20
> /* This is settable via a tunable; for Panasas' purposes, 100K entries was=

> * enough to provide a good amount of history even on a busy system, while
> * not taking up much space (< 5MB/drive).
> */
> #define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000
>=20
> ----------------
>=20
> ioctl interfaces are used to get the trace buffer size, to retrieve the bu=
ffer contents, or to clear the buffer. The buffer can then be analyzed from u=
serspace, or written to a file for post-processing.
>=20
> --------
>=20
> #define CAM_GET_TRACE_RECORD_COUNT    _IOR(???, ???, u_int32_t)
> #define CAM_GET_TRACE_BUFFER        _IOWR(???, ???, cam_trace_buffer_t)
> #define CAM_CLEAR_TRACE_BUFFER        _IOW(???, ???, u_int32_t)
>=20
> ----------------
>=20
> To get the buffer:
>=20
> --------
>=20
>    cam_trace_buffer_t trace =3D NULL;
>    fd =3D open(periph_name, O_RDONLY);
>    error =3D ioctl(fd, CAM_GET_TRACE_RECORD_COUNT,
>        &trace.trace_buf_record_count);
>    trace.trace_buf =3D malloc(trace.trace_buf_record_count *
>        sizeof(cam_trace_record_t));
>    error =3D ioctl(fd, CAM_GET_TRACE_BUFFER, &trace);
>=20
> ----------------
>=20
> By including the full CDB, the entire command can be decoded by analysis t=
ools. Tools can be written to parse the buffer and translate command data in=
to human-readable descriptions, calculate elapsed times for requests, perfor=
m statistical analysis based on opcode, command type, transfer size, etc. Co=
mmand decoders would need to be written for each PERIPH type (i.e. one for A=
TA, one for SCSI, one for NVMe, etc.).
>=20
> ----------------
>=20
> Each PERIPH's softc needs to include some information about the per-device=

> buffer.
>=20
> --------
>=20
>    u_int32_t        trace_buf_record_count;
>    u_int64_t        trace_buf_request_count;
>    cam_trace_record_t    *trace_buf;
>=20
> ----------------
>=20
> In each SIM, there should already be some sort of request structure, which=
 associates the softc and the CDB of the request. To that structure, add a p=
ointer to the trace record associated with the request.
>=20
> --------
>=20
> request structure:
>    cam_trace_record_t    *trace_rec;
>=20
> ----------------
>=20
> When initializing the per-device softc, allocate a trace buffer. The numbe=
r of records can be hard-coded, or settable via a tunable.
>=20
> The example below has a global size for everything, but it should be possi=
ble to do it on a per-device or per-PERIPH basis.
>=20
> --------
>=20
>    /** Initialize per-drive command tracing buffer. **/
>    /* Get the buffer size from the tunable. Sanity check it. */
>    sc->trace_buf_record_count =3D CAM_TRACE_BUFFER_DEFAULT_COUNT;
>    TUNABLE_ULONG_FETCH("cam.trace_buf_record_count",
>        &trace_buf_record_count_tunable);
>    if (trace_buf_record_count_tunable) {
>        /* Less than 1K records is unusably few; more than 1M
>         * records is unusably many.
>         */
>        if ((trace_buf_record_count_tunable < (1 << 10)) ||
>            (trace_buf_record_count_tunable > (1 << 20))) {
>            device_printf(dev, "cam.trace_buf_record_count=3D%lu"
>                " out of range; should be between %u and %u\n",
>                trace_buf_record_count_tunable, (1 << 10),
>                (1 << 20));
>        } else {
>            sc->trace_buf_record_count =3D (u_int32_t)
>                trace_buf_record_count_tunable;
>        }
>    }
>    /* Allocate the buffer. */
>    trace_buf_bytes =3D sc->trace_buf_record_count *
>        sizeof(cam_trace_record_t);
>    sc->trace_buf =3D malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO);
>    if (sc->trace_buf =3D=3D NULL) {
>        device_printf(dev, "unable to allocate %zd for trace_buf\n",
>            trace_buf_bytes);
>        sc->trace_buf_record_count =3D 0;
>    } else {
>        device_printf(dev, "allocated %zd @ %p for trace_buf\n",
>            trace_buf_bytes, sc->trace_buf);
>    }
>    /* Request counter, used for indexing */
>    sc->trace_buf_request_count =3D 0;
>=20
>=20
> ----------------
>=20
> /* Initial setup of the trace record for a new request.
> *
> *  Locate the circular per-device trace buffer in the drive's softc.
> *  Populate the pointer to the next record in the trace buffer; this is
> *  done atomically to prevent concurrent requests from getting the same
> *  index. That could get a bit weird if/when 'trace_buf_request_count'
> *  rolls over, if the maximum counter value is not a multiple of
> *  'trace_buf_record_count' (i.e. the request counter recycles in the
> *  middle of the buffer). By using a 64-bit request counter, we're pretty
> *  safe - even at one transaction per nanosecond, that will still last
> *  about 585 years.
> *
> *  Initialize the record by zeroing it and then setting the VALID flag.
> *
> *  This could be done at any point before sending the request to the
> *  hardware; neither the CDB data nor the timestamps are involved yet.
> */
> void
> trace_rec_init(cam_request_t req) {
>    ada_softc *sc =3D (request->...->softc)
>    u_int64_t tmp =3D 0;
>    if (sc && sc->trace_buf) {
>        tmp =3D atomic_fetchadd_log(&sc->trace_buf_request_count, 1);
>        req->trace_rec =3D &sc->trace_buf[tmp %
>            sc->trace_buf_record_count];
>        bzero(req->trace_rec, sizeof(*req->trace_rec));
>        req->trace_rec->flags =3D (CAM_TRACE_F_PERIPH_XXX << 28);
>        req->trace_rec->flags |=3D CAM_TRACE_F_VALID;
>    } else {
>        req->trace_rec =3D NULL;
>    }
> }
>=20
> ----------------
>=20
> /* Stuff the request data into the trace record.
> *
> *  If the trace record is present and valid, copy the CDB into the
> *  record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is
> *  valid, and INPROGRESS to show that the request is in-progress.
> *
> *  This should be done just before sending the request to the hardware.
> *  This should *not* be fused into trace_rec_init(), because
> *  trace_rec_init() is also used in the retry case, where it is not
> * correct to immediately start the timer.
> */
> void
> trace_rec_req(cam_request_t req) {
>    struct timeval tv;
>    if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
>        bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb));
>        microuptime(&tv);
>        req->trace_rec->req_usec =3D ((uint64_t) tv->tv_sec * 1000000)
>            + tv->tv_usec;
>        req->trace_rec->flags |=3D
>            (CAM_TRACE_F_REQ_VALID | CAM_TRACE_F_INPROGRESS);
>    }
> }
>=20
> ----------------
>=20
> /* Stuff the response data into the trace record.
> *
> *  If the trace record is present and valid, copy the CDB **as returned by=

> *  the drive** into the record. At this point, we also have the result fro=
m
> *  the command (status, error, (asc? ascq?)), so save them too. Set
> *  CAM_TRACE_F_RESP_VALID to indicate that the response data is valid.
> *  Clear INPROGRESS and set COMPLETE instead.
> *
> *  This should be done just after getting the response information from th=
e
> *  hardware.
> */
> void
> trace_rec_resp(cam_request_t req) {
>    struct timeval tv;
>    if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
>        microuptime(&tv);
>        bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb));
>        req->trace_rec->resp_usec =3D ((uint64_t) tv->tv_sec *
>            1000000) + tv->tv_usec;
>        req->trace_rec->flags |=3D
>            (CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE);
>        req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS;
>    }
> }
>=20
> ----------------
>=20
> /* Update the trace record if there was a timeout.
> *
> *  If the trace record is present and valid, set TIMEDOUT to indicate
> *  that the request associated with the record exceeded its timeout.
> *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
> *
> *  This should be done along with other timeout handling.
> */
> void
> trace_rec_timeout(cam_request_t req) {
>    if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
>        req->trace_rec->flags |=3D
>            (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE);
>        req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS;
>    }
> }
>=20
> ----------------
>=20
> /* Update the trace record if the request is being retried.
> * =20
> *  If the trace record is present and valid, set RETRIED to indicate
> *  that the request associated with the record is going to be retried.
> *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
> *  Then get a new trace record for the new attempt, copy the request
> *  command data to the new record, and set IS_RETRY to show that this
> *  is a retry of an earlier request.
> *
> *  This should be done at the time the retry is queued.
> */
> void
> trace_rec_retry(cam_request_t req) {
>    if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
>        /* First, mark the original request as being requeued. */
>        req->trace_rec->flags |=3D
>            (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE);
>        req->trace_rec->flags &=3D ~CAM_TRACE_F_INPROGRESS;
>        /* Get a new record for the retry. */
>        trace_rec_init(req);
>        /* Mark the new record as being a retry. */
>        req->trace_rec->flags |=3D CAM_TRACE_F_IS_RETRY;
>    }
> }
>=20
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D
>=20
> I'd appreciate feedback, both on the implementation details and on the ide=
a as a whole. Some open questions are:
>=20
> - My assumption is that each SIM has a structure that associates a PERIPH'=
s softc with the CDB that's being processed. It might require dereferencing a=
 chain of pointers, but it should be there in some fashion.
>=20
> - The current implementation adds a few ioctls to the old ad(4) disk-devic=
e driver. I would expect to add the new ioctls to da(4) and ada(4); is that t=
he right thing to do?
>=20
> Again, any feedback would be greatly appreciated. I'll be at the Dev/Vendo=
r Summit next week, if anyone wants to discuss this with me in person.
>=20
> Thanks,
>=20
> Ravi
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2C9653BA-CC10-497B-ACEF-0C76AAE2FF44>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation