Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 9 May 2024 19:58:32 +0200
From:      Peter Eriksson <pen@lysator.liu.se>
To:        FreeBSD Hackers <freebsd-hackers@FreeBSD.org>
Subject:   Trying to understand CAM (and the cciss driver)
Message-ID:  <087AB4C4-9CE0-4024-8E1B-4C636014C8CC@lysator.liu.se>

next in thread | raw e-mail | index | archive | help
Hi,

I=E2=80=99m trying to fix a bug in the cciss driver that has been there =
=E2=80=9Cforever=E2=80=9D when using it with an HP H241 SAS HBA card.=20
The driver works fine when all (SAS, spinning rust) drives are behaving =
well, but when some of them are starting to go bad it often goes into =
spin and either hangs the kernel or panics. I=E2=80=99ve been trying to =
add instrumentation to it in order to pin-point the problem and have =
been attempting some workarounds (like clearing cr_complete since =
without that hack sometime the driver get many many non-busy repeated =
requests with the same =E2=80=9Ctag=E2=80=9D and then It panics with:=20

> ciss1: WARNING: completing non-busy request
> ciss1: WARNING: completing non-busy request
> ciss1: WARNING: completing non-busy request
> panic: cam_periph_done_panic: already done with ccb 0xfffff88abefc4000
> cpuid =3D 0
> time =3D 1714852260
> KDB: stack backtrace:
> #0 0xffffffff80c541b5 at kdb_backtrace+0x65
> #1 0xffffffff80c06b31 at vpanic+0x151
> #2 0xffffffff80c069d3 at panic+0x43
> #3 0xffffffff8039017c at cam_periph_done_panic+0x1c
> #4 0xffffffff80397193 at xpt_done_process+0x313
> #5 0xffffffff803990a5 at xpt_done_td+0xf5
> #6 0xffffffff80bc333e at fork_exit+0x7e
> #7 0xffffffff8108a44e at fork_trampoline+0xe
=20
but it=E2=80=99s still there=E2=80=A6=20

I think I could use some input/suggestions from someone more =
knowledgeable with CAM and device drivers in general=E2=80=A6

An example (this is with extra instrumentation added so not quite a =
stock kernel driver):

ciss1: *** Drive failure imminent, Port=3D1E Box=3D1 Bay=3D17 =
reason=3D0x37 [ts=3D4460, class=3D4, subclass=3D0, detail=3D1, tag=3D0x2]
(da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 51 98 00 =
00 00 10 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 52 18 00 =
00 00 10 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 65 c8 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 63 c8 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 61 c8 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f c8 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f a8 00 =
00 00 20 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f 38 00 =
00 00 70 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5d 38 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5b 38 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 59 38 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 58 48 00 =
00 00 f0 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 56 48 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 48 d0 00 =
00 00 10 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 46 e0 00 =
00 01 f0 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8f d0 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8d d0 00 =
00 02 00 00 00=20
(da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, =
ccphys=3D1791048448, tag=3D72, flags=3D0x50 (----I-C), sg_tag=3D0x3, =
cr_complete=3D0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, =
ccphys=3D1791318368, tag=3D313, flags=3D0x50 (----I-C), sg_tag=3D0x3, =
cr_complete=3D0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D8192, =
ccphys=3D1791226528, tag=3D231, flags=3D0x50 (----I-C), sg_tag=3D0x3, =
cr_complete=3D0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, =
ccphys=3D1791313888, tag=3D309, flags=3D0x50 (----I-C), sg_tag=3D0x5, =
cr_complete=3D0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D8192, =
ccphys=3D1791206368, tag=3D213, flags=3D0x50 (----I-C), sg_tag=3D0x3, =
cr_complete=3D0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
panic: camq_remove: Attempt to remove out-of-bounds index -1 from queue =
0xfffff80107fa9838 of size 1
cpuid =3D 7
time =3D 1715266713
KDB: stack backtrace:
#0 0xffffffff80c542d5 at kdb_backtrace+0x65
#1 0xffffffff80c06c51 at vpanic+0x151
#2 0xffffffff80c06af3 at panic+0x43
#3 0xffffffff80390676 at camq_remove+0xf6
#4 0xffffffff803936c8 at xpt_run_devq+0x148
#5 0xffffffff80392b0c at xpt_action_default+0x35c
#6 0xffffffff803c3544 at dastart+0x314
#7 0xffffffff80394f73 at xpt_run_allocq+0x83
#8 0xffffffff803c5152 at dastrategy+0x82
#9 0xffffffff80b438d4 at g_disk_start+0x314
#10 0xffffffff80b46b55 at g_io_request+0x1d5
#11 0xffffffff80b46b55 at g_io_request+0x1d5
#12 0xffffffff821a34dc at vdev_geom_io_start+0x23c
#13 0xffffffff82347fb4 at zio_vdev_io_start+0x204
#14 0xffffffff8234204f at zio_nowait+0x12f
#15 0xffffffff822a0ce7 at vdev_mirror_io_start+0x177
#16 0xffffffff82347fb4 at zio_vdev_io_start+0x204
#17 0xffffffff8234204f at zio_nowait+0x12f
Uptime: 10h16m26s


The drive it complains about in the first line is one that has SMART =
errors:

>        connector 1E box  1 bay 17                 HP      =
MB010000JWAYK                                    7PH8LU3G     HPD5 =
S.M.A.R.T. predictive failure.

It is _not_ da122 though=E2=80=A6 The drive with a SMART error (da85) is =
not in use so basically is just sitting there.



The code where I=E2=80=99m trying to figure out what=E2=80=99s going on =
is this part in sys/dev/ciss/ciss.c:

static void
ciss_complete(struct ciss_softc *sc, cr_qhead_t *qh)
{
    struct ciss_request *cr;

    debug_called(2);

    /*                                                                   =
                                                                         =
                                 =20
     * Loop taking requests off the completed queue and performing       =
                                                                         =
                                 =20
     * completion processing on them.                                    =
                                                                         =
                                 =20
     */
    for (;;) {
      int got_nonbusy =3D 0;

        if ((cr =3D ciss_dequeue_complete(sc, qh)) =3D=3D NULL)
            break;
        ciss_unmap_request(cr);

        if ((cr->cr_flags & CISS_REQ_BUSY) =3D=3D 0) {
          cr->nonbusy_request_counter++;
          ciss_printf(sc,
                      "WARNING: completing non-busy request #%d [onq=3D%d,=
 length=3D%u, ccphys=3D%u, tag=3D%d, flags=3D0x%x (%s), sg_tag=3D0x%x, =
cr_complete=3D%p]\n",
                      cr->nonbusy_request_counter,
                      cr->cr_onq, cr->cr_length, cr->cr_ccphys, =
cr->cr_tag,
                      cr->cr_flags, cr_flags2str(cr->cr_flags),
                      cr->cr_sg_tag, cr->cr_complete
                      );
          got_nonbusy =3D 1;
        }

        cr->cr_flags &=3D ~CISS_REQ_BUSY;

        /*                                                               =
                                                                         =
                                 =20
         * If the request has a callback, invoke it.                     =
                                                                         =
                                 =20
         */
        if (cr->cr_complete !=3D NULL) {
          cr->cr_complete(cr);

	/* pen: Attempt to make sure ciss_cam_complete() doesn=E2=80=99t =
get called multiple times */

          if (cr->cr_complete =3D=3D ciss_cam_complete && got_nonbusy) {
            ciss_printf(sc, "WARNING: clearing cr_complete due to =
ciss_cam_complete & nonbusy\n");
            cr->cr_complete =3D NULL;
          }
          continue;
        }     =20

        /*                                                               =
                                                                         =
                                 =20
         * If someone is sleeping on this request, wake them up.         =
                                                                         =
                                 =20
         */
        if (cr->cr_flags & CISS_REQ_SLEEP) {
          cr->cr_flags &=3D ~CISS_REQ_SLEEP;
          wakeup(cr);
          continue;
        }

        /*                                                               =
                                                                         =
                                 =20
         * If someone is polling this request for completion, signal.    =
                                                                         =
                                 =20
         */
        if (cr->cr_flags & CISS_REQ_POLL) {
          cr->cr_flags &=3D ~CISS_REQ_POLL;
          continue;
        }

        /*                                                               =
                                                                         =
                                 =20
         * Give up and throw the request back on the free queue.  This   =
                                                                         =
                                 =20
         * should never happen; resources will probably be lost.         =
                                                                         =
                                 =20
         */
        ciss_printf(sc, "WARNING: completed command with no =
submitter\n");
        ciss_enqueue_free(cr);
    }
}


- Any suggestions? (Probably not but I thought I=E2=80=99d ask anyway =
:-)

- Peter





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?087AB4C4-9CE0-4024-8E1B-4C636014C8CC>