Date: Thu, 9 May 2024 19:58:32 +0200 From: Peter Eriksson <pen@lysator.liu.se> To: FreeBSD Hackers <freebsd-hackers@FreeBSD.org> Subject: Trying to understand CAM (and the cciss driver) Message-ID: <087AB4C4-9CE0-4024-8E1B-4C636014C8CC@lysator.liu.se>
next in thread | raw e-mail | index | archive | help
Hi, I=E2=80=99m trying to fix a bug in the cciss driver that has been there = =E2=80=9Cforever=E2=80=9D when using it with an HP H241 SAS HBA card.=20 The driver works fine when all (SAS, spinning rust) drives are behaving = well, but when some of them are starting to go bad it often goes into = spin and either hangs the kernel or panics. I=E2=80=99ve been trying to = add instrumentation to it in order to pin-point the problem and have = been attempting some workarounds (like clearing cr_complete since = without that hack sometime the driver get many many non-busy repeated = requests with the same =E2=80=9Ctag=E2=80=9D and then It panics with:=20 > ciss1: WARNING: completing non-busy request > ciss1: WARNING: completing non-busy request > ciss1: WARNING: completing non-busy request > panic: cam_periph_done_panic: already done with ccb 0xfffff88abefc4000 > cpuid =3D 0 > time =3D 1714852260 > KDB: stack backtrace: > #0 0xffffffff80c541b5 at kdb_backtrace+0x65 > #1 0xffffffff80c06b31 at vpanic+0x151 > #2 0xffffffff80c069d3 at panic+0x43 > #3 0xffffffff8039017c at cam_periph_done_panic+0x1c > #4 0xffffffff80397193 at xpt_done_process+0x313 > #5 0xffffffff803990a5 at xpt_done_td+0xf5 > #6 0xffffffff80bc333e at fork_exit+0x7e > #7 0xffffffff8108a44e at fork_trampoline+0xe =20 but it=E2=80=99s still there=E2=80=A6=20 I think I could use some input/suggestions from someone more = knowledgeable with CAM and device drivers in general=E2=80=A6 An example (this is with extra instrumentation added so not quite a = stock kernel driver): ciss1: *** Drive failure imminent, Port=3D1E Box=3D1 Bay=3D17 = reason=3D0x37 [ts=3D4460, class=3D4, subclass=3D0, detail=3D1, tag=3D0x2] (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 51 98 00 = 00 00 10 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 52 18 00 = 00 00 10 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 65 c8 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 63 c8 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 61 c8 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f c8 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f a8 00 = 00 00 20 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f 38 00 = 00 00 70 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5d 38 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5b 38 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 59 38 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 58 48 00 = 00 00 f0 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 56 48 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 48 d0 00 = 00 00 10 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 46 e0 00 = 00 01 f0 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8f d0 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8d d0 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, = ccphys=3D1791048448, tag=3D72, flags=3D0x50 (----I-C), sg_tag=3D0x3, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, = ccphys=3D1791318368, tag=3D313, flags=3D0x50 (----I-C), sg_tag=3D0x3, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D8192, = ccphys=3D1791226528, tag=3D231, flags=3D0x50 (----I-C), sg_tag=3D0x3, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, = ccphys=3D1791313888, tag=3D309, flags=3D0x50 (----I-C), sg_tag=3D0x5, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D8192, = ccphys=3D1791206368, tag=3D213, flags=3D0x50 (----I-C), sg_tag=3D0x3, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy panic: camq_remove: Attempt to remove out-of-bounds index -1 from queue = 0xfffff80107fa9838 of size 1 cpuid =3D 7 time =3D 1715266713 KDB: stack backtrace: #0 0xffffffff80c542d5 at kdb_backtrace+0x65 #1 0xffffffff80c06c51 at vpanic+0x151 #2 0xffffffff80c06af3 at panic+0x43 #3 0xffffffff80390676 at camq_remove+0xf6 #4 0xffffffff803936c8 at xpt_run_devq+0x148 #5 0xffffffff80392b0c at xpt_action_default+0x35c #6 0xffffffff803c3544 at dastart+0x314 #7 0xffffffff80394f73 at xpt_run_allocq+0x83 #8 0xffffffff803c5152 at dastrategy+0x82 #9 0xffffffff80b438d4 at g_disk_start+0x314 #10 0xffffffff80b46b55 at g_io_request+0x1d5 #11 0xffffffff80b46b55 at g_io_request+0x1d5 #12 0xffffffff821a34dc at vdev_geom_io_start+0x23c #13 0xffffffff82347fb4 at zio_vdev_io_start+0x204 #14 0xffffffff8234204f at zio_nowait+0x12f #15 0xffffffff822a0ce7 at vdev_mirror_io_start+0x177 #16 0xffffffff82347fb4 at zio_vdev_io_start+0x204 #17 0xffffffff8234204f at zio_nowait+0x12f Uptime: 10h16m26s The drive it complains about in the first line is one that has SMART = errors: > connector 1E box 1 bay 17 HP = MB010000JWAYK 7PH8LU3G HPD5 = S.M.A.R.T. predictive failure. It is _not_ da122 though=E2=80=A6 The drive with a SMART error (da85) is = not in use so basically is just sitting there. The code where I=E2=80=99m trying to figure out what=E2=80=99s going on = is this part in sys/dev/ciss/ciss.c: static void ciss_complete(struct ciss_softc *sc, cr_qhead_t *qh) { struct ciss_request *cr; debug_called(2); /* = = =20 * Loop taking requests off the completed queue and performing = = =20 * completion processing on them. = = =20 */ for (;;) { int got_nonbusy =3D 0; if ((cr =3D ciss_dequeue_complete(sc, qh)) =3D=3D NULL) break; ciss_unmap_request(cr); if ((cr->cr_flags & CISS_REQ_BUSY) =3D=3D 0) { cr->nonbusy_request_counter++; ciss_printf(sc, "WARNING: completing non-busy request #%d [onq=3D%d,= length=3D%u, ccphys=3D%u, tag=3D%d, flags=3D0x%x (%s), sg_tag=3D0x%x, = cr_complete=3D%p]\n", cr->nonbusy_request_counter, cr->cr_onq, cr->cr_length, cr->cr_ccphys, = cr->cr_tag, cr->cr_flags, cr_flags2str(cr->cr_flags), cr->cr_sg_tag, cr->cr_complete ); got_nonbusy =3D 1; } cr->cr_flags &=3D ~CISS_REQ_BUSY; /* = = =20 * If the request has a callback, invoke it. = = =20 */ if (cr->cr_complete !=3D NULL) { cr->cr_complete(cr); /* pen: Attempt to make sure ciss_cam_complete() doesn=E2=80=99t = get called multiple times */ if (cr->cr_complete =3D=3D ciss_cam_complete && got_nonbusy) { ciss_printf(sc, "WARNING: clearing cr_complete due to = ciss_cam_complete & nonbusy\n"); cr->cr_complete =3D NULL; } continue; } =20 /* = = =20 * If someone is sleeping on this request, wake them up. = = =20 */ if (cr->cr_flags & CISS_REQ_SLEEP) { cr->cr_flags &=3D ~CISS_REQ_SLEEP; wakeup(cr); continue; } /* = = =20 * If someone is polling this request for completion, signal. = = =20 */ if (cr->cr_flags & CISS_REQ_POLL) { cr->cr_flags &=3D ~CISS_REQ_POLL; continue; } /* = = =20 * Give up and throw the request back on the free queue. This = = =20 * should never happen; resources will probably be lost. = = =20 */ ciss_printf(sc, "WARNING: completed command with no = submitter\n"); ciss_enqueue_free(cr); } } - Any suggestions? (Probably not but I thought I=E2=80=99d ask anyway = :-) - Peter
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?087AB4C4-9CE0-4024-8E1B-4C636014C8CC>