From nobody Thu May 9 17:58:32 2024 X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4Vb0BG5dVqz5KBsT for ; Thu, 09 May 2024 17:58:46 +0000 (UTC) (envelope-from pen@lysator.liu.se) Received: from mail.lysator.liu.se (mail.lysator.liu.se [130.236.254.3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4Vb0BF3gK5z4Xn1 for ; Thu, 9 May 2024 17:58:45 +0000 (UTC) (envelope-from pen@lysator.liu.se) Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=none) header.from=lysator.liu.se; spf=pass (mx1.freebsd.org: domain of pen@lysator.liu.se designates 130.236.254.3 as permitted sender) smtp.mailfrom=pen@lysator.liu.se Received: from mail.lysator.liu.se (localhost [127.0.0.1]) by mail.lysator.liu.se (Postfix) with ESMTP id 8095F100F7 for ; Thu, 9 May 2024 19:58:42 +0200 (CEST) Received: by mail.lysator.liu.se (Postfix, from userid 1004) id 747581007E; Thu, 9 May 2024 19:58:42 +0200 (CEST) X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on hermod.lysator.liu.se X-Spam-Level: X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED, T_SCC_BODY_TEXT_LINE autolearn=disabled version=4.0.0 X-Spam-Score: -1.0 Received: from smtpclient.apple (unknown [IPv6:2001:9b1:28fc:6000:58d3:48b0:3615:d77]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.lysator.liu.se (Postfix) with ESMTPSA id D59601015C for ; Thu, 9 May 2024 19:58:40 +0200 (CEST) From: Peter Eriksson Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable List-Id: Technical discussions relating to FreeBSD List-Archive: https://lists.freebsd.org/archives/freebsd-hackers List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-hackers@FreeBSD.org Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.500.171.1.1\)) Subject: Trying to understand CAM (and the cciss driver) Message-Id: <087AB4C4-9CE0-4024-8E1B-4C636014C8CC@lysator.liu.se> Date: Thu, 9 May 2024 19:58:32 +0200 To: FreeBSD Hackers X-Mailer: Apple Mail (2.3774.500.171.1.1) X-Virus-Scanned: ClamAV using ClamSMTP X-Spamd-Bar: --- X-Spamd-Result: default: False [-3.42 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-0.92)[-0.924]; DMARC_POLICY_ALLOW(-0.50)[lysator.liu.se,none]; MV_CASE(0.50)[]; R_SPF_ALLOW(-0.20)[+a:mail.lysator.liu.se]; RCVD_IN_DNSWL_MED(-0.20)[130.236.254.3:from]; MIME_GOOD(-0.10)[text/plain]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; ASN(0.00)[asn:2843, ipnet:130.236.0.0/16, country:SE]; RCPT_COUNT_ONE(0.00)[1]; RCVD_COUNT_THREE(0.00)[3]; MID_RHS_MATCH_FROM(0.00)[]; R_DKIM_NA(0.00)[]; FROM_EQ_ENVFROM(0.00)[]; FROM_HAS_DN(0.00)[]; MIME_TRACE(0.00)[0:+]; TO_MATCH_ENVRCPT_ALL(0.00)[]; TO_DN_ALL(0.00)[]; PREVIOUSLY_DELIVERED(0.00)[freebsd-hackers@freebsd.org]; MLMMJ_DEST(0.00)[freebsd-hackers@FreeBSD.org]; ARC_NA(0.00)[] X-Rspamd-Queue-Id: 4Vb0BF3gK5z4Xn1 Hi, I=E2=80=99m trying to fix a bug in the cciss driver that has been there = =E2=80=9Cforever=E2=80=9D when using it with an HP H241 SAS HBA card.=20 The driver works fine when all (SAS, spinning rust) drives are behaving = well, but when some of them are starting to go bad it often goes into = spin and either hangs the kernel or panics. I=E2=80=99ve been trying to = add instrumentation to it in order to pin-point the problem and have = been attempting some workarounds (like clearing cr_complete since = without that hack sometime the driver get many many non-busy repeated = requests with the same =E2=80=9Ctag=E2=80=9D and then It panics with:=20 > ciss1: WARNING: completing non-busy request > ciss1: WARNING: completing non-busy request > ciss1: WARNING: completing non-busy request > panic: cam_periph_done_panic: already done with ccb 0xfffff88abefc4000 > cpuid =3D 0 > time =3D 1714852260 > KDB: stack backtrace: > #0 0xffffffff80c541b5 at kdb_backtrace+0x65 > #1 0xffffffff80c06b31 at vpanic+0x151 > #2 0xffffffff80c069d3 at panic+0x43 > #3 0xffffffff8039017c at cam_periph_done_panic+0x1c > #4 0xffffffff80397193 at xpt_done_process+0x313 > #5 0xffffffff803990a5 at xpt_done_td+0xf5 > #6 0xffffffff80bc333e at fork_exit+0x7e > #7 0xffffffff8108a44e at fork_trampoline+0xe =20 but it=E2=80=99s still there=E2=80=A6=20 I think I could use some input/suggestions from someone more = knowledgeable with CAM and device drivers in general=E2=80=A6 An example (this is with extra instrumentation added so not quite a = stock kernel driver): ciss1: *** Drive failure imminent, Port=3D1E Box=3D1 Bay=3D17 = reason=3D0x37 [ts=3D4460, class=3D4, subclass=3D0, detail=3D1, tag=3D0x2] (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 51 98 00 = 00 00 10 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 52 18 00 = 00 00 10 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 65 c8 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 63 c8 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 61 c8 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f c8 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f a8 00 = 00 00 20 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f 38 00 = 00 00 70 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5d 38 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5b 38 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 59 38 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 58 48 00 = 00 00 f0 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 56 48 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 48 d0 00 = 00 00 10 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 46 e0 00 = 00 01 f0 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8f d0 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8d d0 00 = 00 02 00 00 00=20 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries remain ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, = ccphys=3D1791048448, tag=3D72, flags=3D0x50 (----I-C), sg_tag=3D0x3, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, = ccphys=3D1791318368, tag=3D313, flags=3D0x50 (----I-C), sg_tag=3D0x3, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D8192, = ccphys=3D1791226528, tag=3D231, flags=3D0x50 (----I-C), sg_tag=3D0x3, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D262144, = ccphys=3D1791313888, tag=3D309, flags=3D0x50 (----I-C), sg_tag=3D0x5, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy ciss1: WARNING: completing non-busy request #1 [onq=3D0, length=3D8192, = ccphys=3D1791206368, tag=3D213, flags=3D0x50 (----I-C), sg_tag=3D0x3, = cr_complete=3D0xffffffff806059a0] ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy panic: camq_remove: Attempt to remove out-of-bounds index -1 from queue = 0xfffff80107fa9838 of size 1 cpuid =3D 7 time =3D 1715266713 KDB: stack backtrace: #0 0xffffffff80c542d5 at kdb_backtrace+0x65 #1 0xffffffff80c06c51 at vpanic+0x151 #2 0xffffffff80c06af3 at panic+0x43 #3 0xffffffff80390676 at camq_remove+0xf6 #4 0xffffffff803936c8 at xpt_run_devq+0x148 #5 0xffffffff80392b0c at xpt_action_default+0x35c #6 0xffffffff803c3544 at dastart+0x314 #7 0xffffffff80394f73 at xpt_run_allocq+0x83 #8 0xffffffff803c5152 at dastrategy+0x82 #9 0xffffffff80b438d4 at g_disk_start+0x314 #10 0xffffffff80b46b55 at g_io_request+0x1d5 #11 0xffffffff80b46b55 at g_io_request+0x1d5 #12 0xffffffff821a34dc at vdev_geom_io_start+0x23c #13 0xffffffff82347fb4 at zio_vdev_io_start+0x204 #14 0xffffffff8234204f at zio_nowait+0x12f #15 0xffffffff822a0ce7 at vdev_mirror_io_start+0x177 #16 0xffffffff82347fb4 at zio_vdev_io_start+0x204 #17 0xffffffff8234204f at zio_nowait+0x12f Uptime: 10h16m26s The drive it complains about in the first line is one that has SMART = errors: > connector 1E box 1 bay 17 HP = MB010000JWAYK 7PH8LU3G HPD5 = S.M.A.R.T. predictive failure. It is _not_ da122 though=E2=80=A6 The drive with a SMART error (da85) is = not in use so basically is just sitting there. The code where I=E2=80=99m trying to figure out what=E2=80=99s going on = is this part in sys/dev/ciss/ciss.c: static void ciss_complete(struct ciss_softc *sc, cr_qhead_t *qh) { struct ciss_request *cr; debug_called(2); /* = = =20 * Loop taking requests off the completed queue and performing = = =20 * completion processing on them. = = =20 */ for (;;) { int got_nonbusy =3D 0; if ((cr =3D ciss_dequeue_complete(sc, qh)) =3D=3D NULL) break; ciss_unmap_request(cr); if ((cr->cr_flags & CISS_REQ_BUSY) =3D=3D 0) { cr->nonbusy_request_counter++; ciss_printf(sc, "WARNING: completing non-busy request #%d [onq=3D%d,= length=3D%u, ccphys=3D%u, tag=3D%d, flags=3D0x%x (%s), sg_tag=3D0x%x, = cr_complete=3D%p]\n", cr->nonbusy_request_counter, cr->cr_onq, cr->cr_length, cr->cr_ccphys, = cr->cr_tag, cr->cr_flags, cr_flags2str(cr->cr_flags), cr->cr_sg_tag, cr->cr_complete ); got_nonbusy =3D 1; } cr->cr_flags &=3D ~CISS_REQ_BUSY; /* = = =20 * If the request has a callback, invoke it. = = =20 */ if (cr->cr_complete !=3D NULL) { cr->cr_complete(cr); /* pen: Attempt to make sure ciss_cam_complete() doesn=E2=80=99t = get called multiple times */ if (cr->cr_complete =3D=3D ciss_cam_complete && got_nonbusy) { ciss_printf(sc, "WARNING: clearing cr_complete due to = ciss_cam_complete & nonbusy\n"); cr->cr_complete =3D NULL; } continue; } =20 /* = = =20 * If someone is sleeping on this request, wake them up. = = =20 */ if (cr->cr_flags & CISS_REQ_SLEEP) { cr->cr_flags &=3D ~CISS_REQ_SLEEP; wakeup(cr); continue; } /* = = =20 * If someone is polling this request for completion, signal. = = =20 */ if (cr->cr_flags & CISS_REQ_POLL) { cr->cr_flags &=3D ~CISS_REQ_POLL; continue; } /* = = =20 * Give up and throw the request back on the free queue. This = = =20 * should never happen; resources will probably be lost. = = =20 */ ciss_printf(sc, "WARNING: completed command with no = submitter\n"); ciss_enqueue_free(cr); } } - Any suggestions? (Probably not but I thought I=E2=80=99d ask anyway = :-) - Peter