Date:      Wed, 29 Dec 2021 13:39:36 -0800
From:      John Baldwin <jhb@FreeBSD.org>
To:        scsi@FreeBSD.org
Cc:        Alexander Motin <mav@FreeBSD.org>, =?UTF-8?Q?Edward_Tomasz_Napiera=c5=82a?= <trasz@freebsd.org>
Subject:   iSCSI target: Handling in-flight requests during ctld shutdown
Message-ID:  <fd383f6f-5a19-e2bb-5383-e559271eb3cd@FreeBSD.org>

One of the tests Chelsio QA has been running against our iSCSI stack
with cxgbei offload enabled is to run several iozone instances on an
initiator while a script on the target repeatedly stops ctld (for a
minute or so), then starts it again and lets it run for about 5
minutes before stopping it again.

One of the errors found last night is that the target reported the
following error to the initiator:

(da7:iscsi10:0:0:0): CAM status: SCSI Status Error
(da7:iscsi10:0:0:0): SCSI status: Check Condition
(da7:iscsi10:0:0:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
(da7:iscsi10:0:0:0): Actual Retry Count: 44
(da7:iscsi10:0:0:0): Error 5, Unretryable error
g_vfs_done():da7[WRITE(offset=9797632, length=32768)]error = 6
UFS: forcibly unmounting /dev/da7 from /ISCSI8

The "retry count" of 44 is the breadcrumb for finding the corresponding
error in the ctl code: it matches the error code passed to
cfiscsi_data_wait_abort().  In this case it is here in
ctl_frontend_iscsi.c:

static void
cfiscsi_datamove_out(union ctl_io *io)
{
	...
	CFISCSI_SESSION_LOCK(cs);
	if (cs->cs_terminating) {
		CFISCSI_SESSION_UNLOCK(cs);
		cfiscsi_data_wait_abort(cs, cdw, 44);
		return;
	}
	TAILQ_INSERT_TAIL(&cs->cs_waiting_for_data_out, cdw, cdw_next);
	CFISCSI_SESSION_UNLOCK(cs);
	...
}

I added this check recently (September) to fix a deadlock I encountered
during similar testing:

commit 0cd6e85e242bb07a33df9a6314e90bcb0ba99576
Author: John Baldwin <jhb@FreeBSD.org>
Date:   Wed Sep 15 13:25:30 2021 -0700

     iscsi: Abort data-out tasks queued on a terminating session.
     
     cfiscsi_datamove_out() can race with cfiscsi_session_terminate_tasks()
     and enqueue a new task after the latter function has aborted existing
     tasks.  This could result in a deadlock as
     cfiscsi_session_terminate_tasks() waited forever for this task to
     complete.
     
     Reviewed by:    mav
     Sponsored by:   Chelsio Communications
     Differential Revision:  https://reviews.freebsd.org/D31892

Note that if ctld were shut down just slightly later, we would abort the
request in much the same way in cfiscsi_session_terminate_tasks(), just
with an error code of 42:

	CFISCSI_SESSION_LOCK(cs);
	while ((cdw = TAILQ_FIRST(&cs->cs_waiting_for_data_out)) != NULL) {
		TAILQ_REMOVE(&cs->cs_waiting_for_data_out, cdw, cdw_next);
		CFISCSI_SESSION_UNLOCK(cs);
		cfiscsi_data_wait_abort(cs, cdw, 42);
		CFISCSI_SESSION_LOCK(cs);
	}
	CFISCSI_SESSION_UNLOCK(cs);
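
As a rough userspace model of the two abort paths quoted above (not the
actual kernel code), the sketch below uses a pthread mutex and a
sys/queue.h TAILQ.  The names session, data_wait, datamove_out(),
terminate_tasks() and data_wait_abort() are hypothetical stand-ins, and
setting the terminating flag inside terminate_tasks() is an assumption of
the model rather than a statement about where the kernel sets
cs_terminating.

```c
/*
 * Userspace model only.  Names mirror the kernel code loosely; nothing
 * here is the real CTL or cfiscsi API.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

struct data_wait {
	TAILQ_ENTRY(data_wait) cdw_next;
	int abort_code;		/* 0 until the request is aborted */
};

struct session {
	pthread_mutex_t lock;
	bool terminating;
	TAILQ_HEAD(, data_wait) waiting_for_data_out;
};

static void
session_init(struct session *cs)
{
	pthread_mutex_init(&cs->lock, NULL);
	cs->terminating = false;
	TAILQ_INIT(&cs->waiting_for_data_out);
}

/* Stand-in for cfiscsi_data_wait_abort(): only records the code. */
static void
data_wait_abort(struct data_wait *cdw, int code)
{
	assert(code != 0);
	cdw->abort_code = code;
}

/* Returns true if the request was queued, false if it was aborted (44). */
static bool
datamove_out(struct session *cs, struct data_wait *cdw)
{
	pthread_mutex_lock(&cs->lock);
	if (cs->terminating) {
		pthread_mutex_unlock(&cs->lock);
		data_wait_abort(cdw, 44);
		return (false);
	}
	TAILQ_INSERT_TAIL(&cs->waiting_for_data_out, cdw, cdw_next);
	pthread_mutex_unlock(&cs->lock);
	return (true);
}

/* Marks the session terminating, drains the queue (42), returns count. */
static int
terminate_tasks(struct session *cs)
{
	struct data_wait *cdw;
	int aborted = 0;

	pthread_mutex_lock(&cs->lock);
	cs->terminating = true;		/* model assumption, see above */
	while ((cdw = TAILQ_FIRST(&cs->waiting_for_data_out)) != NULL) {
		TAILQ_REMOVE(&cs->waiting_for_data_out, cdw, cdw_next);
		/* Drop the lock around the abort, as the quoted loop does. */
		pthread_mutex_unlock(&cs->lock);
		data_wait_abort(cdw, 42);
		aborted++;
		pthread_mutex_lock(&cs->lock);
	}
	pthread_mutex_unlock(&cs->lock);
	return (aborted);
}
```

In this model a request queued before terminate_tasks() runs is aborted
with 42, while one that arrives afterwards is aborted with 44 and never
reaches the queue, so the drain loop cannot end up waiting on it.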

So I think my question is: what is the expected behavior?  Is the
internal error really expected to make it onto the wire and be sent to
the other side?  Since the connection is shutting down, should we just
discard the reply altogether rather than report an internal error?  If we
discarded the reply, the initiator in this particular test would have
retried the original request once ctld was restarted and continued
running without an error.
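
To make the two alternatives in that question concrete, here is a toy
decision table in C.  It is purely illustrative: reply_policy(),
initiator_outcome() and the discard_on_shutdown knob are hypothetical
names, not existing ctl or iscsi interfaces, and the initiator outcomes
simply restate what is described above (an internal-failure reply ends in
an unretryable error, a missing reply leads to a retry after reconnect).

```c
/* Toy model of the question above; no real ctl/iscsi API is used. */
#include <stdbool.h>

enum reply_action {
	REPLY_SEND_STATUS,	/* normal completion */
	REPLY_SEND_ERROR,	/* Check Condition, internal target failure */
	REPLY_DISCARD		/* send nothing; the connection is going away */
};

enum initiator_outcome {
	IO_COMPLETED,
	IO_FAILED,		/* unretryable error, as in the log above */
	IO_RETRIED		/* request reissued after reconnecting */
};

/*
 * Hypothetical policy: discard_on_shutdown selects between the current
 * behavior (false) and the proposed one (true).
 */
static enum reply_action
reply_policy(bool aborted, bool session_terminating, bool discard_on_shutdown)
{
	if (!aborted)
		return (REPLY_SEND_STATUS);
	if (session_terminating && discard_on_shutdown)
		return (REPLY_DISCARD);
	return (REPLY_SEND_ERROR);
}

/* What the initiator in the described test would see for each action. */
static enum initiator_outcome
initiator_outcome(enum reply_action action)
{
	switch (action) {
	case REPLY_SEND_ERROR:
		return (IO_FAILED);
	case REPLY_DISCARD:
		return (IO_RETRIED);
	case REPLY_SEND_STATUS:
	default:
		return (IO_COMPLETED);
	}
}
```

Under these assumptions only the discard policy lets a request aborted
during shutdown survive a ctld restart.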


-- 
John Baldwin


