Date:      Mon, 9 Jul 2001 10:18:21 -0700 (PDT)
From:      Tracy Camp <campt@miralink.com>
To:        freebsd-gnats-submit@FreeBSD.org
Subject:   kern/28840: Possible interrupt masking trouble in sys/cam/cam_xpt.c
Message-ID:  <200107091718.f69HIL899370@freefall.freebsd.org>

>Number:         28840
>Category:       kern
>Synopsis:       Possible interrupt masking trouble in sys/cam/cam_xpt.c
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jul 09 10:20:01 PDT 2001
>Closed-Date:
>Last-Modified:
>Originator:     Tracy Camp
>Release:        4.3-RELEASE
>Organization:
MiraLink Corp
>Environment:
FreeBSD 4.3-RELEASE FreeBSD 4.3-RELEASE #126: Fri Jul 6 13:37:26 EDT 2001
root@bsd-4-3.miralink.com:/usr/src/sys/compile/TEST i386
>Description:
For a long time we have been experiencing seemingly random crashes, first under FreeBSD 3.1 and now under FreeBSD 4.3, all related to the disk I/O system.  Our application uses a custom driver with a Qlogic ISP controller operating in target mode.  Extensive auditing of our driver source turned up no further problems.  However, sanity checks added to portions of sys/cam/cam_xpt.c showed what appeared to be queue corruption caused by incorrect interrupt masking.  The problem only shows up under rather heavy load.  Our driver does a fair amount of work at interrupt level, so that may be the underlying trigger.  Nevertheless, replacing every splsoftcam() call in sys/cam/cam_xpt.c with splcam() eliminated the problem entirely.

Specific problems we encountered:

devstat_end_transaction HELP!! busy_count for da2 < 0 (-1)

This was shown to always result from a devstat_end_transaction_buf call occurring within sys/cam/scsi/scsi_da.c:dadone()

panic: xpt_run_dev_allocq: Device on queue without any work to do

This was found after a bit of testing to be related directly to the next one:

Fatal Trap 12: page fault while in kernel mode

This was occurring within xpt_run_dev_allocq and was actually due to a NULL pointer being returned by camq_remove on the device queue.

Checks added to camq_insert and camq_remove showed that occasionally, when a queue entry was added to an empty queue, the entries count would be 0 rather than the expected 1 before camq_insert had finished.  Particularly convincing was a test we inserted that did something similar to this:

camq_insert(..)
{
    /* near the top */
    saved_entries = queue->entries;
    /* later */
    if (queue->entries < 1) {
        printf("entries < 1 %d", queue->entries);
    }
    else {
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
