Date:      Fri, 5 Jun 1998 09:30:46 +0930
From:      Greg Lehey <grog@lemis.com>
To:        shimon@simon-shapiro.org
Cc:        Michael Hancock <michaelh@cet.co.jp>, "freebsd-current@freebsd.org" <freebsd-current@FreeBSD.ORG>, tcobb <tcobb@staff.circle.net>, Karl Pielorz <kpielorz@tdx.co.uk>, Mike Smith <mike@smith.net.au>
Subject:   Re: DPT driver fails and panics with Degraded Array
Message-ID:  <19980605093046.J768@freebie.lemis.com>
In-Reply-To: <XFMail.980604120046.shimon@simon-shapiro.org>; from Simon Shapiro on Thu, Jun 04, 1998 at 12:00:46PM -0400
References:  <19980603125443.K22406@freebie.lemis.com> <XFMail.980604120046.shimon@simon-shapiro.org>

On Thu,  4 June 1998 at 12:00:46 -0400, Simon Shapiro wrote:
>
> On 03-Jun-98 Greg Lehey wrote:
>> Why would a driver call biodone on a buffer that doesn't belong to it?
>
> The block belongs to it. Only it gets marked as done somehow.

That in itself is normal enough.  How come it's not busy?

>>>> These situations are worth analysing, and I hope to see you and Troy
>>>> resolving this one, even if it means that you point the finger
>>>> elsewhere.
>>
>> Definitely.  I'm surprised nobody has done it yet.
>
> I posted some notes on this issue several months ago, with no response.

Sorry, I've been too busy with my own stuff...

>>> I got these particularly with tape devices.  Especially if there are two
>>> tape drives on the system and you try to (for example) cpio to both
>>> independently.  I put a ton of debugging code in the DPT driver to try
>>> and catch the DPT sending biodone twice on the same request and am
>>> pretty comfortable the driver is not it.
>>
>> OK, where is the failing biodone called from?
>
> From the DPT driver.  Let me clarify the statement above;  There was a
> printf in the driver, just above the biodone call.  The driver also
> contains state info as to whether biodone was called or not (actually,
> biodone state is implicit from other states).  In every case where the
> biodone failure occurred, there was no prior call to biodone.  I.e. the
> offending
> call was the first call.  I even went as far as putting counters around
> these calls.  It always stayed at zero.

I don't know the driver, but I'm surprised you need to maintain
separate information.  I'd use the state in the bp->b_flags.
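
To illustrate what I mean by using the flags as the state, here is a
minimal userland sketch.  The struct buf below is a cut-down stand-in,
the flag values are illustrative rather than copied from <sys/buf.h>,
and the buf_state names are hypothetical, not anything in the kernel or
the DPT driver.

```c
/* Stand-in for the kernel's struct buf: only the field this sketch needs. */
struct buf {
    long b_flags;
};

/* Illustrative values; the real definitions live in <sys/buf.h>. */
#define B_BUSY 0x00000010L  /* buffer is owned by an I/O in progress */
#define B_DONE 0x00000200L  /* I/O on this buffer has completed */

/* Hypothetical names for the states a driver might otherwise track itself. */
enum buf_state {
    BUF_IDLE,         /* neither busy nor done */
    BUF_IN_PROGRESS,  /* busy, completion not yet signalled */
    BUF_COMPLETED     /* completion already signalled */
};

/* Derive the state from b_flags alone, with no separate bookkeeping. */
static enum buf_state
buf_state(const struct buf *bp)
{
    if (bp->b_flags & B_DONE)
        return BUF_COMPLETED;
    if (bp->b_flags & B_BUSY)
        return BUF_IN_PROGRESS;
    return BUF_IDLE;
}
```

The point is only that "has completion been signalled?" can be read
back from the buffer itself, so the driver and the rest of the kernel
can't disagree about it.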

> Since the greatest sensitivity was in the st.c, and st.c is new in CAM, I
> basically dropped the ball.  Especially when I did not have this problem in
> 3.0, from very early on.

I haven't seen a driver called st.c in CAM.  They've changed the
names, and the tape driver is now called scsi_sa.c.  st.c is the old
tape driver.  How do you determine "greatest sensitivity"?

In any case, I can't see how a different driver can influence things.
Heavy tape I/O may help the problem to show itself, but I can't think
it's in any way to blame.

>> I find this difficult to follow.  On the one hand, lots of people
>> (myself included) regularly use the st driver, and I've never seen
>> this behaviour.  About the only thing that these panics have in common
>> is the DPT driver.  It's easy enough to determine which driver is
>> involved: all you need to do is follow the stack trace to find what
>> device is involved with the buffer (or just look at bp->b_dev).
>
> Are you using two tape drives, and write to both concurrently, using 64k
> blocks?

Occasionally.

> Are you running disk I/O at 1500-1900 operations per second?  Is the
> SCSI controller you use capable of causing biodone to be called
> within less than 1us from the driver being called?

Well, I suppose each of the controllers could generate a number of
interrupts per second, so sooner or later that scenario would arise.
But as I said above, there's nothing to point to the st driver except
it's the new kid on the block.  What you have said points fairly and
squarely to the DPT driver as the culprit.

> The fact that the DPT driver causes this problem does not automatically
> vindicate the DPT driver code.  I would LOVE for it to be so because this
> is the part of the FreeBSD kernel I understand the best.
>
> Stack traces were analyzed, but did not reveal anything interesting.  It is
> entirely possible that the fast response from the DPT causes a race
> condition elsewhere.  Without cooperation from others who understand the
> other parts of the kernel better than I do, it is difficult for me to
> analyze it much further than ``I am pretty confident it is not a coding
> error in the driver or the immediate code that calls it''.

OK.  What happens if you analyse the buffer header before calling
biodone and just ignore it if it's not busy?
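
Something along these lines is what I have in mind.  It's a userland
sketch under stated assumptions, not driver code: struct buf is a
cut-down stand-in, the flag values are illustrative, biodone() is a
stub that only sets B_DONE, and dpt_biodone_checked() is a hypothetical
name I made up for the wrapper.

```c
#include <stdio.h>

/* Stand-in for the kernel's struct buf: only the field this sketch needs. */
struct buf {
    long b_flags;
};

/* Illustrative values; the real definitions live in <sys/buf.h>. */
#define B_BUSY 0x00000010L
#define B_DONE 0x00000200L

/* Counts calls that actually reached the biodone() stub. */
static int biodone_calls;

/* Stub for the kernel's biodone(): here it only marks the buffer done. */
static void
biodone(struct buf *bp)
{
    bp->b_flags |= B_DONE;
    biodone_calls++;
}

/*
 * Hypothetical wrapper: look at the buffer header first and skip the
 * biodone() call, with a diagnostic, if the buffer is not busy.
 * Returns 1 if biodone() was called, 0 if the buffer was ignored.
 */
static int
dpt_biodone_checked(struct buf *bp)
{
    if ((bp->b_flags & B_BUSY) == 0) {
        printf("dpt: buffer %p not busy (flags 0x%lx), skipping biodone\n",
            (void *)bp, (unsigned long)bp->b_flags);
        return 0;
    }
    biodone(bp);
    return 1;
}
```

If the panics go away and the diagnostic fires, the flags it prints
should give a hint as to who got to the buffer first.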

Greg
--
See complete headers for address and phone numbers
finger grog@lemis.com for PGP public key

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message


