From owner-freebsd-hackers  Sat Jun 21 10:49:13 1997
Return-Path: <owner-hackers>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.5/8.8.5) id KAA11535
          for hackers-outgoing; Sat, 21 Jun 1997 10:49:13 -0700 (PDT)
Received: from sendero-ppp.i-connect.net (sendero-ppp.i-Connect.Net [206.190.143.100])
          by hub.freebsd.org (8.8.5/8.8.5) with SMTP id KAA11483
          for <FreeBSD-Hackers@FreeBSD.ORG>; Sat, 21 Jun 1997 10:49:03 -0700 (PDT)
Received: (qmail 11673 invoked by uid 1000); 21 Jun 1997 17:49:01 -0000
Message-ID: <XFMail.970621104901.Shimon@i-Connect.Net>
X-Mailer: XFMail 1.2-alpha [p0] on FreeBSD
Content-Type: text/plain; charset=iso-8859-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Date: Sat, 21 Jun 1997 10:49:01 -0700 (PDT)
Organization: Atlas Telecom
From: Simon Shapiro <Shimon@i-Connect.Net>
To: FreeBSD-Hackers@FreeBSD.ORG, FreeBSD-SCSI@FreeBSD.ORG
Subject: Mystery of The missing I/O - Help Solicited
Sender: owner-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

Hi Y'all

This message is for all those who are still speaking to me after daring to
suggest that plastic (yuck!) disk carriers can be as good as steel ones
(imagine that!) :-))

No, really, there is something serious we could be helped with:

With the new DPT driver, we have been plagued by occasional hangs.
What happens is that after a few minutes of operation, or after a few days
of operation, under varying loads, any process which goes to a certain disk
just blocks indefinitely.

We verified that we do not miss processing any interrupt.
We fixed a minor hole that caused biodone to get confused every million
I/Os or so.  We traced individual commands to make sure that there is no
SCSI command which we fail to return to sd.c.

To make these verifications we built all kinds of strange and interesting
tools.  Nothing helps.
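
The kind of bookkeeping such a command trace involves can be sketched
roughly as below.  This is a user-space illustration only, not code from
the DPT driver or from sd.c; the table size and every name in it
(cmd_table, cmd_submit, cmd_complete, cmd_report_stuck) are hypothetical.
Each command is recorded on its way down and cleared on its way back up;
anything still in the table after a generous timeout is a command that
never came back.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CMD_TABLE_SIZE 64           /* hypothetical cap on commands in flight */

struct cmd_slot {
    int      in_use;                /* 1 while the command is outstanding */
    uint32_t id;                    /* tag assigned when the command was queued */
    uint64_t submit_usec;           /* time the command was handed down */
};

struct cmd_table {
    struct cmd_slot slot[CMD_TABLE_SIZE];
    unsigned long   submitted;      /* total commands recorded going down */
    unsigned long   completed;      /* total commands seen coming back */
};

void cmd_table_init(struct cmd_table *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
}

/* Record a command on its way down.  Returns 0, or -1 if the table is full. */
int cmd_submit(struct cmd_table *t, uint32_t id, uint64_t now_usec)
{
    for (int i = 0; i < CMD_TABLE_SIZE; i++) {
        if (!t->slot[i].in_use) {
            t->slot[i].in_use = 1;
            t->slot[i].id = id;
            t->slot[i].submit_usec = now_usec;
            t->submitted++;
            return 0;
        }
    }
    return -1;
}

/* Record a completion.  Returns 0, or -1 if no outstanding command has this
 * id (never submitted, or completed twice). */
int cmd_complete(struct cmd_table *t, uint32_t id)
{
    for (int i = 0; i < CMD_TABLE_SIZE; i++) {
        if (t->slot[i].in_use && t->slot[i].id == id) {
            t->slot[i].in_use = 0;
            t->completed++;
            return 0;
        }
    }
    return -1;
}

/* Print and count commands outstanding for longer than max_age_usec. */
int cmd_report_stuck(const struct cmd_table *t, uint64_t now_usec,
                     uint64_t max_age_usec)
{
    int stuck = 0;

    for (int i = 0; i < CMD_TABLE_SIZE; i++) {
        const struct cmd_slot *s = &t->slot[i];

        if (s->in_use && now_usec - s->submit_usec > max_age_usec) {
            printf("stuck: cmd %lu, outstanding for %llu usec\n",
                   (unsigned long)s->id,
                   (unsigned long long)(now_usec - s->submit_usec));
            stuck++;
        }
    }
    return stuck;
}
```

In a check of this kind, a stuck system would be expected to show either a
non-zero count from cmd_report_stuck() or submitted and completed counters
that stop agreeing.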

Oh, to confuse everyone, we can reproduce this problem only on Pentium Pros.
Pentium-100s simply will not fail.  We brought the load on test systems
all the way up to about 120.  Nothing.

Next set of hints:  We can reliably reproduce the problem only on sendero,
and only when doing make release.  Or so we thought.

Today we decided to try something else.  We quieted down ALL networking
activity on the system, including disconnecting PPP.  We managed to run
make release flawlessly.  Several times.  Connect PPP, and SCSI command
completions seem to disappear somewhere between sd.c and the driver, or
higher up.  Disconnect PPP and all is well.

Before someone tells me to shut down the software interrupts, I will be
quick to point out that I can #ifdef them out and still get the same
problem.  Exactly.

Let me point out that the DPT can complete a SCSI READ/WRITE command in
about 250 microseconds (on a cache hit).  We measured, occasionally,
interrupts coming as little as 4 microseconds apart (like two consecutive
cache hits).

We are at our wits' end to find an explanation for this.  Any suggestions
will be greatly appreciated.


Thanks,

Simon

Quiz:    How many SCSI commands does it take to run make release?
Answer:  300,000 reads and 2.1 million writes.