From owner-freebsd-hackers Sat Jun 21 10:49:13 1997
Return-Path:
Received: (from root@localhost)
    by hub.freebsd.org (8.8.5/8.8.5) id KAA11535
    for hackers-outgoing; Sat, 21 Jun 1997 10:49:13 -0700 (PDT)
Received: from sendero-ppp.i-connect.net (sendero-ppp.i-Connect.Net [206.190.143.100])
    by hub.freebsd.org (8.8.5/8.8.5) with SMTP id KAA11483
    for ; Sat, 21 Jun 1997 10:49:03 -0700 (PDT)
Received: (qmail 11673 invoked by uid 1000); 21 Jun 1997 17:49:01 -0000
Message-ID:
X-Mailer: XFMail 1.2-alpha [p0] on FreeBSD
Content-Type: text/plain; charset=iso-8859-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Date: Sat, 21 Jun 1997 10:49:01 -0700 (PDT)
Organization: Atlas Telecom
From: Simon Shapiro
To: FreeBSD-Hackers@FreeBSD.ORG, FreeBSD-SCSI@FreeBSD.ORG
Subject: Mystery of The missing I/O - Help Solicited
Sender: owner-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

Hi Y'all,

This message is for all those who are still speaking to me after I dared to suggest that plastic (yuck!) disk carriers can be as good as steel ones (imagine that!) :-))

No, really, there is something serious we could use help with:

With the new DPT driver, we have been plagued by occasional hangs. What happens is that after a few minutes of operation, or after a few days of operation, under varying loads, any process which goes to a certain disk just blocks indefinitely.

We verified that we do not miss processing any interrupt. We fixed a minor hole that caused biodone to get confused every million I/Os or so. We traced individual commands to make sure that there is no SCSI command which we fail to return to sd.c. To make these verifications we built all kinds of strange and interesting tools. Nothing helps.

Oh, to confuse everyone, we can reproduce this problem only on Pentium Pros. Pentium-100s simply will not fail. We brought the load on test systems all the way up to about 120. Nothing.

Next hint set: We can reliably reproduce the problem only on sendero, only when doing make release. Or so we thought. Today we decided to try something else. We quieted down ALL networking activity on the system, including disconnecting PPP. We managed to run make release flawlessly. Several times. Connect PPP and SCSI command completions seem to disappear somewhere between sd.c and the driver or higher. Disconnect PPP and all is well.

Before someone tells me to shut down the software interrupts, I will be quick to point out that I can #ifdef it out and still get the same problem. Exactly.

Let me point out that the DPT can complete a SCSI READ/WRITE command in about 250 microseconds (on a cache hit). We measured, occasionally, interrupts coming as fast as 4 microseconds apart (like two consecutive cache hits).

We are at our wits' end to find an explanation for this. Any suggestion will be greatly appreciated.

Thanx,
Simon

Quiz: How many SCSI commands does it take to run make release?
Answer: 300,000 reads and 2.1 million writes.
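
[Illustrative sketch, not part of the original message.] The command tracing described above (checking that every SCSI command issued is eventually returned to sd.c) might look roughly like the following in plain C. This is not the DPT driver's actual tooling: the names `cmd_trace`, `trace_issue`, `trace_complete`, `trace_stuck`, and the `MAX_INFLIGHT` limit are all hypothetical, and a real in-kernel version would also need to protect the table against interrupt-time access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 64 /* hypothetical cap on outstanding commands */

/* Bookkeeping for commands handed to the controller and not yet returned. */
struct cmd_trace {
    uint64_t issued;                   /* commands sent down to the driver */
    uint64_t completed;                /* completions passed back up */
    uint64_t issue_time[MAX_INFLIGHT]; /* 0 = slot free, else issue time (us) */
};

/* Record that the command in `slot` was issued at `now_us` (must be nonzero).
 * Returns 0 on success, -1 if the slot is already occupied. */
int trace_issue(struct cmd_trace *t, size_t slot, uint64_t now_us)
{
    assert(slot < MAX_INFLIGHT && now_us != 0);
    if (t->issue_time[slot] != 0)
        return -1;
    t->issue_time[slot] = now_us;
    t->issued++;
    return 0;
}

/* Record a completion for `slot`.
 * Returns 0 on success, -1 if no command was outstanding there
 * (a spurious or duplicate completion). */
int trace_complete(struct cmd_trace *t, size_t slot)
{
    assert(slot < MAX_INFLIGHT);
    if (t->issue_time[slot] == 0)
        return -1;
    t->issue_time[slot] = 0;
    t->completed++;
    return 0;
}

/* Count commands that have been outstanding for more than `timeout_us`. */
size_t trace_stuck(const struct cmd_trace *t, uint64_t now_us,
                   uint64_t timeout_us)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (t->issue_time[i] != 0 && now_us - t->issue_time[i] > timeout_us)
            n++;
    }
    return n;
}
```

With a zero-initialized `struct cmd_trace`, a periodic call to `trace_stuck` with a generous timeout would flag any command whose completion never made it back up, and a mismatch between `issued` and `completed` on an idle system would point the same way.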