From owner-freebsd-stable  Fri Aug  6 20:33: 8 1999
Delivered-To: freebsd-stable@freebsd.org
Received: from hotmail.com (law-f60.hotmail.com [209.185.131.123])
	by hub.freebsd.org (Postfix) with SMTP id 684F914CE5
	for <freebsd-stable@FreeBSD.ORG>; Fri,  6 Aug 1999 20:32:54 -0700 (PDT)
	(envelope-from lightningweb@hotmail.com)
Received: (qmail 17072 invoked by uid 0); 7 Aug 1999 03:32:41 -0000
Message-ID: <19990807033241.17071.qmail@hotmail.com>
Received: from 24.4.2.215 by www.hotmail.com with HTTP;
	Fri, 06 Aug 1999 20:32:40 PDT
X-Originating-IP: [24.4.2.215]
From: "lweb Lightningweb" <lightningweb@hotmail.com>
To: freebsd-stable@FreeBSD.ORG
Cc: greg@lightningweb.com, jeremy@lightningweb.com,
	keith@lightningweb.com, criter@lightningweb.com
Subject: continued crashes with 3.1-Stable
Date: Fri, 06 Aug 1999 20:32:40 PDT
Mime-Version: 1.0
Content-Type: text/plain; format=flowed
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

We have not resolved our problem with frequent freezes with our web server.  
We had two responses to our first mail to this list, but neither one was the 
solution.  The problem is that the server will stop responding to ANYTHING 
except pings.  No telnet, no ssh, no web, no ftp, nothing.  Open telnet 
sessions don't drop, there's just no response to keyboard activity.

One suggestion was to fix the "pthreads library," which we did.  The other 
was: "You may have hardware problems."

This server is going down more frequently.  Three times so far today.  There 
is no apparent pattern to the crashes.  They seem to happen most often 
during a MySQL database query, but it's happened many times without any 
queries (a few times just by running "pine" with a large mailbox file).

We cannot recreate the crash when we want, it just crashes at random times.  
We've tried hammering it with web, database queries, and benchmarking 
programs that slam the RAID array and memory and processors, but it chugs 
right along.

We have replaced drives in the RAID array, and we are now replacing drive 
caddies.  The next step, I think, will be the RAID controller.  I have a 
strong gut feeling that it is software, however.  There's nothing to 
substantiate this, except that more often than not, the crash happens 
during a MySQL query.

Some (NOT ALL) of the suspect errors that we've recorded from the console 
during a crash are:

(da0:dpt0:0:0:0): Invalidating pack
biodone: buffer already done
spec_getpages: I/O read failure: (error code=6)
               size: 32768, resid: 32768, a_count: 32768, valid: 0x0
               nread: 0, reqpage: 0, pindex: 0, pcount: 8
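
As a reading aid for the numbers in the spec_getpages message above, here is 
a minimal Python sketch that pulls the fields out of the text and works out 
what they imply.  It assumes the "error code" is an ordinary errno value 
(an assumption, not something the log states); the names LOG and fields are 
made up for illustration.

```python
import errno
import re

# Console text as recorded above (whitespace is not significant here).
LOG = """spec_getpages: I/O read failure: (error code=6)
               size: 32768, resid: 32768, a_count: 32768, valid: 0x0
               nread: 0, reqpage: 0, pindex: 0, pcount: 8"""

# Collect every "name: number" or "name=number" pair; int(v, 0) accepts 0x0 too.
fields = {k: int(v, 0)
          for k, v in re.findall(r"(\w+)\s*[:=]\s*(0x[0-9a-fA-F]+|\d+)", LOG)}

# ASSUMPTION: the error code is a standard errno number.
name = errno.errorcode.get(fields["code"], "unknown")

# Bytes per page requested, and how much of the request actually completed.
page_size = fields["size"] // fields["pcount"]
bytes_read = fields["size"] - fields["resid"]

print(f"errno {fields['code']} = {name}")
print(f"{fields['pcount']} pages of {page_size} bytes requested, "
      f"{bytes_read} bytes transferred")
```

Under that assumption, error 6 maps to ENXIO, and resid equal to size means 
none of the eight 4096-byte pages was transferred.  That would fit the disk 
device having gone away after "Invalidating pack", but it does not say why.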


Could everyone please take a second look at this and help us brainstorm the 
problem?  I am including a list of the hardware, the original message we 
sent to the list, and a recent dmesg:

FreeBSD 3.1-STABLE #1
Dual-Proc PII 450
512MB RAM
DPT PM334UW RAID controller
- 16MB RAM
- dual bus Ultra Wide
- Six 9.1GB Quantum VikingII SCSI3 U2W drives
- Three drives per bus, RAID5, one drive is hot-spare
Intel EtherExpress Pro 10/100B Ethernet
TOSHIBA CD-ROM XM-6201TA


--------------
I've recently had the job of system administration dumped in my lap.  I'm
looking forward to getting on top of it, but I'm a little behind the 8-ball
right now.  If my subject matter varies too far from the allowed context of
this list, please don't flame me too badly.

Background:  We are running a dual PII 450 system with a 45 gig raid array,
controlled by a DPT PM334.

The O/S: FreeBSD 3.1-STABLE #1

For several months this has been rock solid.  However, in the past three
weeks, we've had a number of crashes, most of which seem to be related to
mysql queries.  The system would be totally unresponsive to ssh/telnet and
web, but would still return pings.

The server is colocated at our ISP, so it's been tricky to track down the
exact 'on screen' console errors.  Today, shortly after we upgraded our
mysql version, I did see the error.


(da0:dpt0:0:0:0): Invalidating Pack
(da0:dpt0:0:0:0): Invalidating Pack
devstat_end_transaction: HELP!! busy_count for da0 is < 0 (-1)!
biodone: buffer already done
(da0:dpt0:0:0:0): Read  (10). CDB: 28 0 3 87 33 1f 0 0 80 0
(da0:dpt0:0:0:0): ILLEGAL REQUEST asc:20,0
(da0:dpt0:0:0:0): Invalid command operation code
devstat_end_transaction: HELP!! busy_count for da0 is < 0 (-1)!
biodone: buffer already done


Followed by a complete system freeze, including the console.
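
For anyone reading the CDB line in the log above, the bytes can be decoded 
by hand.  Below is a minimal Python sketch, assuming the standard 10-byte 
SCSI READ(10) layout (opcode 0x28, big-endian block address in bytes 2-5, 
big-endian transfer length in bytes 7-8).  The function name decode_read10 
is hypothetical, and the 512-byte block size is an assumption that the log 
does not confirm.

```python
import struct

def decode_read10(cdb_text: str) -> dict:
    """Decode a READ(10) CDB printed as space-separated hex bytes."""
    cdb = bytes(int(tok, 16) for tok in cdb_text.split())
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2-5: logical block address; bytes 7-8: number of blocks to read.
    (lba,) = struct.unpack(">I", cdb[2:6])
    (blocks,) = struct.unpack(">H", cdb[7:9])
    return {"opcode": cdb[0], "lba": lba, "blocks": blocks}

info = decode_read10("28 0 3 87 33 1f 0 0 80 0")

BLOCK_SIZE = 512  # ASSUMPTION: typical block size, not stated in the log
print(f"opcode 0x{info['opcode']:02x}, LBA 0x{info['lba']:08x}, "
      f"{info['blocks']} blocks ({info['blocks'] * BLOCK_SIZE} bytes)")
```

Read this way, the request is an ordinary 128-block read, so the "Invalid 
command operation code" reply looks like it concerns the device's state 
rather than an unusual command.  That is an interpretation, not a diagnosis.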

Some hunting and searching has led us to believe that we are encountering a
driver failure and that we should bring the OS back to -stable.

As I said, I haven't done this before, so I'm a little anxious.  Before I
take that step, I would be very grateful to hear some input from those who
surely know more about this than I do.

Is bringing the system back to -stable likely to correct our problem?  Am I
missing some indicator in the error above?   Has someone else encountered
similar trouble (and found a fix)?

I'll be happy to take replies in private e-mail if this is off topic.

Any help would be great.

Thanks,
Jeremy



