From owner-freebsd-stable@FreeBSD.ORG  Thu Dec 15 02:01:58 2005
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id CEBDA16A428
	for <freebsd-stable@freebsd.org>; Thu, 15 Dec 2005 02:01:58 +0000 (GMT)
	(envelope-from atanas@asd.aplus.net)
Received: from pro20.abac.com (pro20.abac.com [66.226.64.21])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5D24043D45
	for <freebsd-stable@freebsd.org>; Thu, 15 Dec 2005 02:01:58 +0000 (GMT)
	(envelope-from atanas@asd.aplus.net)
Received: from [216.55.129.41] (asd0.aplus.net [216.55.129.41])
	(authenticated bits=0)
	by pro20.abac.com (8.13.4/8.13.4) with ESMTP id jBF21ukh029029;
	Wed, 14 Dec 2005 18:01:56 -0800 (PST)
	(envelope-from atanas@asd.aplus.net)
Message-ID: <43A0D070.7020103@asd.aplus.net>
Date: Wed, 14 Dec 2005 18:09:52 -0800
From: Atanas <atanas@asd.aplus.net>
User-Agent: Mozilla Thunderbird 1.0.7 (X11/20051026)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Peter Jeremy <PeterJeremy@optushome.com.au>
References: <439DE88B.1090407@asd.aplus.net>	<20051212214003.GA77268@cirb503493.alcatel.com.au>	<439E3894.6060901@asd.aplus.net>
	<20051213100034.GE77268@cirb503493.alcatel.com.au>
In-Reply-To: <20051213100034.GE77268@cirb503493.alcatel.com.au>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: 1.47 (SPF_SOFTFAIL)
Cc: freebsd-stable@freebsd.org
Subject: Re: 6.0 random freezes

Peter Jeremy said the following on 12/13/05 02:00:
> 
> Note that PS/2 keyboards aren't hot-pluggable and attempts to do so
> can have deleterious effects on your keyboard and/or motherboard.  In
> any case, the probe/attach sequence relies on the kernel being in a
> reasonably sane state (and I'm not sure if it will detect the keyboard
> as a console device except at boot time).
> 
I agree, but the keyboard is a passive device (with no power source, 
i.e. mostly harmless), and it's standard practice to have only a few 
movable consoles for several racks and plug them in only where 
necessary. It has always worked for us, and I don't remember having 
any hot-plugging accidents in years.

> If the keyboard has been plugged in since the system booted, do you
> still get the same "no response"?  If so, the kernel has wedged at
> a fairly low level and I'm not quite sure how to proceed other than
> by enabling the sanity checks that other people have mentioned
> (eg WITNESS, INVARIANTS) and hoping they catch something.
> 
I cannot say for sure. When the thing happens I'm usually away, and 
by the time I get there, the console could have been used by someone. 
I'm in the process of getting a serial console; if that gets no 
response either, I will enable the sanity checks.

> I only mentioned serial consoles on the off-chance that you had one.
> Whilst it may not help here, serial consoles have a number of
> advantages when managing remote equipment
>
Thanks for pointing this out. As I said, I'm in the process of getting 
one for now, and will possibly equip a few dozen servers with them later.

>>After the downgrade we could eventually set a test bed and start 
>>hammering it with requests. The problem would be how to trigger the 
>>crash and whether we would be able to reproduce it at all.
> 
I already went the 5.4 downgrade route. Actually, I was forced to do so 
the other night, when one of the machines started hanging every half 
hour or so. It looks like the background fsck on the slower SATA-based 
RAID5 array contributed a lot to that.

Now I have the test bed online. This is the very same server (SCSI-
based, with the OS drive intact and production data drives moved 
elsewhere) that was crashing once a day or so. Hopefully tomorrow I will 
have a serial console attached to it, so we can start pounding it. I 
hope this machine won't need to go into production during the next 
month or so, so that we'll have enough time for tests.

> Depending on your application and the interfaces to it, it might be
> feasible to either tee live traffic into both systems and just junk
> the responses from your test bed, or "record" live traffic and
> replay it into your test bed.
>
It runs a fairly complex set of services. It's a shared web hosting 
server handling several hundred websites, plus email (SMTP/POP3/IMAP), 
databases (MySQL), FTP, DNS, etc.

I don't know how easy it would be to implement such traffic gathering 
and replay it on the test bed. It seems kind of complicated at first 
sight (though I realize it might be the only way to reproduce the 
crash). We might need some NAT (via ipfw?), some services might not like 
their responses being junked, etc.
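
For the web part at least, a replay could perhaps be approximated from 
the Apache access logs instead of captured packets. Below is a rough 
sketch in Python, assuming logs in the common/combined log format. The 
base URL is a made-up placeholder, only GET requests are replayed, and 
the per-site Host header that name-based virtual hosting would need is 
ignored, so this is an illustration rather than a finished tool.

```python
import http.client
import re
import urllib.error
import urllib.request

# Matches the request field of a common/combined format log line,
# e.g. ... "GET /index.php?id=3 HTTP/1.1" 200 ...
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[0-9.]+"')


def get_paths(log_lines):
    """Yield the path of every GET request found in the log lines."""
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            yield match.group(1)


def replay(log_lines, base_url, timeout=10):
    """Re-issue each logged GET against base_url and junk the responses.

    base_url is a placeholder for the test bed (hypothetical example:
    "http://testbed.example"). Returns (requests_sent, requests_failed).
    """
    sent = failed = 0
    for path in get_paths(log_lines):
        sent += 1
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                resp.read()  # discard the body, only the load matters
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            failed += 1  # HTTP errors and network failures are just counted
    return sent, failed
```

Calling replay(open("access.log"), "http://testbed.example") would walk 
through the log once. Mail, FTP and DNS traffic would each need their 
own approach.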

I was thinking about trying the kernel stress suite first. Or just have 
something rsync-ing lots of files back and forth (possibly over the 
network), run apache bench in a loop and point it at some 
database-intensive page, etc.
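
The rsync/apache bench loop could be driven by something as small as 
the following sketch (Python). The paths, host name and URL in 
EXAMPLE_COMMANDS are made-up placeholders; the only flags used are 
rsync -a and ab -n/-c. Commands run one after another, so this is a 
starting point and not a real parallel load generator.

```python
import subprocess
import time

# Hypothetical commands; paths, host and URL are placeholders to adjust.
EXAMPLE_COMMANDS = [
    ["rsync", "-a", "/data/src/", "testbed.example:/data/dst/"],
    ["rsync", "-a", "testbed.example:/data/dst/", "/data/back/"],
    ["ab", "-n", "10000", "-c", "50", "http://testbed.example/db-heavy.php"],
]


def run_rounds(commands, rounds, pause=0.0):
    """Run every command once per round, yielding (round, argv, exit_code).

    exit_code is None when the command could not be started at all
    (for example, the binary is not installed).
    """
    for rnd in range(rounds):
        for argv in commands:
            try:
                proc = subprocess.run(
                    argv,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
                code = proc.returncode
            except OSError:
                code = None
            yield rnd, argv, code
        time.sleep(pause)  # short breather between rounds


def report_failures(results):
    """Print each non-zero or unstartable command and return how many failed."""
    failures = 0
    for rnd, argv, code in results:
        if code != 0:
            failures += 1
            print(f"round {rnd}: {argv[0]} exited with {code}")
    return failures
```

Something like report_failures(run_rounds(EXAMPLE_COMMANDS, rounds=1000, 
pause=1.0)) would then keep the box busy until it either finishes or 
freezes.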

Regards,
Atanas