From: Atanas
Date: Wed, 14 Dec 2005 18:09:52 -0800
To: Peter Jeremy
Cc: freebsd-stable@freebsd.org
Subject: Re: 6.0 random freezes
Message-ID: <43A0D070.7020103@asd.aplus.net>
In-Reply-To: <20051213100034.GE77268@cirb503493.alcatel.com.au>

Peter Jeremy said the following on 12/13/05 02:00:
>
> Note that PS/2 keyboards aren't hot-pluggable and attempts to do so
> can have deleterious effects on your keyboard and/or motherboard.
> In any case, the probe/attach sequence relies on the kernel being in a
> reasonably sane state (and I'm not sure if it will detect the keyboard
> as a console device except at boot time).
>

I agree, but the keyboard is a passive device (with no power source,
i.e. mostly harmless), and it's standard practice to have only a few
movable consoles for several racks and plug them in only where
necessary. It has always worked for us, and I don't remember any
hot-plugging accidents in years.

> If the keyboard has been plugged in since the system booted, do you
> still get the same "no response"? If so, the kernel has wedged at
> a fairly low level and I'm not quite sure how to proceed other than
> by enabling the sanity checks that other people have mentioned
> (eg WITNESS, INVARIANTS) and hoping they catch something.
>

I cannot say for sure. When it happens I'm usually away, and by the
time I get there someone could have used the console. I'm in the
process of getting a serial console, so if that gets no response
either, I will enable the sanity checks.

> I only mentioned serial consoles on the off-chance that you had one.
> Whilst it may not help here, serial consoles have a number of
> advantages when managing remote equipment
>

Thanks for pointing this out. As I said, I'm in the process of getting
one for now, and possibly equipping several dozen servers with them
later.

>>After the downgrade we could eventually set a test bed and start
>>hammering it with requests. The problem would be how to trigger the
>>crash and whether we would be able to reproduce it at all.
>

I already went the 5.4 downgrade route. Actually, I was forced to do so
the other night, when one of the machines started hanging every half
hour or so. It looks like the background fsck on the slower SATA-based
RAID5 array helped a lot with that. Now I have the test bed online.
This is the very same server (SCSI-based, with the OS drive intact and
the production data drives moved elsewhere) that was crashing once a
day or so. Hopefully tomorrow I will have a serial console attached to
it, so we can start pounding it. I hope this machine won't need to go
into production during the next month or so and we'll have enough time
for tests.

> Depending on your application and the interfaces to it, it might be
> feasible to either tee live traffic into both systems and just junk
> the responses from your test bed, or "record" live traffic and
> replay it into your test bed.
>

It runs a fairly complex set of services. It's a shared web hosting
server handling several hundred websites, plus email (SMTP/POP3/IMAP),
databases (MySQL), FTP, DNS, etc. I don't know how easy it would be to
implement that kind of traffic capture and replay on the test bed. It
seems kind of complicated at first sight (though I realize it might be
the only way to reproduce the crash). We might need some NAT (via
ipfw?), some services might not like their responses being junked, etc.

I was thinking about trying the kernel stress suite first. Or just have
something rsync-ing lots of files back and forth (possibly over the
network), run apache bench in a loop pointed at some database-intensive
page, etc.

Regards,
Atanas
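
P.S. A rough sketch of the "rsync plus apache bench in a loop" idea, in Python. This is only an illustration under assumptions, not something that has been run against the test bed: the function names build_commands and run_rounds are made up, and the URL and directories passed in are placeholders. It relies only on ab's -n (total requests) and -c (concurrency) options and rsync's -a (archive) option.

```python
import subprocess
import time


def build_commands(url, src_dir, dst_dir, requests=1000, concurrency=10):
    """Return the external commands for one stress round (hypothetical helper)."""
    return [
        # ApacheBench: -n is the total request count, -c the concurrent clients.
        ["ab", "-n", str(requests), "-c", str(concurrency), url],
        # rsync in archive mode, copying src_dir into dst_dir.
        ["rsync", "-a", src_dir, dst_dir],
    ]


def run_rounds(commands, rounds, run=subprocess.run, pause=1.0, sleep=time.sleep):
    """Run every command once per round; return how many exited non-zero.

    `run` and `sleep` are injectable so the loop can be exercised without
    actually launching ab or rsync.
    """
    failures = 0
    for _ in range(rounds):
        for cmd in commands:
            result = run(cmd)
            if result.returncode != 0:
                failures += 1
        # Short pause between rounds so the box is not purely CPU-bound.
        sleep(pause)
    return failures
```

Something like run_rounds(build_commands(url, src, dst), rounds=100) would then keep both loads going, with url pointing at a database-heavy page on the test bed. Note that subprocess.run raises FileNotFoundError if ab or rsync is not installed, so that shows up immediately rather than as a counted failure.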