From: Mitch Lichtenberg
To: "'current@freebsd.org'"
Subject: Hard hangs of -current under heavy load - how to debug?
Date: Sun, 26 Jul 1998 12:08:36 -0700
Sender: owner-freebsd-current@FreeBSD.ORG

I've been experiencing some random hangs on -current releases over the past few months (I'm currently at 3.0-19980723, but I've seen this since last December). The systems operate under heavy load for about 24 hours, then one or two randomly hang. The hangs are hard (no console messages, no dumps/traps, can't escape to the debugger). It looks like interrupts are disabled.

Generally, how do you debug a hang like this? Are there any generic techniques or kernel options that I can enable to help me figure this one out?
My next step is to hook up a button to the NMI line to see if I can get into DDB that way, but perhaps there's something easier I can do in the meantime, or maybe there are known problems with my configuration that someone can point out to me.

----

Workload / system description, for those that are interested:

I've got a network of ten identical machines. They netboot from a "master" machine (I wrote a netboot driver for the DEC DC21143 Ethernet chip, if anyone's interested). The workload is a distributed storage application I'm working on, which generates a huge amount of UDP traffic and disk I/O. When the tests are running, the net and disk are running flat out, near maximum throughput. The application is basically I/O bound - I seldom see more than 15% CPU utilization.

At present, some PCs are servers (lots of disk and net traffic), and some are clients (only net traffic). Both the clients and servers are affected by this problem, so I'm tempted to believe the disk is OK, but servers do crash more often than clients. The "master" machine, identical to the others, has never crashed.

Could there be anything screwy about the hardware interrupt mechanism, or known problems with the VIA VP2/97 chipset?

(See http://www.research.digital.com/SRC/personal/Ed_Lee/Petal/petal.html if you'd like to know more about the project.)

Basic configuration:

  Motherboard: FIC PA-2007 (VIA VP2/97 chipset (for ECC))
  Processor:   Cyrix 6x86MX
  Memory:      64MB
  Disk:        Four IBM Deskstar 8.4GB, UltraDMA, all masters
               (Promise Ultra33 IDE controller for drives 3 and 4)
  Network:     DEC DE500-BA (DC21143) 100Mb/s, connected to a
               Prominet fast ethernet switch

The machines boot via netboot.

Thanks!

Mitch Lichtenberg
COMPAQ Systems Research Center (yes, formerly Digital Equipment Corp.)
Palo Alto, CA.
mitch@pa.dec.com