From: Stijn Hoop
To: freebsd-stable@freebsd.org
Date: Mon, 20 Sep 2004 15:03:04 +0200
Subject: [long] ATA timeout problems on -STABLE
Message-ID: <20040920130304.GK827@pcwin002.win.tue.nl>

Hi,

Short explanation: I have a box running 4.10-RELEASE-p2 suffering from severe ATA timeout problems once every few days, and I cannot determine for the life of me what's causing it. I'd really like some hints on how to determine the cause of this, as I think I've ruled out hardware related stuff.
Long story: This box consists of

- Gigabyte GA-7N400 Pro F8 mobo
- Athlon XP 2500+ CPU
- 1 Gig PC2700 DDR DRAM in 2 modules
- Adaptec 2940 Ultra SCSI adapter
- 1 x COMPAQ WDE4550W SCSI-2 4G drive for the OS
- 2 x Promise Ultra100 IDE controllers (not the TX2)
- 4 x Maxtor 200G 7200 RPM drives, connected to the master/slave positions of the onboard IDE controllers (yes I know better setups exist but these drives wouldn't play nice with the Promises)
- 4 x Maxtor 120G 7200 RPM drives, all connected to the 4 master positions on the two Promise controllers

Now this setup, as you can imagine, draws a lot of power. The problems actually began a few months back, when I replaced 2 x 60 and 2 x 80 gig drives with the 4 x 200 above. The machine just wouldn't boot, or randomly 'lost' IDE drives. The basic working setup I arrived at was to add a second power supply; I was not overjoyed at this, but at the time I thought more power was what was needed. If this turned out to help enough, my plan was to go out and buy an expensive 550W or 600W model.

Unfortunately, while things appeared to work at first, once in a while one of the ATA drives would mysteriously 'fallback to PIO mode' or even indicate that a block could not be read or written. The first few times I took out the indicated drive and ran it through the Maxtor test program, and every time the drive came back as OK, so I don't think it's the drives themselves.

On the ATA drives are 3 vinum RAID-5 setups, and every time vinum would of course correctly indicate that the affected volume was running in degraded mode. For my experiences with hot rebuilding, see my post from a few weeks back (basically: don't try to do that).

In any case, there was no pattern to the failures -- I have seen the exact same error messages on both the onboard IDE controllers and the Promises, and with both the 120G and the 200G drives.
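One way to test the "no pattern" impression is to tally the kernel's ATA messages per device over a longer stretch of /var/log/messages. What follows is only a minimal Python sketch, not something that was run on this box: the function name, the regular expressions and the two message types it looks for are assumptions based on the message formats quoted in this post, and the SAMPLE lines are copied from the Sep 17 log excerpt.

```python
import re
from collections import Counter

# Assumed message formats, taken from the kernel lines quoted in this post:
#   "... /kernel: ad10: DMA problem fallback to PIO mode"
#   "... /kernel: ad10s1e: hard error reading fsbn ..."
ATA_LINE = re.compile(
    r"/kernel: (ad\d+)\w*: (DMA problem fallback to PIO mode|hard error)"
)
# syslogd collapses identical consecutive lines into this form.
REPEAT_LINE = re.compile(r"last message repeated (\d+) times")


def tally_ata_errors(lines):
    """Count ATA error messages per (device, message type)."""
    counts = Counter()
    last_key = None  # key of the previous line, if it was an ATA error
    for line in lines:
        match = ATA_LINE.search(line)
        if match:
            last_key = (match.group(1), match.group(2))
            counts[last_key] += 1
            continue
        repeat = REPEAT_LINE.search(line)
        if repeat and last_key is not None:
            # Credit the repeats to the ATA message they follow.
            counts[last_key] += int(repeat.group(1))
        else:
            last_key = None
    return counts


SAMPLE = [
    "Sep 17 12:17:20 sandcat /kernel: ad10: DMA problem fallback to PIO mode",
    "Sep 17 12:17:20 sandcat last message repeated 4 times",
    "Sep 17 12:21:41 sandcat /kernel: ad10s1e: hard error reading fsbn 13008689 "
    "of 6504313-6504392 (ad10s1 bn 13008689; cn 12905 tn 7 sn 8) status=59 error=40",
    "Sep 17 12:21:41 sandcat /kernel: vinum: local.p0.s3 is crashed by force",
    "Sep 17 12:21:41 sandcat /kernel: vinum: local.p0 is degraded",
]

if __name__ == "__main__":
    for (device, kind), count in sorted(tally_ata_errors(SAMPLE).items()):
        print(device, kind, count)
```

Fed the full log instead of SAMPLE, a tally that is spread evenly over all ad devices would support the "no pattern" reading, while a cluster on one controller or cable would not.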
Here's an example:

  Sep 17 12:17:20 sandcat /kernel: ad10: DMA problem fallback to PIO mode
  Sep 17 12:17:20 sandcat last message repeated 4 times
  Sep 17 12:21:41 sandcat /kernel: ad10s1e: hard error reading fsbn 13008689 of 6504313-6504392 (ad10s1 bn 13008689; cn 12905 tn 7 sn 8) status=59 error=40
  Sep 17 12:21:41 sandcat /kernel: vinum: local.p0.s3 is crashed by force
  Sep 17 12:21:41 sandcat /kernel: vinum: local.p0 is degraded

Still suspecting power, I have in the meantime replaced one of the PSUs with another one, and even added a third. All the +12V and +5V amp totals that the PSUs could deliver were triple-checked against the specs of the drives, mobo and CPU, and should have been more than enough.

I tried to monitor the voltages with sysutils/xmbmon, and got lines like this:

  Temp.= 36.0, 49.0, 43.0; Rot.= 3183, 0, 2710
  Vcore = 1.65, 2.62; Volt. = 3.34, 4.27, 11.37, -5.34, -1.95

which initially confirmed my suspicions. However, the box kept crashing. So, urged by some friends, today I took up a multimeter and measured the voltages on the connectors; and this is where I came away totally clueless, because the multimeter measured 5.07V on the +5V line and 12.01V on the +12V. Other than greatly decreasing my confidence in sysutils/xmbmon, this also shattered my PSU theory.

Other causes that I can think of are of course heat and memory, but there is no other instability in this box whatsoever. Even when loading all disks at the same time (dd if=/dev/ad[0-10] of=/dev/null bs=1m) and loading the processor with a CPU intensive task, nothing crashes. I would have expected lots of other symptoms (sig11 etc) in case of overheating or bad memory. I'm still planning to do a memtest when I can take the box offline, but I'm skeptical as to the outcome. Besides that, the temperature readings of xmbmon are within the expected ranges, although of course the question remains whether xmbmon spits out the right values.
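To put numbers on the disagreement between xmbmon and the multimeter, the deviation of each reading from its nominal rail can be worked out. This is a rough Python sketch under two assumptions that are not stated in the post: that the 4.27 and 11.37 values in the xmbmon output are the +5V and +12V rails, and that a +/-5% tolerance (the figure commonly given in the ATX power supply guidelines for these rails) is the right yardstick.

```python
# Assumption: +/-5% is the acceptable band for the +5V and +12V rails.
TOLERANCE = 0.05


def deviation(nominal, measured):
    """Relative deviation of a measured voltage from its nominal value."""
    return (measured - nominal) / nominal


def within_tolerance(nominal, measured, tol=TOLERANCE):
    """True if the measured voltage lies inside nominal +/- tol."""
    return abs(deviation(nominal, measured)) <= tol


# Readings quoted in this post. Mapping the xmbmon columns to the
# +5V and +12V rails is an assumption, not something xmbmon labels.
READINGS = {
    "xmbmon": {5.0: 4.27, 12.0: 11.37},
    "multimeter": {5.0: 5.07, 12.0: 12.01},
}

if __name__ == "__main__":
    for source, rails in READINGS.items():
        for nominal, measured in rails.items():
            status = "ok" if within_tolerance(nominal, measured) else "OUT OF RANGE"
            print(
                f"{source:10s} +{nominal:g}V rail: {measured:.2f}V "
                f"({deviation(nominal, measured):+.1%}) {status}"
            )
```

Under those assumptions both xmbmon values fall outside the band and both multimeter values fall well inside it, which is consistent with distrusting the sensor readout rather than the PSUs.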
Basically my question is open-ended: what would you check when confronted with such a situation? I'm really baffled by now, and would *greatly* like to keep this box up for > 1 week...

As posted above, this is on 4.10-RELEASE-p2; dmesg & pciconf -lv (along with a copy of this email) are available at

http://sandcat.nl/~stijn/freebsd/ataproblem/

Thanks for _any_ hints on this...

--Stijn

-- 
"Computer games don't affect kids; I mean if Pac-Man affected us as kids, we'd all be running around in darkened rooms, munching magic pills and listening to repetitive electronic music."
        -- Kristian Wilson, Nintendo, Inc., 1989