From owner-freebsd-questions@FreeBSD.ORG Thu Jun 9 14:30:06 2005
X-Original-To: freebsd-questions@freebsd.org
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5D39F16A41F for ; Thu, 9 Jun 2005 14:30:06 +0000 (GMT) (envelope-from prefect@sidehack.sat.gweep.net)
Received: from sidehack.sat.gweep.net (sidehack.sat.gweep.net [204.145.148.154]) by mx1.FreeBSD.org (Postfix) with SMTP id 8A0F143D49 for ; Thu, 9 Jun 2005 14:30:05 +0000 (GMT) (envelope-from prefect@sidehack.sat.gweep.net)
Received: (qmail 74757 invoked by uid 504); 9 Jun 2005 14:30:02 -0000
Date: Thu, 9 Jun 2005 10:30:02 -0400
From: Steve Richardson
To: freebsd-questions@freebsd.org
Message-ID: <20050609143002.GA74546@sidehack.sat.gweep.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.2.1i
X-Corp: acceltalk business machines
Subject: FBSD 5.4-STABLE/3Ware Escalade 7506-4LP on dual Opteron issue
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions
X-List-Received-Date: Thu, 09 Jun 2005 14:30:06 -0000

Hi,

We're building out a brand new dual Opteron box to run our public access unix site. We're running FreeBSD 5.4 and a 3Ware Escalade 7506-4LP. We are having difficulties with the system, and any help you can offer would be greatly appreciated.

For the most part, everything behaves fine. We've got the system built and installed. Unfortunately, we're having a periodic, catastrophic failure involving the 3Ware card. Periodically, the system will partly lock up with the following errors:

twe0: unexpected status bit(s) 100000
twe0: PCI abort, clearing.

I say partly lock up because the kernel does not panic, nor do the console keyboard or network interfaces become non-responsive (i.e. you can type stuff at the login prompt, and ping the server). However, the disk subsystem does appear to cease functioning once this has occurred.

Frankly, at this point we are baffled, because the system is stable enough to run for days on end under light load, and will even occasionally handle periods of medium disk load (e.g. many hours of rsyncing from our live server, build world, etc).

We have been using the bonnie++ hard disk benchmarking suite as a means of reproducing the problem, as follows:

> mkdir testdir
> bonnie++ -d ./dbench -s 2g -n 100:500000:1000 -x 100

I've included system information below, including dmesg output.

regards,
Steve Richardson
System Administrator
GweepNet Cooperative Network

System Description:

Gigabyte GA-7A8DW motherboard
(2) AMD Opteron 246 2GHz CPUs
2GB Samsung PC3200 ECC RAM
3Ware Escalade 7506-4LP parallel ATA RAID, installed in 64 bit PCI slot
OS: FreeBSD 5.4-STABLE FreeBSD 5.4-STABLE amd64

dmesg output:

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.4-STABLE #2: Tue Jun 7 00:10:29 EDT 2005
root@newsidey.gweep.net:/usr/obj/usr/src/sys/SIDEHACK
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Opteron(tm) Processor 246 (1993.79-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0xf5a Stepping = 10
Features=0x78bfbff
AMD Features=0xe0500800
real memory = 2146893824 (2047 MB)
avail memory = 2061205504 (1965 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 24-27 on motherboard
ioapic2 irqs 28-31 on motherboard
acpi0: on motherboard
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
acpi_bus_number: can't get _ADR
acpi_bus_number: can't get _ADR
acpi_bus_number: can't get _ADR
unknown: I/O range not supported
unknown: I/O range not supported
ACPI-1304: *** Error: Method execution failed [\\_SB_.PCI0.LPC_.LPT_._CRS] (Node 0xffffff0000a70080), AE_AML_BUFFER_LIMIT
ACPI-0239: *** Error: Method execution failed [\\_SB_.PCI0.LPC_.LPT_._CRS] (Node 0xffffff0000a70080), AE_AML_BUFFER_LIMIT
can't fetch resources for \\_SB_.PCI0.LPC_.LPT_ - AE_AML_BUFFER_LIMIT
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x8008-0x800b on acpi0
cpu0: on acpi0
cpu1: on acpi0
acpi_button0: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
pcib1: at device 1.0 on pci0
pci1: on pcib1
pci1: at device 0.0 (no driver attached)
pcib2: at device 6.0 on pci0
pci2: on pcib2
ohci0: mem 0xd0110000-0xd0110fff irq 19 at device 0.0 on pci2
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: on ohci0
usb0: USB revision 1.0
uhub0: AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 3 ports with 3 removable, self powered
ohci1: mem 0xd0111000-0xd0111fff irq 19 at device 0.1 on pci2
usb1: OHCI version 1.0, legacy support
usb1: SMM does not respond, resetting
usb1: on ohci1
usb1: USB revision 1.0
uhub1: AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 3 ports with 3 removable, self powered
ahc0: port 0x3000-0x30ff mem 0xd0112000-0xd0112fff irq 17 at device 4.0 on pci2
aic7850: Single Channel A, SCSI Id=7, 3/253 SCBs
bge0: mem 0xd0100000-0xd010ffff irq 19 at device 5.0 on pci2
miibus0: on bge0
brgphy0: on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge0: Ethernet address: 00:0f:ea:7e:b1:81
atapci0: port 0x3400-0x340f,0x3410-0x3413,0x3418-0x341f,0x3414-0x3417,0x3420-0x3427 mem 0xd0113000-0xd01133ff irq 18 at device 6.0 on pci2
ata2: channel #0 on atapci0
ata3: channel #1 on atapci0
ata4: channel #2 on atapci0
ata5: channel #3 on atapci0
isab0: at device 7.0 on pci0
isa0: on isab0
atapci1: port 0x1000-0x100f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 7.1 on pci0
ata0: channel #0 on atapci1
ata1: channel #1 on atapci1
pci0: at device 7.3 (no driver attached)
pcib3: on acpi0
pci8: on pcib3
pcib4: at device 3.0 on pci8
pci9: on pcib4
pci8: at device 3.1 (no driver attached)
pcib5: at device 4.0 on pci8
pci14: on pcib5
twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0x4000-0x400f mem 0xf0800000-0xf0ffffff irq 30 at device 2.0 on pci14
twe0: 4 ports, Firmware FE7X 1.05.00.068, BIOS BE7X 1.08.00.048
pci8: at device 4.1 (no driver attached)
atkbdc0: port 0x64,0x60 irq 1 on acpi0
atkbd0: flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
fdc0: port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
ppc0: cannot reserve I/O port range
ppc0: cannot reserve I/O port range
orm0: at iomem 0xd0000-0xd0fff,0xc0000-0xcffff on isa0
ppc0: cannot reserve I/O port range
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
ahc0: Someone reset channel A
ad0: 152627MB [310101/16/63] at ata0-master UDMA100
ad2: 286188MB [581463/16/63] at ata1-master UDMA133
Waiting 15 seconds for SCSI devices to settle
twed0: on twe0
twed0: 305253MB (625159424 sectors)
sa0 at ahc0 bus 0 target 3 lun 0
sa0: Removable Sequential Access SCSI-2 device
sa0: 10.000MB/s transfers (10.000MHz, offset 15)
SMP: AP CPU #1 Launched!
Mounting root from ufs:/dev/twed0s1a
WARNING: / was not properly dismounted
WARNING: /home/crib was not properly dismounted
WARNING: /home/domus was not properly dismounted
WARNING: /tmp was not properly dismounted
WARNING: /u was not properly dismounted
WARNING: /u/backup/nearline was not properly dismounted
WARNING: /u/backup/online was not properly dismounted
WARNING: /u/news was not properly dismounted
WARNING: /u/news/nntpcached was not properly dismounted
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted
WARNING: /var/tmp was not properly dismounted
bge0: firmware handshake timed out
bge0: RX CPU self-diagnostics failed!
bge0: watchdog timeout -- resetting
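
To track how often the failure recurs across test runs, the two twe0 messages reported in this post could be tallied from a saved dmesg or syslog capture with something like the sketch below. It is only an illustration: the function name summarize_twe_errors is made up here, and reading the status value as hexadecimal is an assumption, not something the driver output confirms.

```python
import re
from collections import Counter

# Patterns for the two messages quoted in this report. search() is used so
# that lines carrying a syslog prefix (timestamp, host, "kernel:") also match.
STATUS_RE = re.compile(r"(twe\d+): unexpected status bit\(s\) ([0-9a-fA-F]+)")
ABORT_RE = re.compile(r"(twe\d+): PCI abort, clearing\.")


def summarize_twe_errors(lines):
    """Count status-bit and PCI-abort messages per controller.

    Returns (status_counts, abort_counts). status_counts is keyed by
    (controller, status_value); the value is parsed as hex, which is an
    assumption about how the driver prints it.
    """
    status_counts = Counter()
    abort_counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            status_counts[(m.group(1), int(m.group(2), 16))] += 1
            continue
        m = ABORT_RE.search(line)
        if m:
            abort_counts[m.group(1)] += 1
    return status_counts, abort_counts


if __name__ == "__main__":
    sample = [
        "twe0: unexpected status bit(s) 100000",
        "twe0: PCI abort, clearing.",
        "bge0: watchdog timeout -- resetting",
    ]
    status, aborts = summarize_twe_errors(sample)
    for (ctrl, bits), n in status.items():
        print(f"{ctrl}: status 0x{bits:x} seen {n} time(s)")
    for ctrl, n in aborts.items():
        print(f"{ctrl}: PCI abort seen {n} time(s)")
```

Feeding it the lines of a log file (for example, open("messages").readlines(), with the file name being whatever capture is at hand) gives a count per controller that can be compared between bonnie++ runs.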