From: Johny Mattsson
Date: Sat, 25 Jun 2005 17:35:02 +1000
To: freebsd-stable@freebsd.org
Cc: twesky@gmail.com, sos@freebsd.org
Subject: Re: ATA_DMA errors - [ workaround for me ]
Message-ID: <42BD0926.8000804@earthmagic.org>
In-Reply-To: <8d02aed005062412001c7903b3@mail.gmail.com>

Hi all,

Today I've taken a fresh stab at the problem (I'm never at my best at 5am, having worked through the night), and I have managed to come up with what appears to be a successful workaround. It would be good if someone else could confirm my observations.

Basically, the problem seems to be related to using more than one channel on the IDE controller. Data points for this are:

[ SiI 0680 ]
  Channel 1: 40 GB Seagate
  Channel 2: 60 GB Seagate + 160 GB Western Digital

Result: 200k worth of "DMA_READ timed out" and "DMA_WRITE UDMA ICRC error" messages, inability to obtain SMART info from the WD drive, WD drive info garbled, and the WD drive being removed/detached from the config.

The errors only appeared after a few hours of operation, but once they were there, no amount of rebooting would get rid of them or improve the situation. To attempt to save the data on the WD disk before the FS got completely hammered, I pulled it out, and observed the following:

[ SiI 0680 ]
  Channel 1: 40 GB Seagate
  Channel 2: 60 GB Seagate

Result: "DMA_READ timed out" errors for both drives, and "DMA_WRITE UDMA ICRC error" messages for the 60 GB Seagate.

Since I had an older ATA-100 controller available, I tried with it (it can't handle >120GB drives though, so I couldn't try as many combinations as I would have liked):

[ CMD 649 ]
  Channel 1: 40 GB Seagate
  Channel 2: 60 GB Seagate

Result: "DMA_READ timed out" errors, but only when both drives are in use at the same time. Running fsck on a slice on each drive in parallel reliably reproduced the DMA_READ errors. Whenever an error was reported for one drive, another error for the other drive always followed right after.

[ CMD 649 ]
  Channel 1:
  Channel 2: 40 GB Seagate + 60 GB Seagate

Result: No error messages.

[ CMD 649 ]
  Channel 1: 40 GB Seagate + 60 GB Seagate
  Channel 2:

Result: No error messages.

Encouraged by these findings, I swapped back to the SiI controller to test the 160 GB drive:

[ SiI 0680 ]
  Channel 1:
  Channel 2: 160 GB WD

Result: No error messages.

[ SiI 0680 ]
  Channel 1: 160 GB WD
  Channel 2:

Result: No error messages.

Finally, I tried everything together:

[ SiI 0680 ]
  Channel 1: 160 GB WD
  Channel 2:
[ CMD 649 ]
  Channel 1: 40 GB Seagate + 60 GB Seagate
  Channel 2:

Result: No error messages.

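The data points above can be checked mechanically against the "more than one channel per controller" idea. What follows is only a rough sketch in Python, not something that was run as part of the testing. The names OBSERVATIONS, uses_multiple_channels and hypothesis_holds are made up for illustration, and the third entry simplifies "errors only when both drives are in use at the same time" to a plain "errors seen".

```python
# Hypothetical encoding of the eight configurations listed above.
# Each entry: ({controller: (channel 1 drives, channel 2 drives)}, errors_seen)
OBSERVATIONS = [
    ({"SiI 0680": (["40GB Seagate"], ["60GB Seagate", "160GB WD"])}, True),
    ({"SiI 0680": (["40GB Seagate"], ["60GB Seagate"])}, True),
    # Errors here appeared only while both drives were busy at the same time.
    ({"CMD 649": (["40GB Seagate"], ["60GB Seagate"])}, True),
    ({"CMD 649": ([], ["40GB Seagate", "60GB Seagate"])}, False),
    ({"CMD 649": (["40GB Seagate", "60GB Seagate"], [])}, False),
    ({"SiI 0680": ([], ["160GB WD"])}, False),
    ({"SiI 0680": (["160GB WD"], [])}, False),
    ({"SiI 0680": (["160GB WD"], []),
      "CMD 649": (["40GB Seagate", "60GB Seagate"], [])}, False),
]


def uses_multiple_channels(controllers):
    """Return True if any one controller has drives on more than one channel."""
    return any(
        sum(1 for drives in channels if drives) > 1
        for channels in controllers.values()
    )


def hypothesis_holds(observations):
    """Return True if errors were seen exactly when a controller used both channels."""
    return all(
        uses_multiple_channels(controllers) == errors_seen
        for controllers, errors_seen in observations
    )


if __name__ == "__main__":
    for controllers, errors_seen in OBSERVATIONS:
        print(uses_multiple_channels(controllers), errors_seen, sorted(controllers))
    print("consistent with hypothesis:", hypothesis_holds(OBSERVATIONS))
```

All eight configurations agree with the rule, which is of course not proof of anything, only a check that the list contains no counter-example.
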
What I haven't mentioned above is that I also tried some combinations with different cables, and also at reduced speed (UDMA66 vs UDMA100). Neither change had any effect on the behaviour.

With the WD drive alone on the SiI 0680, I was also able to retrieve SMART information from it, and it shows no errors for the drive at all. The same goes for the 60 GB Seagate drive. All drives pass their self-tests without any errors.

As mentioned in my previous email, my system drive is hanging off the built-in PIIX4 controller, as a single drive with only one channel on the controller in use. I never saw any errors for that drive throughout my testing.

My conclusion is thus that something has crept in that affects stability when multiple channels are used on the same controller. I'm not versed enough in driver internals to know if it's IRQ, DMA, ISR or anything-else related though. Below are my latest dmesg and pciconf listings - hopefully this will help someone locate the culprit. (Soren?)

So, now I'm stuck with a system with three IDE controllers and one SCSI controller, and a motherboard that is utterly confused when I ask it to boot off an external controller... (i.e. I can only boot off the built-in controller now).

Please let me know if there's some other info I can get for you; I'll have limited ability to move drives around since this is the file server and people get annoyed when it's unavailable, but do ask if you think it will help you! :)

Cheers,
/Johny

======= dmesg ========
Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RELEASE #0: Sun May 8 10:21:06 UTC 2005
    root@harlow.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Pentium II/Pentium II Xeon/Celeron (467.73-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x665  Stepping = 5
  Features=0x183f9ff
real memory  = 805240832 (767 MB)
avail memory = 778231808 (742 MB)
npx0: on motherboard
npx0: INT 16 interface
acpi0: on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
cpu0: on acpi0
acpi_throttle0: on cpu0
acpi_button0: on acpi0
pcib0: port 0x5000-0x500f,0x4000-0x4041,0xcf8-0xcff on acpi0
pci0: on pcib0
agp0: mem 0xe0000000-0xe3ffffff at device 0.0 on pci0
pcib1: at device 1.0 on pci0
pci1: on pcib1
isab0: at device 7.0 on pci0
isa0: on isab0
atapci0: port 0xf000-0xf00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 7.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
uhci0: port 0x9000-0x901f irq 11 at device 7.2 on pci0
usb0: on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
pci0: at device 7.3 (no driver attached)
atapci1: port 0xa400-0xa40f,0xa000-0xa003,0x9c00-0x9c07,0x9800-0x9803,0x9400-0x9407 mem 0xea001000-0xea0010ff irq 11 at device 9.0 on pci0
ata2: channel #0 on atapci1
ata3: channel #1 on atapci1
atapci2: port 0xb800-0xb80f,0xb400-0xb403,0xb000-0xb007,0xac00-0xac03,0xa800-0xa807 irq 9 at device 10.0 on pci0
ata4: channel #0 on atapci2
ata5: channel #1 on atapci2
pci0: at device 11.0 (no driver attached)
ahc0: port 0xbc00-0xbcff mem 0xea000000-0xea000fff irq 10 at device 12.0 on pci0
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
rl0: port 0xc000-0xc0ff mem 0xea002000-0xea0020ff irq 11 at device 13.0 on pci0
miibus0: on rl0
rlphy0: on miibus0
rlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
rl0: Ethernet address: 00:40:f4:28:9d:20
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
ppc0: port 0x778-0x77b,0x378-0x37b irq 7 drq 3 on acpi0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/16 bytes threshold
ppbus0: on ppc0
plip0: on ppbus0
lpt0: on ppbus0
lpt0: Interrupt-driven port
ppi0: on ppbus0
atkbdc0: port 0x64,0x60 irq 1 on acpi0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
psm0: irq 12 on atkbdc0
psm0: model IntelliMouse Explorer, device ID 4
orm0: at iomem 0xd0000-0xd07ff,0xc0000-0xc7fff on isa0
pmtimer0 on isa0
fdc0: cannot allocate I/O port (6 ports)
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
Timecounter "TSC" frequency 467728754 Hz quality 800
Timecounters tick every 10.000 msec
ad0: 8207MB [16676/16/63] at ata0-master UDMA33
ad4: 152627MB [310101/16/63] at ata2-master UDMA100
ad8: 57241MB [116301/16/63] at ata4-master UDMA100
ad9: 76319MB [155061/16/63] at ata4-slave UDMA100
Waiting 15 seconds for SCSI devices to settle
sa0 at ahc0 bus 0 target 4 lun 0
sa0: Removable Sequential Access SCSI-2 device
sa0: 5.000MB/s transfers (5.000MHz, offset 8)
sa1 at ahc0 bus 0 target 6 lun 0
sa1: Removable Sequential Access SCSI-2 device
sa1: 10.000MB/s transfers (10.000MHz, offset 15)
cd0 at ahc0 bus 0 target 5 lun 0
cd0: Removable CD-ROM SCSI-2 device
cd0: 20.000MB/s transfers (20.000MHz, offset 15)
cd0: Attempt to query device size failed: NOT READY, Medium not present
Mounting root from ufs:/dev/ad0s1a
-----------------------

======= pciconf -lv ==========
# pciconf -lv
agp0@pci0:0:0: class=0x060000 card=0x00000000 chip=0x71908086 rev=0x02 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82443BX/ZX 440BX/ZX CPU to PCI Bridge (AGP Implemented)'
    class    = bridge
    subclass = HOST-PCI
pcib1@pci0:1:0: class=0x060400 card=0x00000000 chip=0x71918086 rev=0x02 hdr=0x01
    vendor   = 'Intel Corporation'
    device   = '82443BX/ZX 440BX/ZX AGPset PCI-to-PCI bridge'
    class    = bridge
    subclass = PCI-PCI
isab0@pci0:7:0: class=0x060100 card=0x00000000 chip=0x71108086 rev=0x02 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82371AB/EB/MB PIIX4/4E/4M ISA Bridge'
    class    = bridge
    subclass = PCI-ISA
atapci0@pci0:7:1: class=0x010180 card=0x00000000 chip=0x71118086 rev=0x01 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82371AB/EB/MB PIIX4/4E/4M IDE Controller'
    class    = mass storage
    subclass = ATA
uhci0@pci0:7:2: class=0x0c0300 card=0x00000000 chip=0x71128086 rev=0x01 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82371AB/EB/MB PIIX4/4E/4M USB Interface'
    class    = serial bus
    subclass = USB
none0@pci0:7:3: class=0x068000 card=0x00000000 chip=0x71138086 rev=0x02 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82371AB/EB/MB PIIX4/4E/4M Power Management Controller'
    class    = bridge
atapci1@pci0:9:0: class=0x010400 card=0x36801095 chip=0x06801095 rev=0x02 hdr=0x00
    vendor   = 'Silicon Image Inc (Was: CMD Technology Inc)'
    device   = 'SiI 0680 (Was: PCI-0680) Ultra ATA133 EIDE Controller'
    class    = mass storage
    subclass = RAID
atapci2@pci0:10:0: class=0x010400 card=0xf5ffffff chip=0x06491095 rev=0x02 hdr=0x00
    vendor   = 'Silicon Image Inc (Was: CMD Technology Inc)'
    device   = 'PCI-649 Ultra ATA/100 PCI to IDE/ATA Controller'
    class    = mass storage
    subclass = RAID
none1@pci0:11:0: class=0x030000 card=0x00000000 chip=0x0519102b rev=0x01 hdr=0x00
    vendor   = 'Matrox Electronic Systems Ltd.'
    device   = 'MGA-2064W Storm (Millennium board)'
    class    = display
    subclass = VGA
ahc0@pci0:12:0: class=0x010000 card=0x00000000 chip=0x81789004 rev=0x00 hdr=0x00
    vendor   = 'Adaptec Inc'
    device   = 'AHA-2940U/UW/2940D Ultra/Ultra Wide/Dual SCSI Host Adapter'
    class    = mass storage
    subclass = SCSI
rl0@pci0:13:0: class=0x020000 card=0x813910ec chip=0x813910ec rev=0x10 hdr=0x00
    vendor   = 'Realtek Semiconductor'
    device   = 'RT8139 (A/B/C/810x/813x/C+) Fast Ethernet Adapter'
    class    = network
    subclass = ethernet
--------------------------

--
Johny Mattsson - Making IT work        ,-.   ,-.   ,-.  When all else fails,
      http://www.earthmagic.org     _.'  `-'   `-'   Murphy's Law still works!