Date: Fri, 14 Dec 2001 16:43:19 -0800 (PST)
From: John Polstra <jdp@polstra.com>
To: hardware@freebsd.org
Subject: Question about a strange hardware problem
Message-ID: <XFMail.011214164319.jdp@polstra.com>

I've got an intermittent hardware problem on one of the CVSup mirror sites, and I would appreciate some experienced opinions about whether it's likely to be in the SCSI controller, the SCSI cable, the hard drive, or elsewhere.

The symptom is that the checkouts.cvs file, which maintains state between CVSup updates, occasionally gets 1-bit errors at random places in it. I haven't seen any similar errors in the actual content on the mirror; but it is on a different drive, its access patterns are different, and errors there might be less noticeable. The errors in checkouts.cvs cause updates to break until I intervene manually, so I notice those pretty quickly.

The motherboard is an Asus P2B-LS board with an on-board Adaptec chip. Here's the relevant part of the dmesg output:

ahc0: <Adaptec aic7890/91 Ultra2 SCSI adapter> port 0xd000-0xd0ff mem 0xe2000000-0xe2000fff irq 10 at device 6.0 on pci0
aic7890/91: Wide Channel A, SCSI Id=7, 32/255 SCBs
da1 at ahc0 bus 0 target 1 lun 0
da1: <IBM DNES-309170W SAH0> Fixed Direct Access SCSI-3 device
da1: 80.000MB/s transfers (40.000MHz, offset 30, 16bit), Tagged Queueing Enabled
da1: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)
da0 at ahc0 bus 0 target 0 lun 0
da0: <IBM DNES-309170W SA30> Fixed Direct Access SCSI-3 device
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)

The files with the 1-bit errors are on da0, which also has all of the OS files. The mirror content is on da1.

The OS version is FreeBSD-4.2-STABLE from around last January, plus a few security patches. The system has been up for 317 days. It seems like if the problem were in the RAM (the obvious place), it would have crashed by now. I can't remember whether it has ECC memory or not, and the system isn't physically accessible to me. If the errors were on the SCSI cable, parity checking ought to detect them.
And if the media were bad, I should be seeing some disk errors in the dmesg output. But I have never seen even one. The errors don't show up in specific disk blocks -- they appear to be at random places.

Given all that, it seems to me that the problem must be in the drive electronics of da0. What do you think?

As a test, I've moved the files that always show the errors over to da1 for a while, to see if that fixes it.

John

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hardware" in the body of the message
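
One way to characterize the 1-bit errors described in the message, assuming a known-good copy of checkouts.cvs is available for comparison (the message does not say that one is kept), is to XOR the two copies byte by byte and record where each flipped bit falls. The sketch below is only illustrative: the function names are hypothetical and not part of CVSup, and the sector numbers it reports are relative to the start of the file, not to the disk.

```python
SECTOR_SIZE = 512  # matches the "512 byte sectors" reported in the dmesg output


def find_bit_flips(good: bytes, bad: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, bit_index) for every bit that differs.

    bit_index 0 is the least significant bit of the byte.
    """
    if len(good) != len(bad):
        raise ValueError("lengths differ; this is not a pure bit-flip corruption")
    flips = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b  # set bits mark the positions that changed
        bit = 0
        while diff:
            if diff & 1:
                flips.append((offset, bit))
            diff >>= 1
            bit += 1
    return flips


def summarize(flips: list[tuple[int, int]]) -> list[tuple[int, int, int]]:
    """Map each flip to (file-relative sector, byte within sector, bit_index)."""
    return [(off // SECTOR_SIZE, off % SECTOR_SIZE, bit) for off, bit in flips]
```

To use it, read both copies in binary mode, for example with open(path, "rb").read(), and pass the two byte strings to find_bit_flips. If the flips kept landing on the same bit index or the same byte position within a sector, that might hint at a systematic fault; flips scattered across positions would be more consistent with the random pattern the message reports.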