Date: Mon, 10 May 2004 10:52:19 -0500
From: Doug Poland <doug@polands.org>
To: questions@freebsd.org
Subject: Need help diagnosing hardware failure
Message-ID: <20040510155158.GA37371@omniresources.com>

Hello,

Upon returning from a week's vacation, I was dismayed to find that my home file server (running 4.8-STABLE) had crashed. The box in question has an Adaptec host adapter

ahc0: <Adaptec 2940A Ultra SCSI adapter> port 0xf800-0xf8ff mem 0xfedfe000-0xfedfefff irq 10 at device 13.0 on pci0
aic7860: Ultra Single Channel A, SCSI Id=7, 3/253 SCBs

and seven identical SCSI drives

judeah# dmesg | grep IBMRAID
da0: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
da1: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
da2: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
da3: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
da4: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
da6: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
da5: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device

in a vinum striped volume...

judeah# more /etc/vinum.conf
drive a device /dev/da0e
drive b device /dev/da1e
drive c device /dev/da2e
drive d device /dev/da3e
drive e device /dev/da4e
drive f device /dev/da5e
drive g device /dev/da6e
volume dataraid
  plex org striped 256k
    sd length 1920m drive a
    sd length 1920m drive b
    sd length 1920m drive c
    sd length 1920m drive d
    sd length 1920m drive e
    sd length 1920m drive f
    sd length 1920m drive g

Perusal of /var/log/messages shows...

May 3 11:17:31 judeah /kernel: (da1:ahc0:0:1:0): SCB 0x5a - timed out
May 3 11:17:31 judeah /kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
May 3 11:17:31 judeah /kernel: ahc0: Dumping Card State while idle, at SEQADDR 0x7
May 3 11:17:31 judeah /kernel: Card was paused
May 3 11:17:31 judeah /kernel: ACCUM = 0x97, SINDEX = 0x52, DINDEX = 0x8c, ARG_2 = 0x0
May 3 11:17:31 judeah /kernel: HCNT = 0x0 SCBPTR = 0x1
May 3 11:17:31 judeah /kernel: SCSISIGI[0x0] ERROR[0x40] SCSIBUSL[0x0] LASTPHASE[0x1]
May 3 11:17:31 judeah /kernel: SCSISEQ[0x12] SBLKCTL[0x0] SCSIRATE[0x0] SEQCTL[0x10]
May 3 11:17:31 judeah /kernel: SEQ_FLAGS[0xc0] SSTAT0[0x5] SSTAT1[0xa] SSTAT2[0x0]
May 3 11:17:31 judeah /kernel: SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xa4] SXFRCTL0[0x80]
May 3 11:17:31 judeah /kernel: DFCNTRL[0x0] DFSTATUS[0x29]
May 3 11:17:31 judeah /kernel: STACK: 0x0 0x166 0x109 0x3
May 3 11:17:31 judeah /kernel: SCB count = 130
May 3 11:17:31 judeah /kernel: Kernel NEXTQSCB = 30
May 3 11:17:31 judeah /kernel: Card NEXTQSCB = 30
May 3 11:17:31 judeah /kernel: QINFIFO entries:
May 3 11:17:31 judeah /kernel: Waiting Queue entries:
May 3 11:17:31 judeah /kernel: Disconnected Queue entries: 2:90
May 3 11:17:31 judeah /kernel: QOUTFIFO entries:
May 3 11:17:31 judeah /kernel: Sequencer Free SCB List: 1 0
May 3 11:17:31 judeah /kernel: Sequencer SCB Info:
May 3 11:17:31 judeah /kernel: 0 SCB_CONTROL[0xe2] SCB_SCSIID[0x67] SCB_LUN[0x0] SCB_TAG[0xff]
May 3 11:17:31 judeah /kernel: 1 SCB_CONTROL[0xe2] SCB_SCSIID[0x67] SCB_LUN[0x0] SCB_TAG[0xff]
May 3 11:17:31 judeah /kernel: 2 SCB_CONTROL[0x66] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0x5a]
May 3 11:17:31 judeah /kernel: Pending list:
May 3 11:17:31 judeah /kernel: 90 SCB_CONTROL[0x62] SCB_SCSIID[0x17] SCB_LUN[0x0]
May 3 11:17:31 judeah /kernel: Kernel Free SCB list: 82 88 14 115 12 83 120 92 45 8 16 5 59 124 31 29 38 18 73 42 93 64 19 7 74 100 113 75 24 3 86 71 20 108 6 67 68 125 105 97 110 34 54 87 106 25 61 109 123 47 44 66 53 94 84 76 65 77 72 9 69 32 17 55 119 1 22 91 4 112 56 27 102 62 13 15 128 50 33 51 81 37 57 28 99 117 85 36 41 11 121 49 0 80 35 39 40 95 26 96 10 58 118 122 127 111 2 126 70 98 89 21 60 46 48 78 43 101 23 79 52 63 129 103 104 107 116 114
May 3 11:17:31 judeah /kernel:
May 3 11:17:31 judeah /kernel: <<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>

The box rebooted and failed to come up to its normal state because the vinum volume that was running off this SCSI disk system failed to load.

May 3 11:22:01 judeah /kernel: sg[0] - Addr 0x1ddd000 : Length 4096
May 3 11:22:01 judeah /kernel: sg[1] - Addr 0x7be000 : Length 4096
May 3 11:22:01 judeah /kernel: (da1:ahc0:0:1:0): no longer in timeout, status = 34b
May 3 11:22:01 judeah /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
May 3 11:22:01 judeah /kernel: vinum: dataraid.p0.s1 is stale by force
May 3 11:22:01 judeah /kernel: vinum: dataraid.p0 is corrupt
May 3 11:22:01 judeah /kernel: fatal :dataraid.p0.s1 write error, block 1905465 for 8192 bytes
May 3 11:22:01 judeah /kernel: dataraid.p0.s1: user buffer block 13336624 for 8192 bytes

It looks like SCSI disk da1 was timing out but recovered. This is speculation on my part. Upon rebooting today, da1 seems to be OK?

May 10 07:03:00 judeah /kernel: da1 at ahc0 bus 0 target 1 lun 0
May 10 07:03:00 judeah /kernel: da1: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
May 10 07:03:00 judeah /kernel: da1: 10.000MB/s transfers (10.000MHz, offset 15), Tagged Queueing Enabled
May 10 07:03:00 judeah /kernel: da1: 1920MB (3933040 512 byte sectors: 255H 63S/T 244C)

So, the question: do I have a hardware failure? If so, is it the Adaptec 2940/UW controller or the SCSI disk? When I get this resolved, I'll obviously have to figure out how to fix my corrupt vinum volume :(

-- 
Regards,
Doug
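
One way to test the guess that da1 alone was timing out is to count the "timed out" lines in /var/log/messages per disk and list the vinum objects that changed state. The following is a minimal sketch, not a diagnosis tool: the function name `summarize` and both regular expressions are illustrative assumptions modelled on the log lines quoted above, and other driver versions may word their messages differently. As a rough heuristic only, timeouts confined to one target point toward that disk or its cabling, while timeouts spread across several targets point toward the bus, termination, or controller.

```python
import re
from collections import Counter

# Assumed pattern for CAM timeout lines such as
# "(da1:ahc0:0:1:0): SCB 0x5a - timed out"
TIMEOUT_RE = re.compile(r"\((da\d+):(ahc\d+):\d+:\d+:\d+\): SCB 0x[0-9a-f]+ - timed out")

# Assumed pattern for vinum state lines such as
# "vinum: dataraid.p0.s1 is stale by force"
VINUM_RE = re.compile(r"vinum: (\S+) is (.+)$")


def summarize(lines):
    """Return (timeouts per (disk, controller), last reported state per vinum object)."""
    timeouts = Counter()
    vinum_states = {}
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[(m.group(1), m.group(2))] += 1
            continue
        m = VINUM_RE.search(line)
        if m:
            vinum_states[m.group(1)] = m.group(2).strip()
    return timeouts, vinum_states


# Sample lines copied from the log excerpt in this message.
sample = [
    "May 3 11:17:31 judeah /kernel: (da1:ahc0:0:1:0): SCB 0x5a - timed out",
    "May 3 11:22:01 judeah /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted",
    "May 3 11:22:01 judeah /kernel: vinum: dataraid.p0.s1 is stale by force",
    "May 3 11:22:01 judeah /kernel: vinum: dataraid.p0 is corrupt",
]

timeouts, states = summarize(sample)
for (disk, ctrl), count in sorted(timeouts.items()):
    print(f"{disk} on {ctrl}: {count} timeout(s)")
for obj, state in states.items():
    print(f"{obj}: {state}")
```

To run it against the real log, the sample list can be replaced with the lines of /var/log/messages, for example `summarize(open("/var/log/messages"))`.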