From owner-freebsd-questions@FreeBSD.ORG Wed Nov 14 21:38:06 2007
Message-Id: <6.0.0.22.2.20071114153312.024eda70@mail.computinginnovations.com>
Date: Wed, 14 Nov 2007 15:37:41 -0600
To: "Tamouh H."
, "'Barnaby Scott'"
From: Derek Ragona
In-Reply-To: <039801c826e9$f1804550$6700a8c0@tamouh>
References: <473B0D70.7020307@waywood.co.uk>
 <6.0.0.22.2.20071114094712.024bfe38@mail.computinginnovations.com>
 <473B34C5.4030300@waywood.co.uk>
 <039801c826e9$f1804550$6700a8c0@tamouh>
Cc: freebsd-questions@freebsd.org
Subject: RE: Dell PE4600 RAID5 server failing

At 12:12 PM 11/14/2007, Tamouh H. wrote:
>
> > Derek Ragona wrote:
> > > At 09:00 AM 11/14/2007, Barnaby Scott wrote:
> > >> I suspect I already know the answer to this, which is that the
> > >> trouble I am having is nothing to do with the OS at all, but I have
> > >> to ask, because I am otherwise up against a total brick wall!
> > >>
> > >> I bought a second-hand Dell Poweredge 4600 and installed FreeBSD 6.2
> > >> earlier this year. I had it set up with RAID5 using its PERC3/DC
> > >> controller, with 7 x 73GB disks (+ 1 hot spare). So far so good, and
> > >> it worked faultlessly as a Samba server for several months.
> > >>
> > >> At the beginning of October, it went down, reporting a mismatch
> > >> between the configuration on the NVRAM and the disks. With help from
> > >> Dell support, I managed to recreate the RAID array and it worked
> > >> again for a month.
> > >>
> > >> In early November it happened again, and has kept happening since.
> > >> At one point it appeared that the backplane was faulty, so I replaced
> > >> that, but I cannot keep the server up for more than a day or so
> > >> without this 'mismatch' problem.
> > >>
> > >> What about diagnostics on the hardware you may ask? I have run all
> > >> the diagnostic tools that Dell can supply - several times - and the
> > >> server declares itself to be totally fault-free.
> > >>
> > >> My specific questions therefore:
> > >>
> > >> Is there any way at all that FreeBSD could be involved with this
> > >> problem? (I did notice for example that the Dell PERC3/DC controller
> > >> was not in the list of supported hardware - but then again, why did
> > >> it work for several months?)
> > >>
> > >> Can I use FreeBSD to tell me anything about the fault that Dell's
> > >> diagnostic tools haven't found?
> > >>
> > >> (I do hope someone might be able to help - Dell are trying to get me
> > >> to switch to a 'supported' OS!)
> > >>
> > >>
> > >> Thanks
> > >>
> > >> Barnaby Scott
> > >
> > > It doesn't sound like any OS issue as you set up the RAID outside the
> > > OS. It may be a bad drive or drive(s). Most RAID drives have RAID
> > > information written to the drives, and if this becomes unreadable you
> > > will have RAID faults.
> > >
> > > Another likely culprit is heat. Overheating drives often fail. Are
> > > you sure the temperature in the drive enclosure is OK?
> > >
> > > If you can, run diagnostics on the drives, this usually requires
> > > running these with the drives taken out of the RAID array though.
> > >
> > > -Derek
> > >
> >
> > Thanks for replying - as I said, this is a long shot trying to see if
> > there is any OS involvement.
> >
> > The drives are fine - I have used two different tools to analyse them
> > while the computer is booted from a live CD and the RAID
> > configuration cleared on the controller.
> > Besides, you would expect one drive to fail at a time, and if this
> > happened, the hot spare would surely be pressed into service.
> > Nothing like this has happened though - the controller is
> > reporting several drives (not always the same ones) failed
> > simultaneously, but when the array is re-created from the
> > disks, everything works fine. Problem is, it goes down again
> > a day or so later.
> >
> > As for heat, there is nothing being reported there and the
> > fans that cool that area are working.
> >
> > Any other ideas gratefully received!
> >
> > Barnaby Scott
>
>This is very unlikely to be OS related. But here are a few pointers:
>
>1) Check the make/model of the drives. Certain make/model SCSI drives
>had a glitch a while ago with a certain firmware that made them
>disconnect from a RAID. I had personal experience with these ones
>(Seagate U320).
>
>2) What happened in October? Did anything change hardware-, software-,
>or power-wise?
>
>3) NVRAM and disk mismatch: I'd say check the controller. Is the backup
>battery present but weak?
>
>4) Unlikely to be the source, but run a test on your physical RAM using
>MEMTEST86+ and check that the power supply is sufficient and working
>properly.
>

I've had some RAID drives disconnect and go missing; the array cleared
and rebuilt after a full power-off reboot. I believe this was due to
power issues in my area. Specifically, my line power from the utility
was running high, over 127 volts, which made over-voltage spikes
prevalent. On a couple of spikes I saw the drives disconnect. So it
could be power related.

On temperature, I would put in an external temperature probe and check
the reading from that. Some remote KVM solutions now include
temperature probes.

         -Derek