Date: Thu, 24 Jun 2010 17:22:41 -0500
From: Adam Vande More <amvandemore@gmail.com>
To: Matthew Lear <matt@bubblegen.co.uk>, Jeremy Chadwick <freebsd@jdc.parodius.com>, freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID: <AANLkTimo1Vb461DHw3ZXNwK5BxDcgzKSkdxc3Dnqizge@mail.gmail.com>
In-Reply-To: <1277417182.1874.30.camel@almscliff.bubblegen.co.uk>
References: <1276844904.7519.19.camel@almscliff.bubblegen.co.uk>
 <20100618082127.GA34578@icarus.home.lan>
 <1276876031.7519.39.camel@almscliff.bubblegen.co.uk>
 <20100618174208.GA47470@icarus.home.lan>
 <1276889330.2210.44.camel@almscliff.bubblegen.co.uk>
 <1277155992.1860.3.camel@almscliff.bubblegen.co.uk>
 <20100622074541.GA71157@icarus.home.lan>
 <82A96ECD-676C-4A4D-A328-0CFAABD64D50@gid.co.uk>
 <1277401934.1874.12.camel@almscliff.bubblegen.co.uk>
 <20100624181535.GA58443@icarus.home.lan>
 <1277417182.1874.30.camel@almscliff.bubblegen.co.uk>

Haven't followed the entire thread, but wanted to point out something
important to remember: SMART is not a reliable indicator of failure.
It's certainly better than listening to the drive, but it picks up fewer
than half of drive failures. Google released a study of the disks in
their data centers a few years ago that took a fairly in-depth look at
drive failure rates. You might find it interesting.

On 6/24/10, Matthew Lear <matt@bubblegen.co.uk> wrote:
> On Thu, 2010-06-24 at 11:15 -0700, Jeremy Chadwick wrote:
>> On Thu, Jun 24, 2010 at 06:52:14PM +0100, Matthew Lear wrote:
>> > On Tue, 2010-06-22 at 20:04 +0100, Bob Bishop wrote:
>> > > Hi,
>> > >
>> > > On 22 Jun 2010, at 08:45, Jeremy Chadwick wrote:
>> > >
>> > > > On Mon, Jun 21, 2010 at 10:33:12PM +0100, Matthew Lear wrote:
>> > > >> [tale of woe elided]
>> > > >
>> > > > I don't really have any other thoughts on the matter, sadly.
>> > > > [helpful suggestions elided]
>> > > >
>> > > > Anyone else have ideas/recommendations?
>> > >
>> > > The disks sure look OK. I wouldn't rule out the controller(s), I've
>> > > had various chipsets fail in odd ways.
>> > >
>> >
>> > Thanks Bob. I think we all thought the same.
>> > I've actually just rebooted the machine and FreeBSD no longer boots.
>> > This isn't what I was expecting at all. Something has clearly gone wrong
>> > with some file system metadata.
>> >
>> > When I commissioned the machine I installed an 'early' bootloader
>> > (apologies for perhaps using an incorrect term) which boots FreeBSD by
>> > default (F1 option) or from Drive 1 (F5). Drive 1 is the DVD drive.
>>
>> I believe this is the boot0 stage of the FreeBSD bootstrap process,
>> otherwise known as "BootMgr" during the OS installation. I tend to
>> avoid this and pick "Standard" instead, which lets the system boot right
>> into boot2/loader.
>>
>> > It appears to be the case that the early bootloader tries to boot
>> > FreeBSD and fails.
>> > I get the messages:
>> >
>> > error 1 lba 795079
>> > Invalid format
>> >
>> > FreeBSD/i386 boot
>> > Default: 0:ad(0,a)/boot/kernel/kernel
>> > boot:
>> > error 1 lba 786815
>> > No /boot/kernel/kernel
>> >
>> > FreeBSD/i386 boot
>> > Default: 0:ad(0,a)/boot/kernel/kernel
>> > boot:
>> >
>> > ...and I'm at a boot prompt.
>>
>> You're at the boot0 stage. The bootstrap stage looks wrong: this should
>> be 0:ad(0,a)/boot/loader, not /boot/kernel/kernel. You should load the
>> kernel from boot2/loader, not boot0.
>>
>> After you powered off the system, did you physically remove the ad0
>> disk, or is it still in the system?
>>
>
> It's still in the system. Given that the disk is ok relative to SMART, I
> was of the [probably naive] assumption that I'd be able to boot up
> normally, access the array on ar0, re-sync the array and carry on as
> normal monitoring any further errors.
>
>> I would recommend taking ad0 out of the picture (power down the machine
>> and physically unplug it), and make sure your BIOS is set to boot from
>> the first hard disk *and* the 2nd hard disk. "Hard disk" in this
>> context means "any disk that's part of the RAID-1 array". You want to
>> make sure your other disks (whatever that thing is on ata0-slave, and
>> the backup disk you have on ad1) *are not* bootable from the BIOS. If
>> they've ever been used as bootable disks in the past, then you should
>> have cleared the MBR on them to ensure they couldn't be booted by the
>> BIOS.
>
> Understood.
>
>>
>> What I'm documenting here is the need to make sure that you don't boot
>> the wrong device/disk. I'm talking about what the *BIOS* boots, not the
>> FreeBSD boot0 bootstrap.
>>
>> You should keep the 2nd disk in the RAID-1 mirror connected to its
>> current SATA port; do not move it to what ad0 was connected to.
>>
>> > So, given that ad0 was the failed disk, the bootloader has failed to
>> > find specific boot data on ad0 and dropped me into a boot prompt.
>>
>> Actually, it's reporting an I/O error at a specific LBA, indicating it
>> either can't load the kernel.
>>
>> > I'm tempted to replace the boot line with 0:ad(2,a)/boot/kernel/kernel
>> > or should that be 2:ad(0,a)/boot/kernel/kernel but I'm a little
>> > suspicious of doing anything at this point?
>>
>> I believe you want 0:ad(2,a)/boot/loader, but you'll have to enter this
>> every time unless you follow what I wrote above (re: BIOS disk boot
>> order).
>
> Again, all understood. I gave this a whirl and saw several ad0 timeout
> messages at various LBA, the system boot up hung and dropped me into
> single user mode. atacontrol list showed no devices attached to channel
> 0 which I thought was rather odd. I've no idea if this is indicative of
> a hw failure or not. Further investigation is required.
>
>> > Can anybody offer any guidance of what I can do to restore my system? I
>> > was able to shut down the machine cleanly (shutdown -p now) and despite
>> > the RAID mirror going offline, everything seemed to be behaving normally
>> > (expected I guess given that I just lost some redundancy).
>> >
>> > I'm just that little bit more worried now :-( If the disks are ok, what
>> > on earth could have happened and more importantly, how can I restore
>> > what was an operational system when I shut it down?!
>>
>> At this point you need to make a judgement call: which are you going to
>> spend more time doing: a) futzing around with this weird situation, or
>> b) reinstalling everything and restoring data from backups?
>>
>> If I was in your shoes at this point, I'd probably choose (b) and go
>> with installing 8.1-RC1 using gmirror for the RAID-1 capability.
>
> That's probably fair enough but I'm of the opinion that I'd like to know
> what has happened (or rather what FreeBSD has done) to my machine.
> Given that the apparently faulty disk is not faulty, something (or
> probably more accurately, the OS) has written some absolute LBA values
> to disk with the intent of accessing these. Yes the disk has indicated
> that there is an error but as to why, well that's the question :-)
>
> IMO it's all fine and well saying upgrade to the next stable release but
> that's not actually finding the cause and trying to resolve the problem
> in a sensible manner. I'm fortunate enough that I can easily handle a
> bit of down time on the machine. You're absolutely right in saying that
> the set up should have been tested prior to commissioning. I agree
> completely. However, it's a server that I run at home, I'm not an IT
> admin, I don't mind getting my hands dirty and do try to learn from
> experience - hopefully! :-)
>
>> There isn't much else I can say about the issue, other than that proper
>> failure testing may have caught this before it was too late. If there's
>> anything positive to take away from this experience, it's that. :-)
>>
>
> Absolutely.
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
>

--
Sent from my mobile device

Adam Vande More