From owner-freebsd-questions@FreeBSD.ORG Sun Jan 17 02:36:18 2010
To: freebsd-questions@freebsd.org
From: Michael Powell
Followup-To: gmane.os.freebsd.questions
Date: Sat, 16 Jan 2010 21:35:47 -0500
References: <322efb7b1001161715s47de3bcdqd40e6efabaf57a9b@mail.gmail.com>
Subject: Re: Errors on UFS Partitions

The-IRC FreeBSD wrote:

> Hi,
>
> I am sorry if I am asking a question that might have been brought up
> before. I have attempted to research my issue, but it has many angles it
> might be listed under, so please bear with
> me.
>
> We have had ongoing problems with UFS errors on our root partition (and
> any additional partition that did not have soft-updates enabled by
> default). We recently had a problem with a secondary drive that housed
> home directories: it filled up completely, and then everything locked up
> due to huge CPU and memory usage because nothing was able to write to the
> drive. When the server was rebooted, it failed to boot because of
> critical errors on the root partition.

A healthy system does not get UFS errors during normal operation.

> We have /etc and /usr on the root partition and our home/var partitions
> mistakenly do not have the soft-updates flag set.
>
> ::dmesg::
> http://the-irc.com/dmesg
>
> ::mount::
> /dev/ad4s1a on / (ufs, local)
> devfs on /dev (devfs, local, multilabel)
> /dev/ad4s1d on /home (ufs, local, with quotas)
> /dev/ad4s1e on /tmp (ufs, local, noexec, nosuid, soft-updates)
> /dev/ad4s1f on /var (ufs, local)
> devfs on /var/named/dev (devfs, local, multilabel)
> procfs on /proc (procfs, local)
> /dev/ad0s1e on /Backups (ufs, local, soft-updates)
> /dev/ad0s1d on /root (ufs, local, soft-updates)

[snip]

> To prevent letting these errors go out of control and not be able to fix
> the root partition errors without going into single-user mode and the
> other partitions by mounting them with soft-updates flag, does anyone
> advise removing everything from the root partition and only leaving the
> bootloader and thus moving /etc and /usr (or most of all just /usr) to
> its own partition, or do you guys have a better solution?

No. Proceeding in directions such as this is a waste of time.

> Every partition gets errors over time, but if you are unable to correct
> them without downtime, how are you to correct them before they get out of
> control?

Probably by not looking for a software solution to a hardware problem. It
is not normal for a file system to behave as you describe.
Moving partitions around and other such avenues of approach are doomed to
failure, as they do not address the underlying problem.

Real server hardware with a sophisticated ECC subsystem usually has some
BIOS counters which you can check for stats on memory errors. Hard drives
fail the most often, but either bad memory or a bad drive controller can
readily corrupt data. If you have a RAID controller with a RAM cache, that
RAM could be defective.

Hardware failure is going to mean downtime. But I'd be looking for a
hardware problem, get it fixed, then worry about how to proceed. If you
have decent backups from before the system was corrupted, you can get back
to where you need to be in relatively short order. Not fixing a hardware
defect will result in you never getting your server back to normal
operation.

-Mike
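
For anyone reading the quoted mount listing, here is a rough Python sketch that picks out the UFS filesystems mounted without soft-updates. It is only an illustration and does nothing to diagnose the hardware problem discussed above. The function name `ufs_without_soft_updates`, the regular expression, and the `SAMPLE` string (copied from the quoted listing) are assumptions made for this example, and the parser assumes mount points contain no spaces.

```python
import re

# Matches lines such as "/dev/ad4s1a on / (ufs, local)".
# Assumption: the mount point contains no whitespace.
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) \(([^)]*)\)$")


def ufs_without_soft_updates(mount_output):
    """Return mount points of UFS filesystems lacking the soft-updates flag."""
    result = []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        _device, mount_point, opts = match.groups()
        options = [o.strip() for o in opts.split(",")]
        # The first option is the filesystem type (ufs, devfs, procfs, ...).
        if options[0] == "ufs" and "soft-updates" not in options:
            result.append(mount_point)
    return result


# Sample taken from the mount listing quoted in this thread.
SAMPLE = """\
/dev/ad4s1a on / (ufs, local)
devfs on /dev (devfs, local, multilabel)
/dev/ad4s1d on /home (ufs, local, with quotas)
/dev/ad4s1e on /tmp (ufs, local, noexec, nosuid, soft-updates)
/dev/ad4s1f on /var (ufs, local)
devfs on /var/named/dev (devfs, local, multilabel)
procfs on /proc (procfs, local)
/dev/ad0s1e on /Backups (ufs, local, soft-updates)
/dev/ad0s1d on /root (ufs, local, soft-updates)
"""

if __name__ == "__main__":
    print(ufs_without_soft_updates(SAMPLE))
```

On the quoted listing this reports /, /home and /var, which matches the poster's statement that root, home and var lack the flag.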