From owner-freebsd-questions@FreeBSD.ORG Fri Jun 27 20:53:32 2003
Return-Path:
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id E688137B401
    for ; Fri, 27 Jun 2003 20:53:31 -0700 (PDT)
Received: from mta1.adelphia.net (mta1.adelphia.net [64.8.50.175])
    by mx1.FreeBSD.org (Postfix) with ESMTP id 268BE44017
    for ; Fri, 27 Jun 2003 20:53:31 -0700 (PDT)
    (envelope-from wmoran@potentialtech.com)
Received: from potentialtech.com ([24.53.179.151]) by mta1.adelphia.net
    (InterMail vM.5.01.05.32 201-253-122-126-132-20030307) with ESMTP
    id <20030628035703.RKHE25556.mta1.adelphia.net@potentialtech.com>;
    Fri, 27 Jun 2003 23:57:03 -0400
Message-ID: <3EFD113A.3060402@potentialtech.com>
Date: Fri, 27 Jun 2003 23:53:30 -0400
From: Bill Moran
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.3) Gecko/20030429
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: John Ekins
References: <20030627220033.5586e86b.john.ekins@brightview.com>
In-Reply-To: <20030627220033.5586e86b.john.ekins@brightview.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
cc: questions@freebsd.org
Subject: Re: Softupdates: df, du, sync and fsck [quite long]
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: User questions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 28 Jun 2003 03:53:32 -0000

John Ekins wrote:
> Hello,
>
> I've a couple of questions about soft updates. I've Googled heavily on this but
> not really found a satisfactory answer. The story:
>
> I'm running on numerous FreeBSD 4.7 SMP machines as primary MX machines. The mail
> is not stored on the FreeBSD machines but on NetApps via NFS. However the mail is
> temporarily spooled on the FreeBSD machines during normal MTA handling and passing
> to an anti-virus scanner.
> I have one large partition /var on each machine where
> basically all the work and temporary/transient files for the MTA and AV scanner
> takes place.
>
> These machines are heavily utilised, running quite "hot" with a load average of
> anything from 2 to 8. Many thousands of temporary files are thus created and
> deleted a minute. I have no problem with this as nearly all email is delivered in
> under 1 minute whatever.
>
> I notice that after a while the amount of free space as shown by df considerably
> varies from a du on /var. I'm aware of why this happens with soft updates, but
> that's not the whole story. If I turn off incoming email on a machine, the space
> does not seem to sync back to what it should be. No matter how long I turn off
> the MTA, the space is simply not returned, and df/du show differences of about
> 5:1. Nothing else is writing/holding open files on that partition (even turned
> off syslog, cron, etc. and checked using lsof). In comparison, if, for example, on
> my normal desktop machine I create a 500MB file, then delete it, the space shortly
> afterwards is returned to me when I run df. The only way I've been able to recover
> this space to what it should be is to reboot the machine.

I don't know what's wrong, but does unmounting and remounting the partition
reclaim the lost space?

> As an example, here is a snippet from the console from when I rebooted an affected
> machine:
>
> boot() called on cpu#2
> Waiting (max 60 seconds) for system process `vnlru' to stop...stopped
> Waiting (max 60 seconds) for system process `bufdaemon' to stop...stopped
> Waiting (max 60 seconds) for system process `syncer' to stop...timed out
>
> syncing disks... 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22
> giving up on 22 buffers
> Uptime: 27d23h1m27s
> Rebooting...
>
> As you can see the file system is unable to sync. When the machine reboots it
> literally takes hours to fsck the /var partition (only about 15GB).
> And the fsck
> output is full of messages like this:
>
> UNEXPECTED SOFT UPDATE INCONSISTENCY

Well, this sure isn't good.

> Now, is there a problem here with soft updates "losing track" of what is going on
> on this busy partition? It would appear to be so as quietening the machine does
> not lead to a proper sync. Secondly, why does the fsck take such an inordinate
> amount of time for a smallish partition?

If there's a LOT of inodes with problems, it could easily take a while to fix.
Also, if you run fsck without specifying a filesystem to fix, it exhaustively
checks all filesystems. So even if the problem is on /var, it might spend a
long time checking /usr as well. You can work around this by calling fsck
with the filesystem to check.

> I really like the performance benefits of soft updates, but it seems that I'm
> going to have to turn it off on /var because of the problems that eventually
> occur.

If these are production boxes, I'd recommend turning it off until you resolve
the problem.

> If anyone has some advice I'd be grateful.

I don't know if this would qualify as "advice", but since nobody else seems to
have any suggestions, I figured I'd throw my thoughts in.

Are you using ATA or SCSI drives?
Does issuing a manual "sync" once you've stopped the spooling process help any?
Are these all identical mobos ... possibly a BIOS update available?
These aren't IBM ATA drives are they? I had one of those give me grief for
months (if you look in the archives, you should be able to find details on
which drives caused problems).
Have you tried updating one of the machines to 4.8 to see if the problem has
been fixed?

Like I said, not good advice, just some ideas for you.

-- 
Bill Moran
Potential Technologies
http://www.potentialtech.com