Date:      Mon, 5 Jul 2010 15:08:55 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Charles Sprickman <spork@bway.net>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: 7.2 - ufs2 corruption
Message-ID:  <20100705220855.GA35860@icarus.home.lan>
In-Reply-To: <alpine.OSX.2.00.1007051701020.33454@hotlap.local>
References:  <alpine.OSX.2.00.1007051701020.33454@hotlap.local>

On Mon, Jul 05, 2010 at 05:23:03PM -0400, Charles Sprickman wrote:
> Howdy,
> 
> I've posted previously about this, but I'm going to give it one more
> shot before I start reformatting and/or upgrading things.
> 
> I have a largish filesystem (1.3TB) that holds a few jails, the main
> one being a mail server.  Running 7.2/amd64 on a Dell 2970 with the
> mfi raid card, 6GB RAM, UFS2 (SU was enabled, I disabled it for
> testing to no effect)
> 
> The symptoms are as follows:
> 
> Various applications will log messages about "bad file descriptors"
> (imap, rsync backup script, quota counter):
> 
> du:
> ./cur/1271801961.M21831P98582V0000005BI08E85975_0.foo.net,S=2824:2,S:
> Bad file descriptor
> 
> The kernel also starts logging messages like this to the console:
> 
> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
> g_vfs_done():mfid0s1e[READ(offset=-7347040593908226048, length=16384)]error = 5
> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
> g_vfs_done():mfid0s1e[READ(offset=-7347040593908226048, length=16384)]error = 5
> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
> 
> Note that the offsets look a bit... suspicious, especially those
> negative ones.
> 
> Usually within a day or two of those "g_vfs_done()" messages showing
> up the box will panic shortly after the daily run.  Things are hosed
> up enough that it is unable to save a dump.  The panic always looks
> like this:
> 
> panic: ufs_dirbad: /spool: bad dir ino 151699770 at offset 163920:
> mangled entry
> cpuid = 0
> Uptime: 70d22h56m48s
> Physical memory: 6130 MB
> Dumping 811 MB: 796 780 764 748 732 716 700 684 668 652 636 620 604
> 588 572 556 540 524 508 492 476 460 444 428 412 396 380 364 348 332
> 316 300 284
> ** DUMP FAILED (ERROR 16) **
> 
> panic: ufs_dirbad: /spool: bad dir ino 150073505 at offset 150:
> mangled entry
> cpuid = 2
> Uptime: 13d22h30m21s
> Physical memory: 6130 MB
> Dumping 816 MB: 801 785 769 753 737 721 705 689
> ** DUMP FAILED (ERROR 16) **
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
> 
> The fs, specifically "/spool" (which is where the errors always
> originate), will be pretty trashed and require a manual fsck.  The
> first pass finds/fixes errors, but does not mark the fs clean.  It
> can take anywhere from 2-4 passes to get a clean fs.
> 
> The box then runs fine for a few weeks or a few months until the
> "g_vfs_done" errors start popping up, then it's a repeat.
> 
> Are there any *known* issues with either the fs or possibly the mfi
> driver in 7.2?

http://lists.freebsd.org/pipermail/freebsd-hardware/2010-May/006350.html

A reply in that thread says "the hardware runs great", so everyone's
situation is different.  That thread is also about 8.0-RELEASE, not 7.2.

There's also a good possibility you have a disk that has problems ("bit
rot" syndrome or bad cache), and I imagine it would manifest itself in
this manner (filesystem corruption).
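
As a rough check on the offsets in the quoted g_vfs_done() messages,
here is a minimal Python sketch.  It assumes the "1.3TB" filesystem is
about 1.3 * 10**12 bytes; the function names are made up for
illustration.  It only tests whether a logged offset could possibly lie
inside a device of that size, and shows the negative value reinterpreted
as an unsigned 64-bit number.

```python
# Sanity check on the offsets from the quoted g_vfs_done() messages.
# Assumption: "1.3TB" is taken as 1.3 * 10**12 bytes.
FS_SIZE_BYTES = 1_300_000_000_000

# Offsets copied from the kernel messages in the original report.
LOGGED_OFFSETS = [2456998070156636160, -7347040593908226048]


def as_unsigned64(value):
    """Reinterpret a signed 64-bit value as unsigned (two's complement)."""
    return value & 0xFFFFFFFFFFFFFFFF


def is_plausible(offset, device_size=FS_SIZE_BYTES):
    """True only if the offset falls inside a device of the given size."""
    return 0 <= offset < device_size


for off in LOGGED_OFFSETS:
    # Print the raw value, its unsigned 64-bit hex form, and the verdict.
    print(off, hex(as_unsigned64(off)), is_plausible(off))
```

Neither offset can exist on a device of that size, so the read requests
themselves were garbage when they were issued.  That would be consistent
with corrupted block pointers, whatever corrupted them, rather than with
a plain read error on a valid block.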

I would probably start by removing mfi(4) from the picture if at all
possible.  If the disks act reliably on some on-board or alternate brand
controller, then I think you've identified which piece is flaky.  You can
also use the opportunity to get SMART stats from the disks (smartctl -a)
and provide them here for review.
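
If you want to pull out the interesting counters before posting, a small
Python sketch along these lines might help.  It assumes ATA disks, for
which "smartctl -a" prints the usual vendor attribute table; SAS/SCSI
disks report in a different format that this will not parse.  The
function names and the device path are hypothetical.

```python
import subprocess


def parse_smart_attributes(text):
    """Map attribute name -> raw value from smartctl's ATA attribute table.

    Table rows look like:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value containing spaces
        # stays in one piece.
        fields = line.split(None, 9)
        if (len(fields) == 10 and fields[0].isdigit()
                and fields[2].startswith("0x")):
            attrs[fields[1]] = fields[9].strip()
    return attrs


def read_smart(device):
    """Run 'smartctl -a' on a device and return its parsed attributes."""
    # smartctl's exit status is a bitmask and is often nonzero even when
    # it produced output, so do not treat nonzero as a failure here.
    result = subprocess.run(["smartctl", "-a", device],
                            capture_output=True, text=True, check=False)
    return parse_smart_attributes(result.stdout)
```

Calling read_smart() with whatever device node the disk shows up as on
the alternate controller returns a dictionary; counters such as
Reallocated_Sector_Ct and Current_Pending_Sector are the usual first
things to look at.  The full smartctl output is still what should be
posted for review.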

It's too bad the filesystem isn't ZFS (either mirror or raidz), as this
sort of thing could be detected easily and auto-corrected, plus you
could narrow it down to a single device/drive.
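
To illustrate the idea (this is a conceptual Python sketch, not ZFS's
actual code or on-disk format): if a checksum for each block is stored
separately from the block itself, a read from a mirror can tell which
copy is damaged, hand back the good one, and rewrite the bad one.  The
device names, the function names, and the choice of SHA-256 here are
assumptions for the example.

```python
import hashlib


def checksum(data):
    """Checksum of one block (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(data).digest()


def read_with_self_heal(copies, expected):
    """Read one block from a mirror.

    copies   -- dict mapping device name -> that device's copy of the block
    expected -- checksum stored separately from the block itself

    Returns (good_data, bad_devices) and overwrites damaged copies in
    'copies' with the good data.
    """
    good = None
    bad = []
    for dev, block in copies.items():
        if checksum(bytes(block)) == expected:
            if good is None:
                good = bytes(block)
        else:
            bad.append(dev)  # this device returned corrupt data
    if good is None:
        raise IOError("no copy matches the stored checksum")
    for dev in bad:
        copies[dev] = good  # repair the damaged copy from the good one
    return good, bad
```

The list of bad devices is what lets you narrow the problem down to a
single drive; a plain UFS2 filesystem on a hardware RAID volume has no
equivalent per-block check.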

Finally, you might also try memtest86 just for fun, to see if there are
any RAM issues.  I doubt this is the problem, though, as you'd likely
see many other problems (programs crashing, etc.), but better to be
safe than sorry.  A colleague of mine recently went through the bad
RAM ordeal -- during the 4th pass one of the DRAM modules exhibited a
single bit error.  Fun times.

Sorry for going down the "it's the hardware!!!" route, but sometimes
that's the case.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



