Date: Mon, 5 Jul 2010 15:08:55 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Charles Sprickman <spork@bway.net>
Cc: freebsd-fs@freebsd.org
Subject: Re: 7.2 - ufs2 corruption
Message-ID: <20100705220855.GA35860@icarus.home.lan>
In-Reply-To: <alpine.OSX.2.00.1007051701020.33454@hotlap.local>
References: <alpine.OSX.2.00.1007051701020.33454@hotlap.local>
On Mon, Jul 05, 2010 at 05:23:03PM -0400, Charles Sprickman wrote:
> Howdy,
>
> I've posted previously about this, but I'm going to give it one more
> shot before I start reformatting and/or upgrading things.
>
> I have a largish filesystem (1.3TB) that holds a few jails, the main
> one being a mail server.  Running 7.2/amd64 on a Dell 2970 with the
> mfi raid card, 6GB RAM, UFS2 (SU was enabled, I disabled it for
> testing to no effect)
>
> The symptoms are as follows:
>
> Various applications will log messages about "bad file descriptors"
> (imap, rsync backup script, quota counter):
>
> du:
> ./cur/1271801961.M21831P98582V0000005BI08E85975_0.foo.net,S=2824:2,S:
> Bad file descriptor
>
> The kernel also starts logging messages like this to the console:
>
> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
> g_vfs_done():mfid0s1e[READ(offset=-7347040593908226048, length=16384)]error = 5
> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
> g_vfs_done():mfid0s1e[READ(offset=-7347040593908226048, length=16384)]error = 5
> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
>
> Note that the offsets look a bit... suspicious, especially those
> negative ones.
>
> Usually within a day or two of those "g_vfs_done()" messages showing
> up the box will panic shortly after the daily run.  Things are hosed
> up enough that it is unable to save a dump.
> The panic always looks like this:
>
> panic: ufs_dirbad: /spool: bad dir ino 151699770 at offset 163920:
> mangled entry
> cpuid = 0
> Uptime: 70d22h56m48s
> Physical memory: 6130 MB
> Dumping 811 MB: 796 780 764 748 732 716 700 684 668 652 636 620 604
> 588 572 556 540 524 508 492 476 460 444 428 412 396 380 364 348 332
> 316 300 284
> ** DUMP FAILED (ERROR 16) **
>
> panic: ufs_dirbad: /spool: bad dir ino 150073505 at offset 150:
> mangled entry
> cpuid = 2
> Uptime: 13d22h30m21s
> Physical memory: 6130 MB
> Dumping 816 MB: 801 785 769 753 737 721 705 689
> ** DUMP FAILED (ERROR 16) **
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
>
> The fs, specifically "/spool" (which is where the errors always
> originate), will be pretty trashed and require a manual fsck.  The
> first pass finds/fixes errors, but does not mark the fs clean.  It
> can take anywhere from 2-4 passes to get a clean fs.
>
> The box then runs fine for a few weeks or a few months until the
> "g_vfs_done" errors start popping up, then it's a repeat.
>
> Are there any *known* issues with either the fs or possibly the mfi
> driver in 7.2?

http://lists.freebsd.org/pipermail/freebsd-hardware/2010-May/006350.html

A reply in the thread indicates "the hardware runs great", so everyone's
situation is different.  That's also for 8.0-RELEASE.

There's also a good possibility you have a disk that has problems ("bit
rot" syndrome or bad cache), and I imagine it would manifest itself in
this manner (filesystem corruption).

I would probably start by removing mfi(4) from the picture if at all
possible.  If the disks act reliably on some on-board or alternate brand
controller, then I think you've identified which piece is flaky.  You
can also use the opportunity to get SMART stats from the disks
(smartctl -a) and provide them here for review.
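
As a quick sanity check on the g_vfs_done() offsets quoted above, the
sketch below compares them against the size of the filesystem.  It is
only an illustration: the device size is an assumption (the thread says
"1.3TB", taken here as 1.3 * 10**12 bytes; the real size of mfid0s1e is
not given), and the helper names are made up for this example.

```python
# Offsets copied from the g_vfs_done() messages quoted in this thread.
OFFSETS = [2456998070156636160, -7347040593908226048]

# Assumption: "1.3TB" is read as 1.3 * 10**12 bytes; the actual size of
# the device is not stated in the thread.
ASSUMED_DEVICE_BYTES = 13 * 10**11
READ_LENGTH = 16384  # length reported in the same log lines


def as_unsigned64(value):
    """Reinterpret a signed 64-bit value as unsigned (two's complement)."""
    return value & 0xFFFFFFFFFFFFFFFF


def is_plausible(offset, device_bytes=ASSUMED_DEVICE_BYTES,
                 length=READ_LENGTH):
    """Return True if [offset, offset + length) fits inside the device."""
    return 0 <= offset and offset + length <= device_bytes


for off in OFFSETS:
    print(f"{off:>21d}  unsigned={as_unsigned64(off):#018x}  "
          f"plausible={is_plausible(off)}")
```

Under that assumption neither offset can be a real location on the
device, which is consistent with the read being driven by corrupted
block pointers rather than by a failing sector at that address.  It
does not by itself say where the corruption was introduced.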

It's too bad the filesystem isn't ZFS (either mirror or raidz), as this
sort of thing could be detected easily and auto-corrected, plus you
could narrow it down to a single device/drive.

Finally, you might also try memtest86 just for fun, to see if there are
any RAM issues.  I doubt this is the problem though, as you'd likely see
many other problems (programs crashing, etc.), but better to be safe
than sorry.  A colleague of mine recently went through the bad RAM
ordeal -- during the 4th pass one of the DRAM modules exhibited a single
bit error.  Fun times.

Sorry for going down the "it's the hardware!!!" route, but sometimes
that's the case.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
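
The ZFS remark above rests on keeping a checksum for every block
separately from the data, so a mirror can tell which copy is bad and
repair it from the good one.  The toy sketch below illustrates only
that idea; it is not ZFS code, the device names are hypothetical, and
SHA-256 is used purely for convenience.

```python
import hashlib


def checksum(data):
    """Checksum kept apart from the block it describes."""
    return hashlib.sha256(data).hexdigest()


def read_with_self_heal(copies, expected):
    """Read a mirrored block and repair any copy that fails its checksum.

    copies   -- dict mapping a (hypothetical) device name to its copy
    expected -- checksum recorded when the block was written
    Returns (good_data, list_of_devices_that_held_a_bad_copy).
    """
    good = None
    bad_devices = []
    for device, data in copies.items():
        if checksum(data) == expected:
            if good is None:
                good = data
        else:
            bad_devices.append(device)
    if good is None:
        raise IOError("no copy matches the stored checksum")
    for device in bad_devices:
        copies[device] = good  # rewrite the damaged copy from the good one
    return good, bad_devices


block = b"mail spool block"
stored = checksum(block)
# Simulate silent corruption of one side of the mirror.
mirror = {"disk0": block, "disk1": b"mail spooL block"}
data, bad = read_with_self_heal(mirror, stored)
print("bad copies found on:", bad)
```

A filesystem without per-block checksums has no equivalent of the
`expected` value, so it cannot tell which side of a mirror is wrong,
or that anything is wrong at all until the metadata stops making sense.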