Date: Mon, 10 Aug 2009 14:31:23 -0500
From: "Hearn, Trevor" <trevor.hearn@Vanderbilt.Edu>
To: John Baldwin <jhb@freebsd.org>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: RE: UFS Filesystem issues, and the loss of my hair...
Message-ID: <8E9591D8BCB72D4C8DE0884D9A2932DC35BD34CA@ITS-HCWNEM03.ds.Vanderbilt.edu>
In-Reply-To: <200908070829.54571.jhb@freebsd.org>
References: <8E9591D8BCB72D4C8DE0884D9A2932DC35BD34C3@ITS-HCWNEM03.ds.Vanderbilt.edu>, <200908070829.54571.jhb@freebsd.org>
To the FreeBSD-FS group at large...

Well, I've spent a lot of time looking this one over... I set up a share on a webserver to put up redacted images of the errors I am getting. They are here:

http://www.trevorhearn.com/Array/IMG_0056.jpg
http://www.trevorhearn.com/Array/IMG_0061.jpg
http://www.trevorhearn.com/Array/IMG_0063.jpg
http://www.trevorhearn.com/Array/IMG_0065.jpg
http://www.trevorhearn.com/Array/IMG_0067.jpg
http://www.trevorhearn.com/Array/IMG_0069.jpg

So, while I am in a meeting about the array, oddly, I have this come rolling across the screen of the terminal session I am in...

Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=16384)]error = 5
Aug 10 10:53:43 XXXX last message repeated 20 times
Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=1638d)]error = 5
Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=16384)]error = 5
Aug 10 10:53:43 XXXX last message repeated 18 times

When I say it was rolling across the screen, I mean it did it for about 5 minutes... I was waiting for the hard-lock to happen, but the process that was touching the file(s) went to 99.02%, and has stayed there the remainder of the day...

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1351 xxxxxxxx    1  -8    0 10928K  4656K CPU1   0   2:10 99.02% smbd

This happened earlier in the morning, while we were only seeing moderate usage:

Aug 10 09:54:18 PRSA kernel: pid 1776 (smbd), uid 1194 inumber 107797529 on /xxxxxxxxxx: bad block
Aug 10 09:54:18 PRSA kernel: bad block 165436921330628865, ino 107797529

The bad block number is WAAAY outside of what is used on the machine. So....

Everything that I have found relating to these problems is someone asking, 'How do I fix this?', and NONE of those threads so far has contained a fix. 'Error = 5' relates to EIO, or an error in the input/output to a device.
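As a rough sanity check on the numbers in the log lines above, here is a minimal Python sketch. It is not a diagnosis. The three constants are copied from the kernel messages; the 3.6 TB figure is the approximate size of the largest partition mentioned in the quoted message further down, and the helper names are made up for illustration. It only shows that error 5 maps to EIO, that the READ offset is negative (so it cannot be a real position on the device), and that the reported block number exceeds the partition size even if a block were a single byte.

```python
import errno
import os

# Values copied from the kernel messages quoted above.
READ_OFFSET = -6419569950008350720   # offset in the g_vfs_done() READ lines
BAD_BLOCK = 165436921330628865       # number in the "bad block" line
ERROR_CODE = 5                       # "error = 5"

# Assumption for illustration: the largest filesystem is roughly 3.6 TB.
LARGEST_FS_BYTES = int(3.6 * 10**12)


def describe_errno(code):
    """Return (symbolic name, system message) for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def as_unsigned64(value):
    """Reinterpret a signed 64-bit integer as unsigned."""
    return value % 2**64


def is_plausible_offset(offset, device_bytes):
    """A real byte offset must lie inside the device."""
    return 0 <= offset < device_bytes


name, message = describe_errno(ERROR_CODE)
print("error", ERROR_CODE, "is", name, "-", message)
print("offset as unsigned 64-bit:", hex(as_unsigned64(READ_OFFSET)))
print("offset plausible:", is_plausible_offset(READ_OFFSET, LARGEST_FS_BYTES))
# Whatever the block size is, it is at least one byte per block.
print("bad block beyond partition:", BAD_BLOCK > LARGEST_FS_BYTES)
```

A value this far out of range suggests the block pointer itself was garbage by the time the read was issued; the sketch cannot say whether the controller, the driver, or on-disk corruption produced it.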
Now, that being said, I either have a problem with the controller in my Promise Array, which I am learning is possible, or I have an issue with a driver in FreeBSD, and just happen to have a circumstance where it will appear. There does not seem to be any rhyme or reason to what is taking place. How does a set of array controllers throw a bad block error? I mean, with a standard drive, I can see it... but an array controller? Some other things that I have found...

The link below tells about using 'find / -type d -exec stat {} \;' to run thru the filesystem and find the corrupted files. I did this earlier this morning, and found none. I went back thru several of the inodes that are showing in the pictures above, and only found one in existence. I battened down the hatches, and hit that directory. I was able to cp all of the info in that directory to another directory without a single problem. With all that I have been reading, this should have caused all manner of hell. I ran fsck on all directories, and got the server back online...

Back online? Yes. It hard-locked at 3:09 AM Sunday morning. Odd, since it has done that MANY times at 3:09 AM. I have Nagios watching the server, and it always seems to do so at the same time. I looked at cron jobs, and found that it runs PERIODIC DAILY at 3:01 AM. My Nagios box checks every 5 minutes, with three intervals of one minute afterwards if a service is not available. SO, somewhere in the list of things that the server does in the PERIODIC DAILY job, there is something that makes the server fault. Tonight, I will be going thru the jobs, running them one by one, seeing exactly which one causes the fault. I have seen others speak of it going down at 3:00 AM-ish, so I think this might be a bit of a clue.

At this point, I am purchasing another 2 port fibre channel card, with hopes of installing it in a spare 1U server I have, to migrate to Ubuntu, or similar.
I'd like to test it out with Ubuntu, but I do not know at this point if it will see the array partitions correctly, nor if it will allow me to access the UFS partitions that are there. Worst case, I will back up, and re-format the chassis themselves. I would hope that this would not be necessary, but I am almost at my wit's end.

Has ANYONE got any ideas, other than the ones presented? I'm keen to see if there is a fix, because I love FreeBSD, but I can't be an evangelist for it when it is giving me so much grief. Thanks for listening, I'll be here all week. :)

-Trevor

________________________________________
From: John Baldwin [jhb@freebsd.org]
Sent: Friday, August 07, 2009 7:29 AM
To: freebsd-fs@freebsd.org
Cc: Hearn, Trevor
Subject: Re: UFS Filesystem issues, and the loss of my hair...

On Thursday 06 August 2009 9:51:04 am Hearn, Trevor wrote:
> First off, let me state that I love FreeBSD. I've used it for years, and have not had any major problems with it... Until now.
>
> As you can tell, I work for a major university. I setup a large storage array to hold data for a project they have here. No great shakes, just some standard files and such. The fun started when I started loading users onto the system, and they started using it... Isn't that always the case? Now, I get ufs_dirbad errors, and the system hard locks. This isn't the worst thing that could happen, but when you're talking about file partitions the size that I am using, the fsck takes FOREVER. Somewhere on the order of 1.5 hours. During that time, I am bringing the individual shares/partitions online, but the users suffer. I've asked about this before, in a different forum, but got no usable information that I could see. So, here goes...
>
> The system is as such. A dell 2950 1U server, with a Qlogic Fibre Channel card. It is connected to two Promise Array chassis, 610 series, each with 16 drives. Each chassis is running RAID 6, which gives me about 12.73tb of storage per chassis.
> From there, the logical drives are sliced up into smaller partitions. At most, I have a 3.6tb partition. The smallest is a 100gig partition.
>
> Filesystem      Size   Used  Avail  Capacity  Mounted on
> /dev/mfid0s1a   197G    10G   170G        6%  /
> devfs           1.0K   1.0K     0B      100%  /dev
> /dev/da0p1      1.8T   1.5T   130G       92%  /slice1
> /dev/da0p5      2.7T   1.8T   661G       74%  /slice2
> /dev/da0p9      250G    21G   209G        9%  /slice3
> /dev/da1p3      103G    12G    83G       12%  /slice4
> /dev/da1p4      205G    54G   135G       29%  /slice5
> /dev/da1p5      103G   7.3G    87G        8%  /slice6
> /dev/da1p6      103G    22G    72G       23%  /slice7
> etc...
>
> I had to use GPT to setup the partitions, and they are using UFS2 for the filesystem. Now... If that's not fun enough... I have TWO of these creatures, which RSYNC every 4 hours. The secondary system is across campus, and sits idle 99% of the time. Every 4 hours, in a stepped schedule, the primary array syncs to the secondary array. If the primary goes down, I FSCK, and any files that are fried, I bring back across from the secondary and replace them. This has worked OK for a while, but now I am getting Kernel Panics on a regular basis. I've been told to migrate to a different filesystem, but my options are ZFS and using GJOURNAL with UFS, from what I can tell. I need something repeatable, simple, and I need something robust. I have NO idea why I keep getting errors like this, but I imagine it's a cascading effect of other hangs that have caused more corruption.
>
> I'd buy a fella, or gal, a cup of coffee and a pop-tart if they could help a brother out. I have checked out this link:
> http://phaq.phunsites.net/2007/07/01/ufs_dirbad-panic-with-mangled-entries-in-ufs/
> and decided that I need to give this a shot after hours, but being the kinda guy I am, I need to make sure I am covering all of my bases.

Are you seeing ufs_dirbad panics? Specifically, can you capture the messages on the console when the machine panics?

--
John Baldwin