Date:      Tue, 01 May 2012 00:42:25 +0200
From:      Jerome Herman <jherman@dichotomia.fr>
To:        freebsd-questions@freebsd.org
Cc:        lists@eitanadler.com
Subject:   Re: UFS Crash and directories now missing
Message-ID:  <4F9F1551.5040003@dichotomia.fr>
In-Reply-To: <CAF6rxgksNAC2PguE6jzPtBauNZig9VWG--UmTt_fGVB7PytonA@mail.gmail.com>
References:  <201204301136.q3UBa8fj083478@mail.r-bonomi.com> <CAF6rxgksNAC2PguE6jzPtBauNZig9VWG--UmTt_fGVB7PytonA@mail.gmail.com>

On 30/04/2012 19:23, Eitan Adler wrote:
> On 30 April 2012 07:36, Robert Bonomi<bonomi@mail.r-bonomi.com>  wrote:
>> A competennt, "not stupid", sysadmin would know these things.  And not
>> 'remove all doubt' (in the words of Abraham Lincoln), by raising such
>> nonsense questions.
> A competent sysadmin would ask questions when they don't know the
> answer bringing up possibilities they thought about.
> A stupid sysadmin would yell at someone asking a question claiming
> they should have known the answer.
>
I must admit that Robert Bonomi's tone was highly insulting for this list,
and though I completely condemn the form of his post, I cannot say I 
disagree with the content.

There are quite a lot of things that are wrong with Alejandro Imass' 
post and analysis.
The first thing is that he did not give his setup in one go. It took quite
a while to figure out what happened, what system he was using and how he
was using it.
At first he had to hard reboot an unresponsive system, then at reboot he
had apparently lost all of his jails.
Then it appeared that all the jails were inside another jail and that
the unresponsiveness came from MySQL.
Then we learn that all his daemons are inside jails.
Then we learn that ftp-proxy is not.
Then we learn that the jails are not handled manually but through EZJail.
Then we are told that the problem with MySQL is known and comes from a
client using TigerCRM with too much data.
There are literally dozens of little pieces of important knowledge all
over the thread. And you have to read it all to make sure you have the 
global view. Not really a good start.
It is OK to forget to mention a thing or two, discarding what you think 
is irrelevant to the problem at hand, but it is not OK to force people 
who are trying to help you to read 50+ posts to learn about the basics 
of your installation.

What is even more irritating is the fact that Alejandro Imass ignores 
pretty much anything that would lead toward a human mistake. Most posts
implying a possible bad use of jails/nullfs/ezjail are ignored or 
answered by a simple "I have done everything by the book".  Now from my 
experience someone with 6 servers, each containing multiple jails will 
not do everything by the book every time. It might be that Alejandro is 
exceptional, but it is more likely that at least one if not more of 
these jails were not made "by the book". Nothing to blame anyone in 
here, we all get tired/bored/overconfident sometimes - but refusing to
admit the very possibility of a human mistake won't help at all in 
finding a solution. Reading the thread I realized that my suggestion 
that he might have over-used "ln" had been discarded as "stupid", but 
the information came a lot later in answer to another post. Of course in 
the mean time I learned that he was using ezjail, which, if I had known 
earlier, would have made me wonder if he had not overused nullfs or ln. 
He furthermore discarded the possibility saying that he did not think 
that ezjail was using links, just nullfs. Well, too bad: ezjail is
massively using links, at least for basejail, and sometimes for ports
trees or Perl setups, depending on which guide you are using as your
reference.
During the thread he pretty much bashed anyone who tried to tell him 
that no amount of jail/ezjail/nullfs/journal screw up could have 
resulted in the entire content of the jails being moved into another 
completely unrelated directory node.  If one jail had moved it would
already have been extraordinary, with the odds of it happening so
cleanly that fsck would find nothing already orders of magnitude worse
than the odds of winning the national lottery. But all of them ? Not a
chance. He finally admitted that he had very little knowledge about UFS 
and fsck, but still managed to do it in a quite offensive way.

That was basically the point where I decided to stop trying to help him.
I think others felt the same. This problem is quite interesting in
itself, and I think a lot of the most talented people on this list would 
have been on it but were repelled by the attitude.

On the other hand Alejandro Imass pretty much jumped on anything that 
would be a third party interaction. From someone hacking into his box to 
a potential nullfs bug that might result in a PR.

Now the thing is that EZJail makes use of the "system immutable flag"
quite a lot for its config file, resulting in quite a lot of files being
impossible to delete or move unless the box is running at
kern.securelevel 0. This renders the whole "jails moved on their own"
theory even more improbable.
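
For anyone who wants to check which files actually carry that flag, here
is a minimal sketch in Python. It assumes a BSD-style stat result that
exposes st_flags (on systems without it, the sketch simply reports
nothing), and the function names are mine, not part of ezjail or any
FreeBSD tool.

```python
import os
import stat


def has_system_immutable(st):
    """Return True if a stat result carries the system immutable (schg) flag.

    st_flags only exists on BSD-like systems; a missing attribute is
    treated as "no flags set".
    """
    return bool(getattr(st, "st_flags", 0) & stat.SF_IMMUTABLE)


def list_system_immutable(top):
    """Walk `top` and return the paths that have the schg flag set."""
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # entry vanished or is unreadable, skip it
            if has_system_immutable(st):
                found.append(path)
    return found
```

Calling list_system_immutable() on a jail root (the path is whatever your
own setup uses) gives the list of entries that could not have been moved
or deleted without first clearing the flag.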

After so much ranting, I would feel bad not to try to help a little :
Here are the facts :
- In a jail, MySQL was grabbing all the CPU and making the box
unresponsive. This is due to TigerCRM making requests to an overly large
database.
         -> The jail was working
         -> Unless all the data were in memory at this time
(improbable), it means that access path/nullfs/EZJail were OK at this time.

- After a forced reboot all the jails were gone, or more exactly moved
inside another jail. fsck saw no error on the disk.
         -> The disk was in a stable state at reboot, the directory and 
file structure was consistent.

- Jails contained in the apache jail were in an OK state and could be
archived and restored
         -> The data structure of the hard drive was clean, and files 
contents were OK.

From all this here is what we can safely assume :
a) The box was not hacked, or at least the hacker did not move the jails
around, this is confirmed by MySQL working and doing enough I/O to stall
the box from inside a jail that was later seen as moved.
b) The hard reboot did not cause a problem, it revealed it. Since both
fsck runs went fine and the data were preserved, we can pretty safely
assume that there was no data or system corruption caused by the hard
reboot.

Things to investigate :
- When was the last time this box was rebooted normally ? Did it go
fine ? Were the jails created at this time ?
- What happens if you deactivate the jail that "survived" and reboot
normally, would the other jails contained in it start ? What if you
deactivate the jail but leave the nullfs mapping on and try to restart
EZJail ? Do the other jails start ?
- What is the content of the different fstab.* and of the EZJail conf ?
Does any of it point inside the jail that survived the reboot ?
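
The last check can be roughly automated. The sketch below, in Python,
takes the text of one fstab file and the root directory of the surviving
jail, and returns the entries whose device or mount point lies inside
that root. The function name and the paths in the usage note are
hypothetical, made up for illustration; put in the ones from your own
box.

```python
import os


def fstab_entries_inside(fstab_text, jail_root):
    """Return (device, mountpoint) pairs that point inside jail_root.

    Only absolute paths are compared, so pseudo devices such as
    "devfs" or "proc" in the first column are ignored.
    """
    root = os.path.normpath(jail_root)
    hits = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line, skip it
        device, mountpoint = fields[0], fields[1]
        for path in (device, mountpoint):
            if not path.startswith("/"):
                continue
            norm = os.path.normpath(path)
            if norm == root or norm.startswith(root + os.sep):
                hits.append((device, mountpoint))
                break
    return hits
```

Feed it the content of each fstab.* file in turn, with something like
"/usr/jails/apache" (a made-up path) as jail_root. Any hit in the fstab
of a jail other than the surviving one is worth a closer look.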

Unfortunately, since the server was "corrected", we probably won't
have a satisfying answer. But honestly the probability of a system bug
is really low. Very likely the "moved" jails were inside the surviving 
jail from the beginning, and a mix of nullfs remap and lack of reboot 
masked this fact for a while.



