From: Jerome Herman
Date: Tue, 01 May 2012 00:42:25 +0200
To: freebsd-questions@freebsd.org
Cc: lists@eitanadler.com
Subject: Re: UFS Crash and directories now missing
Message-ID: <4F9F1551.5040003@dichotomia.fr>
References: <201204301136.q3UBa8fj083478@mail.r-bonomi.com>

On 30/04/2012 19:23, Eitan Adler wrote:
> On 30 April 2012 07:36, Robert Bonomi wrote:
>> A competennt, "not stupid", sysadmin would know these things. And not
>> 'remove all doubt' (in the words of Abraham Lincoln), by raising such
>> nonsense questions.
> A competent sysadmin would ask questions when they don't know the
> answer bringing up possibilities they thought about.
> A stupid sysadmin would yell at someone asking a question claiming
> they should have known the answer.
>

I must admit that Robert Bonomi's tone was highly insulting for this list, and though I completely condemn the form of his post, I cannot say I disagree with the content.

There are quite a lot of things wrong with Alejandro Imass' post and analysis. The first is that he did not give his setup in one go. It took quite a while to figure out what happened, what system he was using and how he was using it. At first he had to hard reboot an unresponsive system, and at reboot he had reportedly lost all of his jails. Then it appeared that all the jails were inside another jail and that the unresponsiveness came from MySQL. Then we learned that all his daemons are inside jails. Then we learned that ftp-proxy is not. Then we learned that the jails are not handled manually but through EZJail. Then we were told that the problem with MySQL is known and comes from a client using TigerCRM with too much data.

There are literally dozens of little pieces of important knowledge scattered all over the thread, and you have to read all of it to make sure you have the global view. Not really a good start. It is OK to forget to mention a thing or two, or to discard what you think is irrelevant to the problem at hand, but it is not OK to force people who are trying to help you to read 50+ posts to learn the basics of your installation.

What is even more irritating is that Alejandro Imass ignores pretty much anything that would lead toward a human mistake. Most posts implying a possible misuse of jails/nullfs/ezjail are ignored or answered with a simple "I have done everything by the book". Now, from my experience, someone with 6 servers, each containing multiple jails, will not do everything by the book every time. It might be that Alejandro is exceptional, but it is more likely that at least one of these jails, if not more, was not made "by the book".
Nothing to blame anyone for here; we all get tired/bored/overconfident sometimes. But refusing to admit the very possibility of a human mistake won't help at all in finding a solution.

Reading the thread I realized that my suggestion that he might have over-used "ln" had been discarded as "stupid", but that information came a lot later, in answer to another post. Of course, in the meantime I learned that he was using ezjail, which, had I known it earlier, would have made me wonder whether he had overused nullfs or ln. He furthermore dismissed the possibility, saying that he did not think ezjail was using links, just nullfs. Well, too bad: ezjail uses links massively, at least for basejail, and sometimes for ports trees or Perl setups, depending on which guide you use as your reference.

During the thread he pretty much bashed anyone who tried to tell him that no amount of jail/ezjail/nullfs/journal screw-up could have resulted in the entire content of the jails being moved into another, completely unrelated directory node. If one jail had moved it would already have been extraordinary, with the odds of it happening so cleanly that fsck would find nothing already orders of magnitude worse than those of winning the national lottery. But all of them? Not a chance.

He finally admitted that he had very little knowledge of UFS and fsck, but still managed to do it in a quite offensive way. That was basically the point where I decided to stop trying to help him. I think others felt the same. This problem is quite interesting in itself, and I think a lot of the most talented people on this list would have been on it but were repelled by the attitude.

On the other hand, Alejandro Imass pretty much jumped on anything that pointed to a third-party cause, from someone hacking into his box to a potential nullfs bug that might result in a PR.
Now the thing is that EZJail makes heavy use of the system immutable flag for its config files, resulting in quite a lot of files being impossible to delete or move unless the box is running at kern.securelevel 0. This renders the whole "jails moved on their own" theory even more improbable.

After so much ranting, I would feel bad not to try to help a little.

Here are the facts:
- In a jail, MySQL was grabbing all the CPU and making the box unresponsive. This is due to TigerCRM making requests to too huge a database.
  -> The jail was working.
  -> Unless all the data were in memory at this time (improbable), it means that the access paths/nullfs/EZJail were OK at this time.
- After a forced reboot all the jails were gone, or more exactly moved inside another jail. fsck saw no error on the disk.
  -> The disk was in a stable state at reboot; the directory and file structure was consistent.
- Jails contained in the apache jail were in an OK state and could be archived and restored.
  -> The data structure of the hard drive was clean, and file contents were OK.

From all this, here is what we can safely assume:
a) The box was not hacked, or at least the hacker did not move the jails around. This is confirmed by MySQL working and doing enough I/O to stall the box from inside a jail that was later seen as moved.
b) The hard reboot did not cause the problem, it revealed it. Since fsck ran fine and the data were preserved, we can pretty safely assume that there was no data or system corruption caused by the hard reboot.

Things to investigate:
- When was the last time this box was rebooted normally? Did it go fine? Had the jails already been created at that time?
- What happens if you deactivate the jail that "survived" and reboot normally: do the other jails contained in it start? And if you deactivate the jail but leave the nullfs mapping on and restart EZJail, do the other jails start?
- What is the content of the different fstab.* and of the EZJail conf?
  Does any of it point inside the jail that survived the reboot?

Unfortunately, since the server was "corrected", we probably won't have a satisfying answer. But honestly, the probability of a system bug is really low. Very likely the "moved" jails were inside the surviving jail from the beginning, and a mix of nullfs remapping and a lack of reboots masked this fact for a while.
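
For what it is worth, the fstab check from the "things to investigate" list could be automated with something like the rough Python sketch below. It is only an illustration under assumptions: the function names are mine, the sample paths are invented, and it only looks at fstab-style text (device, mount point, type), flagging nullfs entries whose source directory lives inside a given jail root.

```python
from pathlib import PurePosixPath


def parse_fstab(text):
    """Yield (device, mountpoint, fstype) for each entry in fstab-style text."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 3:
            yield fields[0], fields[1], fields[2]


def is_inside(path, root):
    """True if path is root itself or lies below it (pure path comparison)."""
    path, root = PurePosixPath(path), PurePosixPath(root)
    return path == root or root in path.parents


def nullfs_sources_inside(text, jail_root):
    """Return (source, target) of nullfs entries whose source is inside jail_root."""
    return [
        (dev, mnt)
        for dev, mnt, fstype in parse_fstab(text)
        if fstype == "nullfs" and is_inside(dev, jail_root)
    ]


# Hypothetical fstab content; every path here is invented for illustration.
sample = """\
# hypothetical ezjail-style entries
/usr/jails/basejail /usr/jails/apache/basejail nullfs ro 0 0
/usr/jails/apache/usr/jails/mysql /usr/jails/mysql nullfs rw 0 0
/usr/jails/basejail /usr/jails/mail/basejail nullfs ro 0 0
"""

hits = nullfs_sources_inside(sample, "/usr/jails/apache")
for dev, mnt in hits:
    print(dev, "->", mnt)
```

On a real box one would feed it the contents of each fstab.* file in turn (wherever they live on that system) with the root of the surviving jail; any hit would mean a jail's data physically sits inside that jail and is only mapped elsewhere by nullfs.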