From: Alejandro Imass
To: Robert Bonomi
Cc: freebsd-questions@freebsd.org
Date: Thu, 3 May 2012 13:14:52 -0400
Subject: Re: UFS Crash and directories now missing
In-Reply-To: <201205031335.q43DZUKx025041@mail.r-bonomi.com>
References: <201205031335.q43DZUKx025041@mail.r-bonomi.com>

On Thu, May 3, 2012 at 9:35 AM, Robert Bonomi wrote:
>
> Alejandro Imass wrote:
>
> [ megasnip ]
>
>> > Things to investigate :
>> > - When was the last time this box was rebooted normally ? Did it go fine ?
>>
>> After I moved the jails to the right place I archived the jails with
>> ezjail-admin and rebooted the server several times, and everything
>> worked as expected.
>
> Rephrasing -- when was the last time _before_the_problem_was_discovered_
> that the machine was re-booted?
>

The jails moved Friday 27th, so the last reboot before that was Apr 4,
and before that Feb 29:

Feb 29 10:18:46 nune reboot: rebooted by aimass
Apr  4 19:45:03 nune reboot: rebooted by aimass
Apr 27 19:47:06 nune reboot: rebooted by aimass
Apr 28 02:03:57 nune reboot: rebooted by aimass

>> > Were the jails created at this time ?
>>
>> No. Most of these jails have been operational for over a year on this
>> server without any incidents.
>
> Clarifying the question -- were the jails created at the time of the last
> _prior_ reboot?  i.e., had the machine been re-booted successfully _after_
> the jails were installed, or was this the _first_ such reboot?
>

No, not at all. Most of these jails were created last year, but here is
the detail. cmm_php52_1 is the problematic jail with the MySQL; you will
see a recent date on its config file because I recently added a cpuset
as a band-aid to limit the jail's ability to bring down the whole
system, leaving at least a couple of CPUs free so that I can ssh in and
shut it down. There is, however, a new jail, corcaribe_php53, which was
the reason we rebooted the server on Apr 4th, just to make sure that
everything would come up OK after a reboot.
-rw-r--r--  1 root  wheel   917 Feb 16  2011 cat58base
-rw-r--r--  1 root  wheel   917 Apr 29  2011 cm_idvida
-rw-r--r--  1 root  wheel   937 Apr  3  2011 cm_website
-rw-r--r--  1 root  wheel   960 May  2 09:48 cmm_php52_1
-rw-r--r--  1 root  wheel  1037 Apr  4 20:00 corcaribe_php53
-rw-r--r--  1 root  wheel   950 Feb 16  2011 http_proxy
-rw-r--r--  1 root  wheel   917 Aug  3  2011 mcs_cat58
-rw-r--r--  1 root  wheel   917 Feb 10  2011 php52base
-rw-r--r--  1 root  wheel   917 Feb 12  2011 php53base
-rw-r--r--  1 root  wheel   877 Dec 27 20:33 pyugmao
-rw-r--r--  1 root  wheel   877 Mar 21 22:03 testbed
-rw-r--r--  1 root  wheel  1017 May 13  2011 yabarana_cat58
-rw-r--r--  1 root  wheel  1017 Feb 13  2011 yabarana_php52
-rw-r--r--  1 root  wheel  1017 Feb 13  2011 yabarana_php53

> It appears you misunderstood the 'at this time' reference -- it did not
> mean 'at the time of the incident', but 'at the time of the last prior
> reboot'.  If English is not your primary language, it is an understandable
> misread.
>
>> As I told you earlier, this server has been running for over a year
>> and we have rebooted many times.
>
> I don't believe you ever mentioned that particular point (multiple
> successful reboots after installation) before.  Repeating a prior
> question, _how_long_ before the problem showed up was the most recent
> re-boot?  (Doesn't have to be exact -- an 'order of magnitude' estimate
> [a day, a week, a month, several months] is sufficient.)
>

Apr 4th

>>                  If there are such problems they exist
>> by using the EzJail commands and I find this unlikely.
>
> What you 'find unlikely' is irrelevant.  The entire situation is 'unlikely',
> yet it happened.  So one -has- to look at unlikely things.
>

funny

>> here is the mount output if that's of any help:
>
> [ first disk, and 'fdescfs', and 'procfs' references removed, for clarity ]
>
>> /dev/ad6s1.journal on /usr/jails (ufs, asynchronous, local, gjournal)
>> /usr/jails/basejail on /usr/jails/yabarana-php53/basejail (nullfs, [...]
>
> Yes, that is a good start at useful detail.  It is, presumably, _after_
> the problem, and _after_ you had restored things to their proper places.
>

Yes.

> Is it safe to assume that you do -not- have such a 'mount' output from
> some time 'before' the problem?  ( There's no rational reason why you
> -would- have such, but _if_ it existed, and there were any differences
> between 'then' and 'now', it could be very informative.)
>

No, but from what I remember it's mostly very similar. If needed, I can
pull similar mount output from other server(s) where we run similar
set-ups and that have never failed.

> Another critical piece of information is what directories -- by full path
> name -- disappeared from 'where they were', and where -- by full path name,
> again -- did you find them, and _with_what_names_?  If everything was
> moved from the same source point to the same destination, it's not necessary
> to itemize each one, but the details of _one_ 'typical' migration are needed.
> It is also significant if there was 'anything else' in the 'where they
> belonged' directory that was -not- moved.  *OR* if there was anything else
> (something other than the '/' of a jail) there, that was _also_ moved.
>

I took a screenshot because I somehow suspected no one would believe
me. I don't know if I can attach it here, but if not I can send it to
you privately.
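
On the 'then' versus 'now' comparison of mount output raised above: a
minimal sketch, in Python, of how two saved snapshots of `mount` output
could be compared. It assumes the FreeBSD line format seen in the quoted
output ("device on mountpoint (type, options)") and mount points without
spaces. The function names and the second sample line are hypothetical,
not something taken from this server.

```python
import re

# Matches lines shaped like: "<device> on <mountpoint> (<type>, <opt>, ...)"
MOUNT_LINE = re.compile(r"^(?P<dev>\S+) on (?P<mnt>\S+) \((?P<opts>[^)]*)\)$")


def parse_mounts(text):
    """Return {mountpoint: (device, [fstype, options...])} for each parsable line."""
    mounts = {}
    for line in text.splitlines():
        m = MOUNT_LINE.match(line.strip())
        if m:
            opts = [o.strip() for o in m.group("opts").split(",")]
            mounts[m.group("mnt")] = (m.group("dev"), opts)
    return mounts


def diff_mounts(before, after):
    """Compare two snapshots; return (gone, new, changed) lists of mount points."""
    b, a = parse_mounts(before), parse_mounts(after)
    gone = sorted(set(b) - set(a))        # mounted then, missing now
    new = sorted(set(a) - set(b))         # mounted now, absent then
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return gone, new, changed


if __name__ == "__main__":
    # First line is from the thread; the second is a made-up example entry.
    then = (
        "/dev/ad6s1.journal on /usr/jails (ufs, asynchronous, local, gjournal)\n"
        "/usr/jails/basejail on /usr/jails/example/basejail (nullfs, local, read-only)\n"
    )
    now = "/dev/ad6s1.journal on /usr/jails (ufs, local, gjournal)\n"
    print(diff_mounts(then, now))
```

Saving the output of `mount` to a dated file after each planned reboot
would give such a comparison something to work with the next time.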
> "Narrative" descriptions, as previously provided, and while clear to someone
> familiar with the machine in question, are not sufficiently precise to allow
> an 'outsider' to follow the events without 'logically' replicating the setup,
> and then guessing at the meaning of any shorthands employed.
>

OK. I can provide almost any information required.

>
>
> One comment: for 'defensive' purposes it would be useful to break ad6 up
> into two slices, putting 'basejail' in its own slice.  Then, for production
> use, that slice can be mounted RO, and with the 'system immutable' flag
> set on everything in that filesystem.
>

Yes. That became somewhat clear to me from one of your posts: having
all the jails on a single 150GB slice seems like a bad idea.

Thanks! Let me know if I can provide anything else to help determine
the root cause.

-- 
Alejandro Imass
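
Regarding the 'system immutable' suggestion quoted above: a rough
sketch, in Python, of how one might audit whether everything under a
tree carries the schg flag once it has been set (on FreeBSD that is
normally done with chflags). The function names are hypothetical, and
/usr/jails/basejail is used only because it is the path shown in the
mount output; `st_flags` is reported only on BSD-derived systems, so
elsewhere every entry will be listed as lacking the flag.

```python
import os
import stat


def has_schg(flags):
    """True if the BSD 'system immutable' (schg) bit is set in st_flags."""
    return bool(flags & stat.SF_IMMUTABLE)


def missing_schg(root):
    """Walk below `root` and return paths whose st_flags lack the schg bit.

    The root directory itself is not checked, only what is beneath it.
    """
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # st_flags exists only on BSD-derived systems; absence means "no flags".
            flags = getattr(os.lstat(path), "st_flags", 0)
            if not has_schg(flags):
                missing.append(path)
    return sorted(missing)


if __name__ == "__main__":
    # Path taken from the mount output in the thread; adjust as needed.
    for path in missing_schg("/usr/jails/basejail"):
        print("no schg:", path)
```

An empty result would suggest the flag is set on everything below the
tree; it says nothing about whether the slice is actually mounted RO.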