From owner-freebsd-questions@FreeBSD.ORG Sat Apr 28 15:39:18 2012
Date: Sat, 28 Apr 2012 10:39:55 -0500 (CDT)
From: Robert Bonomi
Message-Id: <201204281539.q3SFdtir061045@mail.r-bonomi.com>
To: aimass@yabarana.com, wojtek@wojtek.tensor.gdynia.pl
Cc: freebsd-questions@freebsd.org
Subject: Re: UFS Crash and directories now missing
List-Id: User questions

Alejandro Imass wrote:
> On Sat, Apr 28, 2012 at 3:22 AM, Wojciech Puchar wrote:
> >> I somewhat agree, but it wasn't a person. I am the only administrator,
> >> the only one with root access. The jails were effectively moved to the
> >> /usr/local/etc/apache22 of the single that survived at the top level.
> >> I'm thinking something between mount, EzJail, the journal and the way
> >> MySQL created a great deal of head contention, so something must have
> >> gotten corrupted at the directory level like you state, but the
> >> strange part is no _data_ corruption as such, because I was able to
> >> physically archive the jails, move them to the correct directory and
> >
> > no matter what you do FreeBSD DOES NOT ramdomly move directories. if you are
> > sure you didn't move it yourself then it must be machine hardware problem
> > but still unlikely.
>
> After a little more research, ___it it NOT unlikely at all___ that
> under high distress and a hard boot, UFS could have somehow corrupted
> the directory structure, whilst maintaining the data intact.

This is technically accurate, *BUT* the specifics of the quote "corruption"
unquote in the case under discussion make it *EXTREMELY* unlikely that this
is what happened.

99.99+++% of all UFS filesystem "corruption" issues are the result of a
system crash _between_ the time cached 'meta-data' is updated in memory and
the time that data is flushed to disk (a deferred write).

The second most common (and vanishingly rare) failure mode is a powerfail
_as_ a sector of disk is being written -- resulting in 'garbage data' being
written to disk.

The next possibility is 'cosmic rays'. If running on 'cheap' hardware
(i.e., without 'ECC' memory), this can cause a *SINGLE-BIT* error in data
being output.

The fact that the 'corrupted' filesystem passed fsck -without- any reported
errors shows that everything in the filesystem meta-data was consistent.

Given *that*, there are precisely *TWO* ways that the 'results' that have
been reported could have happened.

1) "Something" did a rename(2) -- the system call that mv(1) uses within a
   filesystem -- of the various jail directories 'from' their original
   location to the 'apache' directory. This involves simply *copying* the
   directory entry from the jail's 'parent directory' to the apache
   directory, and then marking the entry in the original parent as
   'unused'. Nothing other than the directory where the jail 'used to
   live', and the directory 'where it was found', is touched.
   This occurred _through_ the system rename function, so all the normal
   'housekeeping' was done properly.

2) It was -not- done through rename(2) -- but that requires a whole
   *series* of "corruptions" of the filesystem, _ALL_ of which had to occur
   in 'exactly' the right way.
   They are:
     1) The -size- (filesystem metadata) of the original parent directory
        had to be changed to reflect the smaller size.
     2) The 'indirect block' info for the original parent directory had to
        be changed to reflect the absence of the block(s) that are no
        longer part of that file.
     3) The _size_ of the apache directory had to be increased to reflect
        the additional block(s) that are now part of that directory.
     4) The 'indirect block' info for the apache directory had to be
        changed to reflect the presence of the new block(s) that are now
        part of that file.

   This requires multiple -hundreds- of bits 'in error', in a minimum of
   FOUR separate disk locations. A -single- failure simply *CANNOT* cause
   all of this.

The probability of a random single-bit error in a gigabyte of RAM is on the
order of one such occurrence in six months. The odds of having multiple
*simultaneous* errors is the probability of a single-bit error raised to
the power of the number of bits in error. E.g., the probability of a
simultaneous 10-bit random error is roughly 1 in 30 million years. The odds
of it being a -specific- ten bits out of that gigabyte is preposterously
small.

The odds of the required specific _multiple-hundreds_ of bits in error
occurring is (conservatively)

    1 in (30 million years)**50 * ((2**30)!) / ((2**9)!)

The first factor, alone, is over 7.1E373 years.

I think it is safe to conclude that the probabilities -greatly- favor
alternative #1.
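
For anyone who wants to check the arithmetic on that first factor, here is
a minimal Python sketch. It only evaluates (30 million)**50; it does not
attempt the factorial term, and it takes the 30-million figure as given
rather than re-deriving it. The names `first_factor` and `sci_notation` are
made up for this illustration. The result is far too large for a float, so
the sketch reads the mantissa and exponent off the exact integer's decimal
digits.

```python
# Sanity check of the first factor only: (30 million) ** 50.
# Python integers are arbitrary precision, so this is exact.
first_factor = 30_000_000 ** 50


def sci_notation(n: int, digits: int = 3) -> str:
    """Format a positive integer as truncated scientific notation.

    Works on the decimal string, because a value this large cannot be
    converted to a float (it exceeds about 1.8E308).
    """
    s = str(n)
    exponent = len(s) - 1                # power of ten of the leading digit
    mantissa = s[0] + "." + s[1:digits]  # truncated, not rounded
    return f"{mantissa}E{exponent}"


print(sci_notation(first_factor))
```

Truncating rather than rounding keeps the printed figure a lower bound,
which is in keeping with the "is over" wording above.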