Date:      Tue, 1 May 2012 00:58:10 -0500 (CDT)
From:      Robert Bonomi <bonomi@mail.r-bonomi.com>
To:        freebsd-questions@freebsd.org
Subject:   Re: UFS Crash and directories now missing
Message-ID:  <201205010558.q415wAFu091478@mail.r-bonomi.com>
In-Reply-To: <CAF6rxgksNAC2PguE6jzPtBauNZig9VWG--UmTt_fGVB7PytonA@mail.gmail.com>

Eitan Adler <lists@eitanadler.com> wrote:
> On 30 April 2012 07:36, Robert Bonomi <bonomi@mail.r-bonomi.com> wrote:
> > A competent, "not stupid", sysadmin would know these things.  And not
> > 'remove all doubt' (in the words of Abraham Lincoln), by raising such
> > nonsense questions.
>
> A competent sysadmin would ask questions when they don't know the
> answer bringing up possibilities they thought about.
> A stupid sysadmin would yell at someone asking a question claiming
> they should have known the answer.

An informed critic would have recognized that the 'lack of knowledge' issue,
and the 'nonsense questions' were two -entirely- different matters. <grin>

One who lacks knowledge of system fundamentals and asks questions _about_
_the_fundamentals_ that they do not understand is not subject to
criticism -- they are educable.

Those who make grossly false-to-fact assumptions about the behavior of those 
fundamentals, and extrapolate wildly from those erroneous assumptions
cannot be engaged in rational conversation -without- hauling them back
to the initial erroneous assumptions, and correcting those errors.  And,
when that is done, it invalidates everything extrapolated from the false
premise.

Those who continue to extrapolate wildly in such manner cannot be helped.

It was also established that the OP's descriptions were woefully incomplete
and unreliable.  A second disk was involved.  'dangerously dedicated' or
otherwise?  partitioning?  slices? label type?  There is indirect indication
'everything of interest' was on a single slice, but that is only an inference.
There's no indication of where _in_the_filesystem_ on the slice the
jails' '/' directories were located, or by what names they were known to the
system outside the jail.  The 'pattern' of the names, and placement in the 
hierarchy _is_ likely of some significance. As are (a) ownership, (b)
permissions, and (c) 'flags', of (1) the original 'containing' directory,
(2) the external view of the jail '/' directories in that directory, and
(3) 'where they ended up'.  It is likely that that 'external view'
(pre-problem) of the jail '/'s does not exist -- unless one had historical data
from before the problem.  "Everything" was running in jails.  Except for
things that weren't.

For any constructive analysis of "what happened", one needed to capture *all*
the bits in the directory (itself) where the jails ended up -- a directory
'listing', e.g. 'ls' (regardless of options), is not sufficient -- and the 
same for the directory where they 'should have been', plus a copy of the 
slice's complete inode table -- i.e., from _all_ the cylinder groups.  Then 
one would examine the 'last modified' timestamp on the directory where the 
jails were found, and -then- the timestamps on the jail directories 
themselves. 
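
A rough sketch of just the timestamp comparison, in Python.  The function
names are mine, invented for illustration.  It only sees what the kernel
chooses to report through lstat(2), so it is no substitute for the raw
directory and inode capture described above, and it should be pointed at
a read-only mount or an image copy, since reading a directory can update
its access time.

```python
import os
import time


def stamps(path):
    """Return the three timestamps the kernel reports for 'path' itself."""
    st = os.lstat(path)  # lstat: do not follow a symlink to its target
    return {"atime": st.st_atime, "mtime": st.st_mtime, "ctime": st.st_ctime}


def timestamp_report(directory):
    """Map the directory (as '.') and each immediate entry to its timestamps."""
    # Record the directory's own stamps first, before listing it, because
    # the listing itself may update the directory's access time.
    report = {".": stamps(directory)}
    for name in sorted(os.listdir(directory)):
        report[name] = stamps(os.path.join(directory, name))
    return report


def show(directory):
    """Print one line per entry with human-readable local times."""
    for name, ts in timestamp_report(directory).items():
        text = {k: time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(v))
                for k, v in ts.items()}
        print(name, text)
```

One would run show() on the directory where the jails were found, and
compare the '.' line against the lines for the jail directories.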

Among other things, this data allows one to establish whether or not the
jail directories were ever _really_ where one thought they were, or whether
they just 'appeared' to be there, e.g. due to nullfs, or a 'link'.  And an 
'initial estimate' of -when- it may have happened.  (if 'malice' is involved,
or certain kinds of backup/restore activities, the timestamps _may_ not be 
accurate, but they are a 'best available' guess.)
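
For the 'were they ever really there' question, a minimal sketch of a
first-pass check on a path that still exists, again in Python.  The name
classify() is hypothetical.  I would expect a nullfs mount to be reported
as a mount point by this test, but that is an assumption to verify on the
system in question, e.g. against the output of mount(8).

```python
import os


def classify(path):
    """Describe how 'path' is attached to the directory that contains it."""
    if os.path.islink(path):
        # The name is only a pointer; the data lives wherever it points.
        return "symlink -> " + os.readlink(path)
    if os.path.ismount(path):
        # Some other filesystem is mounted over this name.
        return "mount point"
    st = os.lstat(path)
    if not os.path.isdir(path) and st.st_nlink > 1:
        # A non-directory with several names referring to the same inode.
        return "hard-linked file (%d links)" % st.st_nlink
    return "ordinary entry"
```

This can only say what a name is _now_.  It says nothing about a mount or
link that has already gone away, which is where the historical data comes in.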

Capturing -all- the data from the 'where they were' directory, allows one
to examine the 'deleted' entries -- where one _should_ find entries for
the jails, and 'last accessed' timestamps which put a lower bound on when
the 'move' occurred.
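
As an illustration of what 'examining the deleted entries' can look like,
here is a sketch that walks one raw directory block and looks in the slack
space of each record for leftover entries.  It is built on assumptions:
that the entry header is the usual UFS 'struct direct' layout (32-bit
inode, 16-bit record length, 8-bit type, 8-bit name length), that the
bytes are little-endian, and that a removed entry was simply absorbed
into its predecessor's record length with its bytes left in place.  Some
systems scrub or compact removed entries, in which case nothing will be
found.  All function names here are hypothetical.

```python
import struct

# Assumed on-disk header of a UFS 'struct direct' on a little-endian
# machine: d_ino (u32), d_reclen (u16), d_type (u8), d_namlen (u8).
HEADER = struct.Struct("<IHBB")


def min_size(namlen):
    """Smallest record holding a name: header + name + NUL, 4-byte aligned."""
    return (HEADER.size + namlen + 1 + 3) & ~3


def parse_at(block, off, limit):
    """Decode the entry at 'off'; return None if the bytes are not plausible."""
    if off + HEADER.size > limit:
        return None
    ino, reclen, _dtype, namlen = HEADER.unpack_from(block, off)
    if reclen % 4 or reclen < min_size(namlen) or off + reclen > limit:
        return None
    if namlen == 0 and ino != 0:
        return None
    start = off + HEADER.size
    name = block[start:start + namlen].decode("ascii", "replace")
    return ino, reclen, namlen, name


def scan_block(block):
    """Return (live, stale) lists of (inode, name) found in one block."""
    live, stale = [], []
    off = 0
    while off < len(block):
        entry = parse_at(block, off, len(block))
        if entry is None:
            break  # unexpected data; stop rather than guess
        ino, reclen, namlen, name = entry
        if ino != 0:
            live.append((ino, name))
        elif namlen:
            stale.append((ino, name))  # unused slot with a leftover name
        # Space between the end of this name and the end of the record may
        # still hold entries that were merged into it when they were removed.
        pos, end = off + min_size(namlen), off + reclen
        while pos < end:
            old = parse_at(block, pos, end)
            if old is not None and old[0] != 0:
                stale.append((old[0], old[3]))
                pos += min_size(old[2])
            else:
                pos += 4
        off += reclen
    return live, stale
```

Anything this turns up is a lead, not proof; a run of name bytes can
occasionally look like a header, so each hit has to be checked against
the inode table.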

When the 'apparently impossible' happens, it is *VERY*OFTEN* the case that
'reality' is *NOT* what someone 'knows' it is.  No matter how 'obvious' it
is, one has to =verify=.  

It is also _FAR_ 'easier to believe' that (especially) a nullfs mount (or,
less likely, a hard link) disappeared, than directories actually got moved.
The move may well have happened, but one must 'positively' eliminate the 
'more plausible' alternatives first.  Things that would 'give the appearance'
of what was reported, but from -very- different causations.

Of course, to capture this kind of information, one has to know "what's
where" in the filesystem metadata, and have means to capture it _without_ 
changing any of that data.  And _that_ means that you have to have a fair
understanding of the mechanics of how the filesystem works.  Which rapidly
leads into gory details of how the O/S does disk I/O, and the various
performance optimizations (and trade-offs) employed.

Reading _both_ of McKusick's "Design of .." books, and the 'Unix System
Administration Handbook', by Nemeth, et al., is a good _start_.

Having a bunch of the books from O'Reilly & Assoc. (<http://www.ora.com>),
especially for 'standard' tools that you need to get the most out of, is
also highly recommended.  

Disclaimer:  I know a lot of the authors of those books, personally.


