FreeBSD Mail Archives

Date:      Mon, 1 Feb 2021 15:58:42 +0100
From:      Polytropon <freebsd@edvax.de>
To:        "Matt Emmerton" <matt@gsicomp.on.ca>
Cc:        <freebsd-questions@freebsd.org>
Subject:   Re: Help recovering damaged drive - fsck segfaults, read-only mount looks ok
Message-ID:  <20210201155842.1e529018.freebsd@edvax.de>
In-Reply-To: <012a01d6f81e$3103d390$930b7ab0$@gsicomp.on.ca>
References:  <012a01d6f81e$3103d390$930b7ab0$@gsicomp.on.ca>

On Sun, 31 Jan 2021 17:12:40 -0500, Matt Emmerton wrote:
> Hi,
> 
> I have a FreeBSD-11 machine that I recently upgraded to FreeBSD-12.  It has
> a Sii RAID-1 pair of 1TB drives. 
> A week ago this system got unexpectedly powered off and when it came back
> up, mount refuses to mount my RAID-1 FS because it is dirty.
> fsck runs, but segfaults.  It's clear that the corruption is confusing fsck
> and causing the trap.

First of all: This sounds a lot like the problem that initially
brought me to the FreeBSD mailing list, so maybe the archives and
my memory can help you. I'm not sure if it is really _the same_
kind of problem, but at least it could be some inspiration for
further experiments.

> If I force a mount in readonly mode, I can inspect the drive and at first
> glance, everything seems valid.  Since this machine is used for backups, I
> have lots of other metadata (eg, checksums) and I'm slowly working through to
> see if anything important is damaged.

At this point: STOP.

If your data is important to you, get a copy of it NOW.

A forced r/o mount is a good chance to read your data. Copy
everything you are interested in, because in the worst case, you
might have to initialize the whole filesystem, which implies
data loss. Make sure you're prepared for such an event.
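
A minimal sketch of such a rescue copy, in Python (my assumption for
illustration; cp, tar or rsync do the job just as well). The function
names are made up, and so are the paths in the usage note. It copies
every regular file from the r/o mount to another disk, compares
SHA-256 checksums of source and copy, and reports what could not be
read instead of stopping at the first error.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rescue_tree(source, target):
    """Copy regular files from source to target.

    Returns the list of source paths that could not be copied or whose
    copy does not match the source checksum.
    """
    source, target = Path(source), Path(target)
    failed = []
    for src in sorted(source.rglob("*")):
        # Skip directories, symlinks and special files; directories are
        # recreated on demand below.
        if src.is_symlink() or not src.is_file():
            continue
        dst = target / src.relative_to(source)
        try:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # keeps timestamps where possible
            if sha256_of(src) != sha256_of(dst):
                failed.append(str(src))
        except OSError:
            # A damaged file must not abort the whole rescue run.
            failed.append(str(src))
    return failed
```

Usage would be something like rescue_tree("/mnt/rescue", "/backup/rescue"),
with both paths being placeholders: the first is wherever the damaged
filesystem is mounted read-only, the second is on a different, healthy
disk. Anything in the returned list deserves a second look.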

In all honesty: I wasn't, and I regret it. Always remember
the purpose of backups: You don't need them until it's too
late. :-)

> From some of the stuff that fsck is finding, it's clear that the corruption
> is in a rather large-and-deep directory tree that was recently deleted.
> It's possible that the 'rm -rf' for this was running in the background when
> the system lost power.

In that case, deleted files (or files "scheduled for deletion") can
still be present in the r/o mount. This "delay" once helped me recover
accidentally deleted files (stupid wildcard + fat fingers + brain
already asleep): I turned off the power, booted into single-user mode,
mounted read-only, copied the files (still there!), ran fsck (files
were gone), and then copied the files back into place. As if nothing
happened... :-)

> Is there any way to have fsck be more "selective" in what it checks/repairs?
> It's been a long time since I've done low-level filesystem surgery, but it
> seems to me that if I can prevent it from going off into the weeds (and
> trying to repair inode entries that are no longer relevant), all will be
> well.

Yes. There is a "preen mode" (fsck -p) and a forced mode (fsck -f).
Be careful with specifying -y: it answers "yes" to every question
fsck asks, which is not always what you want. Data loss might happen.

See "man fsck" for details.

> Any advice?  I have thought about doing some inspection with "ls -i" and
> then being very selective in the inodes I get fsck to repair, but that seems
> challenging to get right.

And _that_ is how I finally got my files back (the initial "severe
data loss" problem more than 10 years ago): With ls -i, I determined
the inode of an offending directory, then used fsdb (which I found
out about by reading a reference manual for a GDR UNIX system) to
remove it, and _then_ (!) fsck was able, after two runs, to bring
the filesystem back to a consistent state.

The offending directory was .snap at the root of the filesystem.
Once it was gone, fsck worked as expected.
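
For the inspection step, here is a small sketch in Python (again my
choice for illustration, not what I used back then) that prints
roughly what "ls -i" shows for one directory, including dot entries
such as .snap. The function name and the mount point in the usage
note are hypothetical. It only reads directory entries; the removal
itself still has to be done with fsdb on the unmounted filesystem.

```python
import os


def list_inodes(directory):
    """Return (inode, name) pairs for the entries of a directory.

    Dot files are included; "." and ".." are not, because os.scandir()
    never yields them.
    """
    entries = []
    with os.scandir(directory) as iterator:
        for entry in iterator:
            # entry.inode() is the inode number of the directory entry.
            entries.append((entry.inode(), entry.name))
    return sorted(entries, key=lambda pair: pair[1])


def print_inodes(directory):
    """Print one 'inode name' line per entry, similar to ls -i."""
    for inode, name in list_inodes(directory):
        print(f"{inode:>10} {name}")
```

Calling print_inodes("/mnt/rescue") (placeholder path for the r/o
mount) gives you the inode numbers to note down before you unmount
and start fsdb.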

Also note that fsck _might_ have problems (or require a second run)
when dealing with soft updates and UFS journal.

If fsck encounters an unallocated, but not "free" inode, it will
store its content in the lost+found/ directory at the root of the
filesystem. It is possible that the whole deleted tree appears
there. So check this location once the system has come up properly.
You can then delete its content, if you wanted to delete those
files anyway.

Up to that point, I had already read McKusick's UFS paper, the code
of fsck_ffs (UFS fsck) and many other resources about how things
worked; I modified the fsck program, debugged it, examined dumps;
I learned data recovery tools (such as TSK and "UFS Explorer"),
forensic strategies and "What you should have done" - and still I
couldn't find out why fsck had "hiccups" and could not proceed. None
of those resources mentioned that some directory entry (for a feature
that I never used!) was the problem. At least in my case, I got all
(!) my data back; just a few hundred filenames were missing (the files
were unallocated, but present), and from the content it was no problem
to finally re-instantiate those that mattered.

However, it's possible that you're facing an entirely different
problem where fsck won't be able to get the filesystem back into
a consistent state, and backup - newfs - restore is your only
option.

All the best, and I hope you can solve that problem. It's one of
those rare cases that teach you a lot about how the UFS filesystem
works. :-)

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20210201155842.1e529018.freebsd>
