Date: Mon, 1 Feb 2021 15:58:42 +0100
From: Polytropon <freebsd@edvax.de>
To: "Matt Emmerton" <matt@gsicomp.on.ca>
Cc: <freebsd-questions@freebsd.org>
Subject: Re: Help recovering damaged drive - fsck segfaults, read-only mount looks ok
Message-ID: <20210201155842.1e529018.freebsd@edvax.de>
In-Reply-To: <012a01d6f81e$3103d390$930b7ab0$@gsicomp.on.ca>
References: <012a01d6f81e$3103d390$930b7ab0$@gsicomp.on.ca>

On Sun, 31 Jan 2021 17:12:40 -0500, Matt Emmerton wrote:
> Hi,
>
> I have a FreeBSD-11 machine that I recently upgraded to FreeBSD-12. It has
> a Sii RAID-1 pair of 1TB drives.
> A week ago this system got unexpectedly powered off and when it came back
> up, mount refuses to mount my RAID-1 FS because it is dirty.
> fsck runs, but segfaults. It's clear that the corruption is confusing fsck
> and causing the trap.

First of all: This sounds a lot like the problem that initially
brought me to the FreeBSD mailing list, so maybe the archives
and my memory can help you. I'm not sure if it is really _the
same_ kind of problem, but at least it could be some inspiration
for further experiments.

> If I force a mount in readonly mode, I can inspect the drive and at first
> glance, everything seems valid. Since this machine is used for backups, I
> have lots of other metadata (e.g., checksums) and I'm slowly working through to
> see if anything important is damaged.

At this point: STOP.

If your data is important to you, get a copy of it NOW. A forced
r/o mount is a good chance to read your data. Copy everything
you are interested in, because in the worst case, you could have
to initialize the whole filesystem, which implies data loss.
Make sure you're prepared for such an event. In all honesty:
I wasn't, and I regret it.

Always remember the purpose of backups: You don't need them
until it's too late. :-)

> From some of the stuff that fsck is finding, it's clear that the corruption
> is in a rather large-and-deep directory tree that was recently deleted.
> It's possible that the 'rm -rf' for this was running in the background when
> the system lost power.

Therefore deleted files (or "scheduled for deletion") can still
be present in the r/o mount.
This "delay" once helped me recover accidentally deleted files
(stupid wildcard + fat fingers + brain already asleep) - turned
off power, booted into single-user mode, mounted read-only,
copied files (still there!), ran fsck (files were gone), and
then copied files back into place. As if nothing happened... :-)

> Is there any way to have fsck be more "selective" in what it checks/repairs?
> It's been a long time since I've done low-level filesystem surgery, but it
> seems to me that if I can prevent it from going off into the weeds (and
> trying to repair inode entries that are no longer relevant), all will be
> well.

Yes. There is a "preen mode" (fsck -p) and a forced mode (fsck -f).
Be careful with specifying -y, it does not always do what you
want it to do. Data loss might happen. See "man fsck" for details.

> Any advice? I have thought about doing some inspection with "ls -i" and
> then being very selective in the inodes I get fsck to repair, but that seems
> challenging to get right.

And _that_ is how I finally got my files back (the initial "severe
data loss" problem more than 10 years ago): With ls -i, I determined
the inode of an offending directory, then used fsdb (which I found
out about reading a reference manual about a GDR UNIX system) to
remove it, and _then_ (!) fsck was able, after two runs, to bring
the filesystem back to a consistent state. The offending directory
was .snap at the root of the filesystem. Once it was gone, fsck
worked as expected.

Also note that fsck _might_ have problems (or require a second
run) when dealing with soft updates and UFS journal.

If fsck encounters an unallocated, but not "free" inode, it will
store its content in the lost+found/ directory at the root of the
filesystem. It could be possible that the whole deleted tree
appears there. So check this location after the system has come
up properly. You can then delete its content, if you wanted to
delete those files anyway.
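
As a minimal sketch of the first step described above (finding the
inode of a suspect directory with ls -i while the filesystem is
mounted read-only), something like the following could be used. The
function name inode_of and the path /mnt/.snap are only assumptions
for illustration; substitute your own mount point and directory.

```shell
#!/bin/sh
# Print the inode number of a directory as reported by "ls -di".
# -d lists the directory itself instead of its contents,
# -i prefixes the entry with its inode number.
inode_of() {
    ls -di -- "$1" | awk '{ print $1 }'
}

# Hypothetical example: a .snap directory at the root of a filesystem
# that is mounted read-only on /mnt. Pass another path as $1 to override.
target="${1:-/mnt/.snap}"

if [ -d "$target" ]; then
    echo "inode of $target: $(inode_of "$target")"
else
    echo "no such directory: $target" >&2
fi
```

This only reads from the filesystem. What to do with the number
afterwards (for example, looking at that inode in fsdb opened
read-only) depends on your situation; check fsdb(8) before changing
anything, and only after your data has been copied elsewhere.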

Up to that point, I had already read McKusick's UFS paper, the
code of fsck_ffs (UFS fsck) and many other resources about how
things worked; I modified the fsck program, debugged it, examined
dumps; I learned data recovery tools (such as TSK and "UFS Explorer"),
forensic strategies and "What you should have done" - I couldn't
find out why fsck had "hiccups" and could not proceed. None of
these resources mentioned that some directory entry (for a feature
that I never used!) was the problem.

At least in my case, I got all (!) my data back, just a few hundred
filenames were missing (unallocated, but present), but from the
content, it was no problem to finally re-instantiate those that
mattered.

However, it's possible that you're facing an entirely different
problem where fsck won't be able to get the filesystem back into
a consistent state, and backup - newfs - restore is your only
option.

All the best, and I hope you can solve that problem. It's one
of the very few cases that can happen, and which teach you a
lot about how the UFS filesystem works. :-)

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20210201155842.1e529018.freebsd>