Date: Sun, 2 Nov 2008 05:06:01 +0100
From: Polytropon <freebsd@edvax.de>
To: FreeBSD FS <freebsd-fs@freebsd.org>
Subject: Repairing a defective UFS 2 partition with fsck_ffs (or other means)
Message-ID: <20081102050601.9fccb80f.freebsd@edvax.de>
Dear list,

I need your help in order to solve one of the strangest and most complicated problems existing in this universe. First of all, I'd like to mention that I have been using FreeBSD nearly exclusively (along with Solaris and other UNIXes) for many years, and I never had any problem similar to this. In fact, I never had *any* problem that required external help. But now I'm lost. I don't know what to try, so I would be glad about any suggestion you could give me. I'm familiar with FreeBSD, shell scripting and C. My skills cover the usual "admin things".

The accident that happened to me is a very strange thing, strange in that the usual means of solving such a problem don't seem to fit. In fact, I'm the second (!) person on earth who has encountered this problem, as far as my investigations revealed. So I'm not sure if it's solvable at all.

In order to explain what it's about, I'd like to follow this path:

    1. What initially happened? (impact)
    2. How does the problem occur? (examination)
    3. What seems to be the reason? (diagnosis)
    4. What did I try to solve the problem? (treatment)
    5. What kind of solution should be possible? (prognosis)

This should help to explain my problem properly. If there's more to know, please ask me. I'll try to answer as precisely as I can. And don't mind my bad English, it's not my native language. It's a long story, sorry. So here I go...

1. What initially happened?
---------------------------

First of all, we're talking about this device:

    ad0: 114473MB <Seagate ST3120022A 3.06> at ata0-master UDMA100

The installation was a FreeBSD 5.4-p something on a 2 GHz P4 machine with 768 MB SDR-SDRAM, which had been working perfectly for many years. The disk contained several partitions (ad0s1a as /, ad0s1d as /var, ad0s1e as /usr and ad0s1f as /home), formatted as UFS 2 with Soft Updates (except for /).
While I was doing some web development (running xterms with Midnight Commander and its editor, and Opera), the system suddenly froze. Some seconds later, it rebooted. The last message on VT 0 was something like this, if I remember correctly:

    cannot free some inode: already free
    automatic reboot

When the system came up again, I relied on fsck_ffs to solve all possible problems, as I knew it from the past. The result: many defects in the file system contents. Most of them didn't matter (I can reinstall), but fsck_ffs wouldn't make the /home partition completely accessible again. I could copy the content of the archive and all the other users' home directories (luckily), but under no circumstances could I access my own (!) home directory again. HEART ATTACK!!!

Of course, I didn't have a good backup (the last one was many years old). This is because I never encountered any problems, so I got lazy. Okay, that seems to be the revenge now. When you don't do your backups, something will happen. If you do your backups, nothing will happen, and you won't need them at all. That's their purpose. I'm sure you're familiar with this wisdom. :-)

We're talking about documentation, mail archives, program sources and various projects here, data collections created in many years of hard work. So it's understandable that I want to get the stuff back as completely as possible.

2. How does the problem occur?
------------------------------

The problem occurred at system startup when running fsck_ffs.

    ** /dev/ad1s1f
    ** Last Mounted on /home
    ** Phase 1 - Check Blocks and Sizes
    1035979 BAD I=259127 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1101472 DUP I=260035 UNEXPECTED SOFT UPDATE INCONSISTENCY
    [...]
    1117681 DUP I=260039 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1117682 DUP I=260039 UNEXPECTED SOFT UPDATE INCONSISTENCY
    EXCESSIVE DUP BLKS I=260039
    CONTINUE? yes
    [...]
    3774433638169537379 BAD I=260051 UNEXPECTED SOFT UPDATE INCONSISTENCY
    7021223365635213949 BAD I=260051 UNEXPECTED SOFT UPDATE INCONSISTENCY
    8030898235988077411 BAD I=260051 UNEXPECTED SOFT UPDATE INCONSISTENCY
    7310315658325879925 BAD I=260051 UNEXPECTED SOFT UPDATE INCONSISTENCY
    EXCESSIVE BAD BLKS I=260051
    CONTINUE? yes
    [...]
    1485568 DUP I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1485569 DUP I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1485570 DUP I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1485571 DUP I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1485572 DUP I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1485573 DUP I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1485574 DUP I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    1485575 DUP I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    5707022222514874728 BAD I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    8091332836184380774 BAD I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    8598589197767749681 BAD I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    [...]
    3631363939722683732 BAD I=290557 UNEXPECTED SOFT UPDATE INCONSISTENCY
    EXCESSIVE BAD BLKS I=290557
    CONTINUE? yes
    INCORRECT BLOCK COUNT I=290557 (3104 should be 736)
    CORRECT? yes
    fsck_ffs: bad inode number 306176 to nextinode

As is obvious, fsck_ffs fails in phase 1. No recovery is done. In my opinion, this indicates a major defect of the file system. Maybe many defects, one worse than the other. If fsck_ffs can't repair it, it must be really bad.

Okay, I took the opportunity to switch to a new hard disk on which I had already installed FreeBSD 7. Why? Because other partitions were damaged, too. On /dev/ad0s1a, /, nothing significant happened, but on /dev/ad0s1e, /usr, for example, the whole X11R6/ subtree disappeared, and lost+found/ filled up with many directory fragments. So I could not use the system anymore. I put in the new disk as ad0 and the former ad0 disk as ad1, and repeated the check on which fsck_ffs from version 5 had failed with fsck_ffs from version 7.
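
A side note on the BAD entries in the fsck_ffs output above, offered as a guess rather than a diagnosis: the huge "block numbers" do not look like random garbage. Read as eight little-endian bytes, which is how a 64-bit UFS 2 block pointer would be laid out on disk on an i386 machine, the first of them consists entirely of printable ASCII characters. That would fit file data sitting where fsck_ffs expects block pointers. A minimal sketch in C; the function name blkno_to_text is hypothetical and not part of fsck_ffs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of fsck_ffs): write the eight bytes of a
 * 64-bit block number, lowest byte first (little-endian), into out[0..7],
 * terminate the string, and return 1 if all eight bytes are printable
 * ASCII, 0 otherwise.
 */
static int
blkno_to_text(uint64_t blkno, char out[9])
{
	int i, printable = 1;

	assert(out != NULL);
	for (i = 0; i < 8; i++) {
		unsigned char c = (unsigned char)(blkno >> (8 * i));

		if (c < 0x20 || c > 0x7e) {
			printable = 0;
			c = '.';	/* placeholder for a non-printable byte */
		}
		out[i] = (char)c;
	}
	out[8] = '\0';
	return (printable);
}
```

Called with 3774433638169537379ULL, the function fills the buffer with "ckage{a4", which looks like a fragment of LaTeX source, while a plausible block number such as 1035979 is reported as non-printable. Whether the other BAD values decode to text in the same way has not been checked here.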
Note that no matter by which other name I called fsck_ffs, be it fsck_ufs or fsck_4.2bsd, the problem stayed the same.

In order to do some tests, I made a 1:1 copy of the defective partition. This is a wise step, because I can't accidentally damage important data, and if I mess up a copy, I can pull a new one. FreeBSD's dd program did the job well. It ran approx. 4 hours without any error message. The defect(s) of the disk partition are replicated 1:1 in the image.

    % cd ~/rescue
    % dd if=/dev/ad1s1f of=ad1s1f.dd bs=1m
    86566+1 records in
    86566+1 records out
    90772014080 bytes transferred in 15156.804004 secs (5988862 bytes/sec)

The file size of ad1s1f.dd seemed to be good, and the partition contained in this file was correctly recognized:

    % file ad1s1f.dd
    ad1s1f.dd: Unix Fast File system [v2] (little-endian) last mounted on /mnt,
    last written at Wed Jul 2 18:51:06 2008, clean flag 0, readonly flag 0,
    number of blocks 44322272, number of data blocks 42925108,
    number of cylinder groups 472, block size 16384, fragment size 2048,
    average file size 16384, average number of files in dir 64,
    pending blocks to free 0, pending inodes to free 0, system-wide uuid 0,
    minimum percentage of free blocks 8, TIME optimization

Of course, I tried to mount and access the partition's copy using the vnode mechanism for memory disks:

    % sudo mdconfig -a -t vnode -u 10 -f ad1s1f.dd
    % mount -o ro /dev/md10 mnt/

Fine, mount worked, so I could see what's on the disk.

    +<-/export/home/poly/rescue/mnt------v>+
    |      Name       | Size  |   MTime    |
    |/..              |UP--DIR|            |
    |/.snap           |    512|Dec 21  2004|
    |/archiv          |    512|Feb 27  2006|
    |/backup          |    512|Sep 23  2005|
    |/gast            |   1024|Aug 25  2005|
    |/lost+found      |   2048|Jul  1 10:15|
    |/markus          |    512|Nov 20  2003|
    |/root            |   1024|Apr 18 16:17|
    |/surf            |   1024|Feb 17  2005|
    | .fsck_snapshot  | 86567M|Jun 30 20:47|
    |?poly            |      0|Jan  1  1970| <===
    +--------------------------------------+
    |/..                                   |
    +--------------------------------------+
    poly@r55:~/rescue/mnt%                                         [^]
     1Help 2Menu 3View 4Edit 5Copy 6RenMov 7Mkdir 8Delete 9PullDn 10Quit

Within the Midnight Commander, the name of the home directory was marked in red and with a leading question mark. Do you recognize the timestamp? Strange. Furthermore, I could not change into this directory.

    % cd mnt/poly
    mnt/poly: Not a directory.
    % file mnt/poly
    mnt/poly: cannot open `mnt/poly' (Bad file descriptor)

But I didn't give up hope yet. The data from within the home directory seemed to be present. The corresponding inodes don't seem to be marked as unused. I think this is what is called "orphan inodes"? Where do I take this idea from? There's an interesting match in the disk occupation figures that I found when trying some df and du examinations:

    % df -h
    Filesystem    Size    Used   Avail Capacity  Mounted on
    /dev/md10      82G     75G    716M    99%    /export/home/poly/rescue/mnt

At this point, a strange situation already occurs: the disk is 82 GB, 75 GB are used, but less than 1 GB is free. So there's something missing? I remember that at the time the disk went mad there were only approx. 700 MB free on /home. This matches the numbers above. But where's the rest?

    % sudo du -sch mnt
    du: mnt/poly: Bad file descriptor
    du: mnt/archiv/cr/clips.w32/s01.wmv: Bad file descriptor
    du: mnt/archiv/cr/clips.w32/s02.wmv: Bad file descriptor
     52G    mnt
     52G    total

The disk is 82 GB, 75 GB are used, and the data structures that are still reachable make up 52 GB. So there must be approx. 20 GB somewhere. This could be the content of my home directory, the important data, my life, the universe, and everything. :-)

Furthermore, you'll see two more "Bad file descriptor" warnings inside the archive directory. They don't matter, but they surely indicate that more than just the inode of my home directory died. So more problems can occur while proceeding.
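
The df and du figures above can be reconciled with a little arithmetic. What follows is a sketch under two assumptions, not a statement about this particular file system: that df computes Avail the way UFS usually does, after subtracting the minfree reserve (8 percent according to the file output above), and that the "number of data blocks" reported by file counts fragments of 2048 bytes. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fragments withheld from ordinary users: minfree percent of all data fragments. */
static uint64_t
reserved_frags(uint64_t data_frags, unsigned minfree_pct)
{
	assert(minfree_pct <= 100);
	return (data_frags * minfree_pct / 100);
}

/* Convert a fragment count to bytes for a given fragment size. */
static uint64_t
frags_to_bytes(uint64_t frags, uint32_t frag_size)
{
	assert(frag_size != 0);
	return (frags * frag_size);
}

/* Space that df counts as used but du cannot reach through the directory tree. */
static int64_t
unreachable(int64_t used, int64_t reachable)
{
	return (used - reachable);
}
```

With 42925108 data fragments of 2048 bytes, the total is 87910621184 bytes (about 82 GB, matching the Size column), and the 8 percent reserve is 3434008 fragments or 7032848384 bytes, roughly 6.5 GB. Size minus Used minus that reserve leaves well under 1 GB, which is consistent with the 716M shown as Avail, so the df line by itself may not indicate damage. The gap that matters is unreachable(75, 52), about 23 GB that df counts as used but du cannot reach.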
Of course, checking the dd copy of the partition with fsck_ffs, directly or via the md device, gives the same error message as already mentioned.

There was a file /.fsck_snapshot of the same size as the partition. This file could be mounted, too, and within it there was a very old copy of my home directory. The snapshot was taken at the time when I initially installed and configured this system, so it was very old, too old.

3. What seems to be the reason?
-------------------------------

The reason seems to be that the inode describing my home directory doesn't exist anymore. This explains why its name is still there (stored in the directory entries of the root directory), but no further information about the file type (here: directory) and its content is available. But after all, this does not explain why neither fsck_ffs nor any other program can repair the partition. This is where my troubles understanding what happened start.

4. What did I try to solve the problem?
---------------------------------------

As I already mentioned, FreeBSD's fsck_ffs is unable to repair the partition.

    fsck_ffs: bad inode number 306176 to nextinode

Using FreeBSD's clri, I tried to clear the inode that I thought would cause the problem for fsck_ffs:

    % sudo mdconfig -a -t vnode -u 10 -f ad1s1f.dd
    % clri 306176 /dev/md10
    % sync

This didn't work at all.

I've tried other versions of fsck_ffs, too, running on my main machine or another one, from FreeBSD 5, 6 and 7. The only difference was a FreeBSD 5 system where fsck_ffs crashed within phase 1 with this message:

    fsck_ffs: cannot alloc 1073796864 bytes for inoinfo

It seems that this particular machine didn't have enough RAM installed. And no matter whether I checked the original partition or the copy I made with dd, the problem was always the same.

So then I tried an alternative to FreeBSD's dd, hoping that some "magical translation" would happen.
My first choice was ddrescue from the ports:

    % ddrescue -d -r 3 -n /dev/ad1s1f ad1s1f.ddr logfile
    Press Ctrl-C to interrupt
    Initial status (read from logfile)
    rescued:         0 B,  errsize:       0 B,  errors:       0
    Current status
    rescued:    90772 MB,  errsize:       0 B,  current rate:    6815 kB/s
       ipos:    90772 MB,   errors:       0,    average rate:    6723 kB/s
       opos:    90772 MB
    Finished

The file ad1s1f.ddr was exactly the same as ad1s1f.dd, so no gain of hope here.

Another idea was to copy data from the original disk using FreeBSD's fetch program (fetch -rR). Nope.

Even FreeBSD's recoverdisk, run on the partition or its copy, just produced another 1:1 copy including the problem.

    % recoverdisk ad1s1f.dd ad1s1f.rd
            start    size  block-len  state         done  remaining     % done
      90771030016  984064     984064      0  90772014080          0  100.00000
    Completed

After this, I tried some "hardcore stuff": The Sleuth Kit from the ports, and first its dls program:

    % dls -v -f ufs -i raw ad1s1f.dd > ad1s1f.dls
    File system is corrupt (ffs_group_load: Group 12 descriptor offsets too large at 1129104)

Although it didn't help me either, the error message is interesting: "Group 12 descriptor offsets too large at 1129104". Sadly, I don't know how to interpret it. Is 1129104 an inode? If yes: it's not allocated. What group is meant? Cylinder group? Maybe you could tell me.

Another program from The Sleuth Kit, fls, allowed me to see some content of the partition. In fact, it even showed data that wasn't accessible, so it's within the range of the files that need to be restored.

    % fls -i raw -r ad1s1f.dd
    [...]
    d/- * 259072(realloc): poly
    + d/d * 3438592(realloc): 2003-05-17
    [...]
    +++ d/d 5840896: brazil
    ++++ r/r 5840897: kate_bush_-_brazil.mp3
    ++++ r/r 5840898: shangrila_towers.mp3
    ++++ r/r 5840899: singing_telegram.mp3
    ++++ r/r 5840900: the_first_noel.mp3
    Segmentation fault (core dumped)

So I checked:

    % fsdb -r ad1s1f.dd
    ad1s1f.dd is not a disk device
    CONTINUE? [yn] y
    ** ad1s1f.dd
    Editing file system `ad1s1f.dd'
    Last Mounted on /export/home/poly/rescue/mnt
    fsdb (inum: 2)> inode 3438592
    current inode: directory
    I=3438592 MODE=40700 SIZE=512
            BTIME=Nov 30 14:31:57 2007 [0 nsec]
            MTIME=Jun 26 05:06:14 2008 [0 nsec]
            CTIME=Jun 26 05:06:14 2008 [0 nsec]
            ATIME=Jul 1 21:13:05 2008 [0 nsec]
    OWNER=poly GRP=staff LINKCNT=2 FLAGS=0 BLKCNT=4 GEN=4803f917
    fsdb (inum: 3438592)> ls
    slot 0 ino 3438592 reclen 12: directory, `.'
    slot 1 ino 447497 reclen 12: directory, `..'
    slot 2 ino 3438593 reclen 24: regular, `.sylpheed_mark'
    slot 3 ino 283193 reclen 12: regular, `1'
    slot 4 ino 289966 reclen 12: regular, `2'
    slot 5 ino 289970 reclen 12: regular, `3'
    slot 6 ino 3438620 reclen 24: regular, `.sylpheed_cache'
    slot 7 ino 290363 reclen 12: regular, `4'
    slot 8 ino 290366 reclen 12: regular, `5'
    slot 9 ino 290385 reclen 12: regular, `6'
    slot 10 ino 290444 reclen 368: regular, `7'
    fsdb (inum: 3438592)> inode 259072
    current inode 259072: unallocated inode
    fsdb (inum: 259072)> quit

    ***** FILE SYSTEM STILL DIRTY *****

    *** FILE SYSTEM MARKED DIRTY
    *** BE SURE TO RUN FSCK TO CLEAN UP ANY DAMAGE
    *** IF IT WAS MOUNTED, RE-MOUNT WITH -u -o reload

Although the directory's name "2003-05-17" indicates that it should hold pictures from the cam/ subtree, its content seems to be a Sylpheed MH mail directory. According to fls's output, inodes 3438592 and 259072 have been reallocated. And remember 259072? This was my home directory, I think.
Another program from the ports, scan_ffs, only confirmed what I already knew:

    % scan_ffs -lv /dev/md10
    block 128 id 3f67c4e6,354efde8 size 44322272
    block 160 id 3f67c4e6,354efde8 size 44322272
    X: 177289088 0 4.2BSD 2048 16384 0 # /export/home/poly/rescue/mnt
    block 12032 id 616e732e,c0690070 size 44322272
    block 12416 id 3f67c4e6,354efde8 size 44322272
    block 13248 id 6e73746a,c3577600 size 44322272
    block 376512 id 3f67c4e6,354efde8 size 44322272
    block 752864 id 3f67c4e6,354efde8 size 44322272
    block 1129216 id 3f67c4e6,354efde8 size 44322272
    block 1505568 id 3f67c4e6,354efde8 size 44322272
    [...]

The 4.2BSD partition is still there and intact, okay. The program testdisk, also available from the ports, seems to have the same purpose. But a lost partition is not the real problem, I think.

Another approach I found is to avoid looking at the file system at all and instead parse the disk "byte-wise", looking for magic bytes. A tool that does so is magicrescue from the ports.

    % magicrescue -r /usr/local/share/magicrescue/recipes -d mr_output /dev/md10
    Read error on /dev/md10 at 102400 bytes: Invalid argument

It didn't work on the memory disk, but fortunately it did on the dd copy:

    % magicrescue -r /usr/local/share/magicrescue/recipes -d mr_output ad1s1f.dd

The files recovered by this program were of many different types, such as JPG images or MP3 files. Furthermore, files from within the inaccessible home directory were restored. This is another hint that the data should still be there. But sadly, the directory structure could not be retrieved, so I got lots of stuff in one directory.

From the manual of the program ffs2recov from the ports, I found out that it's possible to create an inode for which you can explicitly specify name and number. So I tried:

    % cd ~/rescue
    % ffs2recov -c 259072 -n poly ad1s1f.dd

This created a file called "poly" in the ~/rescue directory. Okay, not what I wanted to get.
So I tried something really stupid: % cd ~/rescue % sudo mdconfig -a -t vnode -u 10 -f ad1s1f.dd % mount -o rw /dev/md10 mnt/ % cd mnt % ffs2recov -c 259072 -n poly ad1s1f.dd % sync panic: ffs_write: type 0xc5d37e04 0 (0,16384) Dumping 136 MB: 121 105 89 73 57 41 (CTRL-C to abort) 25 9 Dump complete Automatic reboot in 15 seconds - press a key on the console to abort Rebooting... [...] ad0: 305245MB <WDC WD3200JB-00KFA0 08.05J08> at ata0-master UDMA100 ad1: 305245MB <WDC WD3200JB-00KFA0 08.05J08> at ata0-slave UDMA100 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=64 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=64 ad1: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=64 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0 ad1: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=0 
    savecore: reboot after panic: ffs_write: type 0xc5d37e04 0 (0,16384)

You can imagine my heartbeat going up to 200 at this moment! :-) Fortunately, no data was lost. I've got no idea what happened, but I'm sure my approach was wrong. The system would not react in this way without a proper reason. And note that the ad0 and ad1 you see here are completely different disks; the original 120 GB Seagate disk is on the shelf. The new FreeBSD 7 system is on ad0, and ad1 is reserved for backup purposes. Why does it complain that much? Okay, never mind, it's not important now.

5. What kind of solution should be possible?
--------------------------------------------

In general, there would be two options:

    a) Modify fsck_ffs so it will work.
    b) Modify the file system so fsck_ffs will work.

Of course, I've got no good clue how to do this in particular. Let me first describe what I did to fsck_ffs.

I first took a look at fsck_ffs's source code. Well... sadly, I did not understand very much of it, but I could find the position where the error

    fsck_ffs: bad inode number 306176 to nextinode

came from. It was /usr/src/sbin/fsck_ffs/inode.c line 319:

    if (inumber != nextino++ || inumber > lastvalidinum)
            errx(EEXIT, "bad inode number %d to nextinode", inumber);

Oh how I love disjunctions in exit conditions! :-)

So I made a change to this part, just to see what would happen. (And: yes, I know, "trial & error" is not a programming concept.) I used a copy of the subtrees sbin/fsck_ffs/ + sbin/mount/ and sys/ufs/ffs/ + sys/ufs/ufs/ from /usr/src/, then issued the command "make" from within ~/rescue/sbin/fsck_ffs/, which gave me an executable fsck_ffs in this directory. I copied it to ~/rescue and tested it with the 1:1 dd copy.
    if(inumber != nextino++) {
            printf("--- condition: inumber != nextino++\n");
            printf("--- inumber=%d nextino(++)=%d lastinum=%d\n",
                inumber, nextino, lastinum);
            errx(EEXIT, "bad inode number %d to nextinode", inumber);
    }
    if(inumber > lastvalidinum) {
            printf("--- condition: inumber > lastvalidinum\n");
            printf("--- inumber=%d lastvalidinum=%d, lastinum=%d\n",
                inumber, lastvalidinum, lastinum);
            errx(EEXIT, "bad inode number %d to nextinode", inumber);
    }

This was the result:

    % ./fsck_ffs -yf ad1s1f.dd
    [...]
    --- condition: inumber > lastvalidinum
    --- inumber=306176 lastvalidinum=306175, lastinum=306176
    fsck_ffs: bad inode number 306176 to nextinode

So what's up with inode 306176? When invoking fsdb on this inode, I could see the content of a directory, and ils from The Sleuth Kit revealed that it seems to be a directory within the inaccessible home directory.

    slot 150 ino 306176 reclen 20: directory, `hellraiser'
    slot 1566 ino 306176 reclen 12: directory, `.'

Strange, isn't it?

Finally, I decided to comment out the whole part. After that, I found fsck_ffs complaining in fsutil.c line 139:

    if (inum > maxino)
            errx(EEXIT, "inoinfo: inumber %d out of range", inum);

So I put in another "checkpoint" there:

    printf("---> %d\n", inum);
    if (inum > maxino) {
            printf("--- condition: inum > maxino\n");
            printf("--- inum=%d maxino=%d\n", inum, maxino);
            errx(EEXIT, "inoinfo: inumber %d out of range", inum);
    }

The result was this:

    % ./fsck_ffs -yf ad1s1f.dd
    [...]
    THE FOLLOWING DISK SECTORS COULD NOT BE READ:
    177638368, 177638369, 177638370, 177638371, 177638372, 177638373, 177638374, 177638375,
    177638376, 177638377, 177638378, 177638379, 177638380, 177638381, 177638382, 177638383,
    177638384, 177638385, 177638386, 177638387, 177638388, 177638389, 177638390, 177638391,
    177638392, 177638393, 177638394, 177638395, 177638396, 177638397, 177638398, 177638399,
    177638400, 177638401, 177638402, 177638403, 177638404, 177638405, 177638406, 177638407,
    177638408, 177638409, 177638410, 177638411, 177638412, 177638413, 177638414, 177638415,
    177638416, 177638417, 177638418, 177638419, 177638420, 177638421, 177638422, 177638423,
    177638424, 177638425, 177638426, 177638427, 177638428, 177638429, 177638430, 177638431,
    177638432, 177638433, 177638434, 177638435, 177638436, 177638437, 177638438, 177638439,
    177638440, 177638441, 177638442, 177638443, 177638444, 177638445, 177638446, 177638447,
    177638448, 177638449, 177638450, 177638451, 177638452, 177638453, 177638454, 177638455,
    177638456, 177638457, 177638458, 177638459, 177638460, 177638461, 177638462, 177638463,
    177638464, 177638465, 177638466, 177638467, 177638468, 177638469, 177638470, 177638471,
    177638472, 177638473, 177638474, 177638475, 177638476, 177638477, 177638478, 177638479,
    177638480, 177638481, 177638482, 177638483, 177638484, 177638485, 177638486, 177638487,
    177638488, 177638489, 177638490, 177638491, 177638492, 177638493, 177638494, 177638495,
    --- condition: inum > maxino
    --- inum=11116545 maxino=11116544
    fsck_ffs: inoinfo: inumber 11116545 out of range

Seemed to be an important condition. :-) So what's this again? The answer was in setup.c line 209:

    maxino = sblock.fs_ncg * sblock.fs_ipg;

Is some information retrieved incorrectly from the file system's superblock, causing all the trouble? Well, I did try fsck_ffs checks referring to alternate superblocks, but no luck. Or does it mean that there are 11116544 inodes on the partition?
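
The line maxino = sblock.fs_ncg * sblock.fs_ipg can be turned around to see which cylinder group an inode number belongs to. The sketch below takes two figures from this message, maxino=11116544 and the 472 cylinder groups reported by file, and assumes that inodes are numbered consecutively group by group, which appears to be what the ino_to_cg macro in sys/ufs/ffs/fs.h relies on. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Inodes per cylinder group, derived from the total (ncg * ipg) and ncg. */
static uint32_t
inodes_per_group(uint32_t maxino, uint32_t ncg)
{
	assert(ncg != 0 && maxino % ncg == 0);
	return (maxino / ncg);
}

/* Cylinder group an inode number belongs to. */
static uint32_t
ino_group(uint32_t ino, uint32_t ipg)
{
	assert(ipg != 0);
	return (ino / ipg);
}

/* Position of an inode within its cylinder group (0 = first inode). */
static uint32_t
ino_index(uint32_t ino, uint32_t ipg)
{
	assert(ipg != 0);
	return (ino % ipg);
}
```

With these figures there are 23552 inodes per group, so 11116544 would simply be the total number of inode slots, used or not. Inode 306176 comes out as the first inode of group 13, which makes lastvalidinum=306175 the last inode of group 12. The home directory inode 259072 is the first inode of group 11, and the inodes flagged in phase 1 (259127 through 290557) all fall into groups 11 and 12. Together with the "Group 12 descriptor" complaint from dls, this hints that the damage may be concentrated in cylinder groups 11 and 12, though it does not prove it.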
This would imply that (not counting directories) 11 million files can be created, or are stored on this disk in total?

At this point, I decided to give up this way of "fixing"; most of the conditions seem to be well intended, and the defect on the disk must be so bad that fsck_ffs can't handle it anymore.

And now for the file system. As is already clear, the inode of the home directory is gone. So an idea would be to create a new inode, with the same name and number it should have. Good idea? No, obviously not. I tried it in two different ways, with no luck. So that seems to be insufficient. I do understand why: the inode number created would only be a kind of "link entry" inside the root directory which points to further information. But how should the new home directory entry know about its content?

From the friendly FreeBSD questions mailing list I even learned that there's no way to predict the inode numbers. If I assume a directory D with its inode number i(D), and within D a file F with its inode number i(F), I can't claim i(D) < i(F), so I can't expect any particular inode number.

I think more is needed to establish an intact directory structure than a simple "make inode with name". The directory needs to be populated correctly, but for that I would need to know which files are inside it. So it would be necessary to pick all possible inode numbers and look at what's behind them. This means I would need to "walk back" the .. paths to see which ones finally lead to the home directory, and then put the first-level directory name (or the inode number instead of the name, because the name is lost) into one of the directory slots; is that the correct term?

As far as I've learned, when "walking back" the path from a file deep within a directory structure, every inode contains a field telling "where it comes from", let's say, where CWD and .. are (as an inode number). Let's assume we're at inode 259301, referring to a file bla.txt.
Then something like this structure should exist:

    bla.txt       dingens/      foo/          poly/         /
    259301 -----> 259285 -----> 259140 -----> 259072 -----> 2

This would be /home/poly/foo/dingens/bla.txt on ad0s1f (where / is then mounted as /home).

If I can assume that every inode still knows "where it came from", what would be a useful tool to build poly/ (12345) again? I think I'll need to construct its content again, because if I just create poly/ as 12345, how would the file system know what the content of poly/ is? Is the term "directory slots" I came across related to that topic? Which sources could give good hints?

For all these considerations, I'll assume that only the inode of my home directory is gone. I can't tell for sure that this is the case; it's possible that other inodes have died, too.

In general, I think what's needed is a way to reconnect the "orphan" inodes to "normal" inodes so they can be accessed again. Because the home directory's inode is gone, any information about the names of the files and directories on its first level is gone, too. So these would not be restored with their original names, but with their inode numbers as names, just like fsck_ffs would do it with its lost+found/ mechanism. All data within the first-level directories would of course still have their names, because these inodes are present. I'm thinking about something like this:

    Formerly:

    /
      poly/
        foo/
          bar.c
        baz/
          boing/
            boo.h
            boom.h
        bla.c
        .xchat/
          xchat.conf
        .fetchmailrc

    After restore:

    /
      poly/
        #123456/
          bar.c
        #123789/
          boing/
            boo.h
            boom.h
        #124785
        #127854/
          xchat.conf
        #128745
        ^^^^^^^

There are tools that can help to "restore" the first level, for example FreeBSD's file command. There shouldn't be many files that cause a problem: file names can usually be recognized from the data the files contain (source, note, configuration file etc.), and directories can be recognized by the names of the files they contain.
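
The "walk back" idea can be sketched in a few lines of C. One caveat first: as far as the UFS on-disk format goes, only directories carry a ".." entry; the inode of a regular file such as bla.txt holds no pointer back to its directory, so files are reached through the directory that lists them. The sketch therefore works on a table of (directory inode, ".." inode) pairs, which is assumed to have been collected beforehand, for instance from the slot 1 lines of fsdb's ls output. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROOT_INO 2	/* inode number of the root directory, as fsdb shows */

/* One directory and the inode its ".." entry points to. */
struct dirlink {
	uint32_t ino;
	uint32_t parent;
};

/* Look up the ".." target of ino; returns 0 if ino is not in the table. */
static uint32_t
parent_of(const struct dirlink *tab, size_t n, uint32_t ino)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].ino == ino)
			return (tab[i].parent);
	return (0);
}

/*
 * Follow ".." upwards from start. Returns the first-level directory on the
 * path, i.e. the last inode visited before target (the lost home directory),
 * or 0 if the chain ends elsewhere or takes more than n steps (a loop).
 */
static uint32_t
first_level_under(const struct dirlink *tab, size_t n, uint32_t start,
    uint32_t target)
{
	uint32_t cur = start, up;
	size_t steps;

	assert(tab != NULL);
	for (steps = 0; steps <= n; steps++) {
		up = parent_of(tab, n, cur);
		if (up == target)
			return (cur);
		if (up == 0 || up == ROOT_INO)
			return (0);
		cur = up;
	}
	return (0);
}
```

For the example chain, a table holding dingens/ (259285, parent 259140) and foo/ (259140, parent 259072) lets a walk from 259285 towards 259072 end at 259140, so foo/ would be one of the nameless first-level entries to reattach. Running this over every directory inode that is still allocated would list all first-level candidates; regular files that sat directly in poly/ would not be found this way.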
Of course, restoring the first-level entries under their inode numbers is what would happen if fsck_ffs worked as initially intended. When I see the content, I will remember what the correct names were.

So these were my first thoughts about this problem. I hope you can help me with some ideas, concepts or suggestions, or documents or source files worth studying. I don't expect you to solve my problem, I'm not greedy. :-)

-- 
Polytropon
From Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...