From owner-freebsd-bugs Sun May 2 18:42: 8 1999 Delivered-To: freebsd-bugs@freebsd.org Received: from moth.vicor-nb.com (moth.vicor-nb.com [208.206.78.46]) by hub.freebsd.org (Postfix) with ESMTP id 7A83814C34; Sun, 2 May 1999 18:42:05 -0700 (PDT) (envelope-from milt@moth.vicor-nb.com) Received: (from milt@localhost) by moth.vicor-nb.com (8.8.5/8.8.5) id SAA11898; Sun, 2 May 1999 18:41:37 -0700 (PDT) Date: Sun, 2 May 1999 18:41:37 -0700 (PDT) From: milt Message-Id: <199905030141.SAA11898@moth.vicor-nb.com> To: freebsd-bugs@FreeBSD.org, gnats-admin@FreeBSD.org Subject: Re: kern/11226: Invalid files on disk after fsync Cc: developers@moth.vicor-nb.com, nbavdoua@vicor-nb.com Sender: owner-freebsd-bugs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org HIYA This more or less replaces report "kern/11226". We are still seeing problems on multiple hosts while running 2.2.6. I am getting tired of looking for this. This time, the description uses tools that I didn't have last time and information that I didn't have last time and it probably more believable. I don't repeat the dmesg/df stuff from the host. If you need that or any other information, please say so. We are getting tired of this little problem. I can be reached as milt@vicor-nb.com or (530) 795-1807 for my home/office or (916) 523-1144 for my pager. HELP! The sequence is: 1. We write a file to disk. 2. A short time (usually minutes) later, we read the file and process it smoothly. 3. A long time later (usually days or weeks) the file comes to our attention again. Sometuents turn out to be invalid when we look at the file even though the file contents were valid in step 2 and even though the file modification/inode dates have not changed! a. I have examined five files. For the first two I didn't get a copy of the inode. For the last three, I did get a copy of the inode. The five files come from 3 different hosts. Probably the problem is on all of 10 similar hosts on one site. (Other sites do a lower volume and either haven't hit the problem or haven't complained.) We have also spent much time over problems such as fsck failures during boots and panics due to freeing a free disk block. I think this mail describes the root problem. a. fsck consistantly makes the file part of a complaint such as: 154976 DUP I=255052 154977 DUP I=255052 154978 DUP I=255052 154979 DUP I=255052 154980 DUP I=255052 154981 DUP I=255052 154982 DUP I=255052 154983 DUP I=255052 ** Phase 1b - Rescan For More DUPS 154976 DUP I=9357 154977 DUP I=9357 154978 DUP I=9357 154979 DUP I=9357 154980 DUP I=9357 154981 DUP I=9357 154982 DUP I=9357 154983 DUP I=9357 DUP/BAD I=9357 OWNER=lbxprod MODE=100664 SIZE=153914 MTIME=Apr 29 12:38 1999 FILE=/env1/data/wf/19990429/sup/SB-3811.g1 DUP/BAD I=255052 OWNER=lbxprod MODE=100664 SIZE=109316 MTIME=Apr 30 04:35 1999 FILE=/env1/data/oldstore/19990430/data.12994/DATL-105792-07.0430.g1 (The above is extracted from an fsck run. There were other complaints from fsck, but I think they were because an earlier incident of this type left problems in the disk tables. Note that with the above error in place, deleting either of these two files will leave the other pointing to some free disk space and setting us up for some confusing times.) b. We have the inode from three incidents. All three times, it looks as if the high 16 bits of the disk inode field di_db[5] are invalid. All three times, the contents of di_db[5] are a value which fsck points out is used twice! INCIDENT 1: From //wss0b.ga/usr/env1/data/oldstore/19990430/data.12994/DATL-105792-07.0430.g1 di_db[ 0] contains 0x125D38 di_db[ 1] contains 0x125D40 di_db[ 2] contains 0x125D48 di_db[ 3] contains 0x125D50 di_db[ 4] contains 0x125D58 di_db[ 5] contains 0x 25D60 di_db[ 6] contains 0x125D68 di_db[ 7] contains 0x125D70 di_db[ 8] contains 0x125D78 di_db[ 9] contains 0x125D80 di_db[10] contains 0x125D88 di_db[11] contains 0x125D90 The file size in bytes is 109,316 and the inode data continues into di_ib. File times from the disk inode: data modified 99/04/30 4:35:35 inode changed 99/04/30 4:35:35 In this case, the address 25D60 was also used by the file: //wss0b.ga/usr/env1/data/wf/19990429/sup/SB-3811.g1 which was found to have valid contents and which had the following times in the disk inode: data modified 99/04/29 12:38:17 inode changed 99/04/29 12:38:17 Looks to me like the contents of 25D60 were not changed as DATL-105792-07.0430.g1 was written. INCIDENT 2: From //wss0e.ga/usr/env1/data/wf/19990420/check/CE-6429.g3 di_db[ 0] contains 0x2C4C88 di_db[ 1] contains 0x2C4C90 di_db[ 2] contains 0x2C4C98 di_db[ 3] contains 0x2C4CA0 di_db[ 4] contains 0x2C4CA8 di_db[ 5] contains 0x244CB0 di_db[ 6] contains 0x2C4CB8 di_db[ 7] contains 0x2C4CC0 The file size in bytes is 61,493 so di_db[7] should point to 4 fragments. File times from the disk inode: data modified 99/04/20 5:55:14 inode changed 99/04/20 5:55:14 In this case, the address 244CB0 was also used by the file //wss0e.ga/usr/env1/data/wf/19990426/sup/SE-7408.L277349.g6 which was found to have valid contents and which had the following times in the disk inode: data modified 99/04/26 7:07:48 inode changed 99/04/26 7:07:48 Looks to me like 244CB0 was a free page after CE-6429.g3 was written and the page later received data from SE-7408.L277349.g6. INCIDENT 3: From //wss0e.ga/usr/env1/data/wf/19990421/check/CE-2694.g1 di_db[ 0] contains 0x599848 di_db[ 1] contains 0x599850 di_db[ 2] contains 0x599858 di_db[ 3] contains 0x599860 di_db[ 4] contains 0x599868 di_db[ 5] contains 0x419870 di_db[ 6] contains 0x59A23D The file size in bytes is 51,626 so di_db[6] should point to 3 fragments with only 2,544 bytes in use. File times from the disk inode: data modified 99/04/21 9:18:11 inode changed 99/04/21 9:18:11 The address 419870 was also used by the file //wss0e.ga/usr//env1/oogs/BAI/tr277217 whose contents were not checked and which had the following times in the disk inode: data modified 99/03/24 21:49:59 inode changed 99/03/25 22:31:06 I betcha this is identical to incident 1, but I don't have a convenient method for validating the contents of tr27717. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-bugs" in the body of the message