Date: Sun, 2 May 1999 18:41:37 -0700 (PDT) From: milt <milt@moth.vicor-nb.com> To: freebsd-bugs@FreeBSD.org, gnats-admin@FreeBSD.org Cc: developers@moth.vicor-nb.com, nbavdoua@vicor-nb.com Subject: Re: kern/11226: Invalid files on disk after fsync Message-ID: <199905030141.SAA11898@moth.vicor-nb.com>
index | next in thread | raw e-mail
HIYA
This more or less replaces report "kern/11226". We are still seeing problems
on multiple hosts while running 2.2.6. I am getting tired of looking for
this. This time, the description uses tools that I didn't have last time and
information that I didn't have last time and it probably more believable.
I don't repeat the dmesg/df stuff from the host.
If you need that or any other information, please say so. We are getting
tired of this little problem. I can be reached as milt@vicor-nb.com or (530)
795-1807 for my home/office or (916) 523-1144 for my pager. HELP!
The sequence is:
1. We write a file to disk.
2. A short time (usually minutes) later, we read the file and process it
smoothly.
3. A long time later (usually days or weeks) the file comes to our attention
again. Sometuents turn out to be invalid when we look at the file even
though the file contents were valid in step 2 and even though the file
modification/inode dates have not changed!
a. I have examined five files. For the first two I didn't get a copy of
the inode. For the last three, I did get a copy of the inode. The
five files come from 3 different hosts. Probably the problem is on all
of 10 similar hosts on one site. (Other sites do a lower volume and
either haven't hit the problem or haven't complained.)
We have also spent much time over problems such as fsck failures during
boots and panics due to freeing a free disk block. I think this mail
describes the root problem.
a. fsck consistantly makes the file part of a complaint such as:
154976 DUP I=255052
154977 DUP I=255052
154978 DUP I=255052
154979 DUP I=255052
154980 DUP I=255052
154981 DUP I=255052
154982 DUP I=255052
154983 DUP I=255052
** Phase 1b - Rescan For More DUPS
154976 DUP I=9357
154977 DUP I=9357
154978 DUP I=9357
154979 DUP I=9357
154980 DUP I=9357
154981 DUP I=9357
154982 DUP I=9357
154983 DUP I=9357
DUP/BAD I=9357 OWNER=lbxprod MODE=100664
SIZE=153914 MTIME=Apr 29 12:38 1999
FILE=/env1/data/wf/19990429/sup/SB-3811.g1
DUP/BAD I=255052 OWNER=lbxprod MODE=100664
SIZE=109316 MTIME=Apr 30 04:35 1999
FILE=/env1/data/oldstore/19990430/data.12994/DATL-105792-07.0430.g1
(The above is extracted from an fsck run. There were other complaints from
fsck, but I think they were because an earlier incident of this type left
problems in the disk tables. Note that with the above error in place,
deleting either of these two files will leave the other pointing to some
free disk space and setting us up for some confusing times.)
b. We have the inode from three incidents. All three times, it looks as if
the high 16 bits of the disk inode field di_db[5] are invalid. All
three times, the contents of di_db[5] are a value which fsck points out
is used twice!
INCIDENT 1:
From
//wss0b.ga/usr/env1/data/oldstore/19990430/data.12994/DATL-105792-07.0430.g1
di_db[ 0] contains 0x125D38
di_db[ 1] contains 0x125D40
di_db[ 2] contains 0x125D48
di_db[ 3] contains 0x125D50
di_db[ 4] contains 0x125D58
di_db[ 5] contains 0x 25D60
di_db[ 6] contains 0x125D68
di_db[ 7] contains 0x125D70
di_db[ 8] contains 0x125D78
di_db[ 9] contains 0x125D80
di_db[10] contains 0x125D88
di_db[11] contains 0x125D90
The file size in bytes is 109,316 and the inode data continues into di_ib.
File times from the disk inode:
data modified 99/04/30 4:35:35
inode changed 99/04/30 4:35:35
In this case, the address 25D60 was also used by the file:
//wss0b.ga/usr/env1/data/wf/19990429/sup/SB-3811.g1 which was found to have
valid contents and which had the following times in the disk inode:
data modified 99/04/29 12:38:17
inode changed 99/04/29 12:38:17
Looks to me like the contents of 25D60 were not changed as
DATL-105792-07.0430.g1 was written.
INCIDENT 2:
From
//wss0e.ga/usr/env1/data/wf/19990420/check/CE-6429.g3
di_db[ 0] contains 0x2C4C88
di_db[ 1] contains 0x2C4C90
di_db[ 2] contains 0x2C4C98
di_db[ 3] contains 0x2C4CA0
di_db[ 4] contains 0x2C4CA8
di_db[ 5] contains 0x244CB0
di_db[ 6] contains 0x2C4CB8
di_db[ 7] contains 0x2C4CC0
The file size in bytes is 61,493 so di_db[7] should point to 4 fragments.
File times from the disk inode:
data modified 99/04/20 5:55:14
inode changed 99/04/20 5:55:14
In this case, the address 244CB0 was also used by the file
//wss0e.ga/usr/env1/data/wf/19990426/sup/SE-7408.L277349.g6 which was found to
have valid contents and which had the following times in the disk inode:
data modified 99/04/26 7:07:48
inode changed 99/04/26 7:07:48
Looks to me like 244CB0 was a free page after CE-6429.g3 was written and the
page later received data from SE-7408.L277349.g6.
INCIDENT 3:
From
//wss0e.ga/usr/env1/data/wf/19990421/check/CE-2694.g1
di_db[ 0] contains 0x599848
di_db[ 1] contains 0x599850
di_db[ 2] contains 0x599858
di_db[ 3] contains 0x599860
di_db[ 4] contains 0x599868
di_db[ 5] contains 0x419870
di_db[ 6] contains 0x59A23D
The file size in bytes is 51,626 so di_db[6] should point to 3 fragments with
only 2,544 bytes in use.
File times from the disk inode:
data modified 99/04/21 9:18:11
inode changed 99/04/21 9:18:11
The address 419870 was also used by the file
//wss0e.ga/usr//env1/oogs/BAI/tr277217 whose contents were not checked and
which had the following times in the disk inode:
data modified 99/03/24 21:49:59
inode changed 99/03/25 22:31:06
I betcha this is identical to incident 1, but I don't have a convenient method
for validating the contents of tr27717.
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message
help
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199905030141.SAA11898>
