From: "CK"
Date: Wed, 18 Mar 2015 19:05:38 -0800
Subject: Re: thrashing + lost files
Message-ID: <0MTkBS-1YyrZe40hM-00QVB7@mail.gmx.com>
Delivered-To: freebsd-questions@freebsd.org

> > > > The result is the loss of many critical files from a hard drive, as if
> > > > a "rm *" was done in the home directory.
> > > > This occurs after the
> > > > thrashing when Xwindow is accidentally shut down with Opera open with
> > > > many javascript page tabs, eg, being a memory pig - consuming 1/2 of
> > > > RAM (256M), which after dumping core, writes a large amount of data
> > > > (crashlog) even after Xwindow is down:
> > > >
> > > > pid 1118 (opera), uid 1001: exited on signal 11 (core dumped)
> > >
> > > I thought Opera would simply write a core dump, well, still several 100s
> > > of MB though...
> >
> > Interestingly, the core dump was deleted out of the home directory. I
> > caught a quick glimpse of it doing "ls" before it was deleted. As I said,
> > it was exactly like "rm *". Dot files were left intact.
>
> Oh, that's surprising! I also had that experience once - home directory
> empty (!) _except_ dot files (and other directories), just like "rm *" had
> been issued... very strange...

Yes, that is interesting. Does not seem like "coincidence".

> > At first, I thought it was a bug with journaling/soft-updates, so I
> > disabled those things with tunefs (to the best of my memory). But now it
> > has happened again.
>
> I can't imagine it has to do with that. Massive file loss can appear when a
> directory inode has been damaged. Then fsck will remove the directory
> altogether. But it's possible to rescue the files' _content_, as those are
> written with their (orphan) inode number to lost+found/. So their names are
> lost, but their content will be kept.

I turned off journaling and soft-updates because the first time this
problem occurred, it deleted the files in my home directory, as well as
user-owned files in /home/tmp and /home/user/subdir's that were recently
created, so I thought maybe they weren't being flushed out to the disk;
eg, getting stuck in some journaling/soft-updates buffer.

> > The drive was being written to for about 1 minute by the Opera
> > crashlog/coredump. About 45 seconds after Xwindow was already down.
>
> Such kind of crash indicates a significant problem. Are you
> sure the drives are fully intact? Check with "smartctl -a" just
> to be sure. And even if it sounds stupid: check the cables.

smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.2-RELEASE i386] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar WDxxxAB
Device Model:     WDC WD400AB-22CDB0
Serial Number:    WD-WMA9T1222658
Firmware Version: 22.04A22
User Capacity:    40,020,664,320 bytes [40.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-5 (minor revision not indicated)
Local Time is:    Wed Mar 18 17:40:59 2015 AKDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (2376) seconds.
Offline data collection
capabilities:                    (0x3b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  42) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   102   099   021    Pre-fail  Always       -       3975
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12324
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.2-RELEASE i386] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar AC
Device Model:     WDC AC24300L
Serial Number:    WD-WT4111658721
Firmware Version: 14.10R11
User Capacity:    4,311,982,080 bytes [4.31 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-4 (minor revision not indicated)
Local Time is:    Wed Mar 18 17:41:00 2015 AKDT
SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 85-87 don't show if SMART is enabled.
                  Checking to be sure by trying SMART RETURN STATUS command.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Total time to complete Offline
data collection:                 (1280) seconds.
Offline data collection
capabilities:                    (0x03) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        No Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        No General Purpose Logging support.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   099   099   040    Old_age   Always       -       1545
  5 Reallocated_Sector_Ct   0x0013   200   200   001    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       51375
200 Multi_Zone_Error_Rate   0x0009   100   253   051    Pre-fail  Offline      -       0

SMART Error Log not supported
SMART Self-test Log not supported
Selective Self-tests/Logging not supported

> > > > FSCK RESULTS:
> > > > ------------
> > > > Of interest, is that each time fsck was run, more files were lost!
> > > >
> > > > # fsck -t ufs -p /dev/ada0p6.eli
> > > > /dev/ada0p6.eli: NO WRITE ACCESS
> > > > /dev/ada0p6.eli: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
> > >
> > > This message should alert you. Don't just preen the disk.
> > > In this mode, only a subset of errors will be detected,
> > > and not all of them can be corrected. You should actually
> > > perform
> > >
> > > # fsck -t ufs -f /dev/ada0p6.eli
> >
> > Thanks, I didn't think of using the -f option.
>
> The -f option *f*orces a *f*ull check. You can even run the command two
> times. The 2nd run should then reveal "no errors", the file system is kept
> marked clean.
>
> > After reading a paper by Marshall McKusick on fsck, it was my
> > understanding that "preen mode" only fixed errors that could be fixed with
> > 100% accuracy.
>
> I also read that famous paper to gain a better understanding of how UFS
> works and what fsck does. Data loss teaches you a lot of fundamental
> knowledge. :-)
>
> > > There are several errors shown:
> > >
> > > > INCORRECT BLOCK COUNT I=2327435 (8 should be 0)
> > > > [...]
> > > > UNREF FILE I=2327428 OWNER=abc MODE=100600
> > > > [...]
> > > > UNREF FILE I=2327439 OWNER=abc MODE=100600
> > > > [...]
> > > > FREE BLK COUNT(S) WRONG IN SUPERBLK
> > > > [...]
> > > > SUMMARY INFORMATION BAD
> > > > [...]
> > > > BLK(S) MISSING IN BIT MAPS
> >
> > I lost about 8 files, a lot of legal research/work, in case that is what
> > the (8 should be 0) is citing.
>
> The question is: Is the data still there? Just because the file is gone -
> the inode entry -, this does not have to imply that the data isn't still on
> the disk. Everything is on the disk as long as it hasn't been overwritten.
>
> When I found out that one of my files (which I worked a whole day on) was
> gone (0 bytes) after a freeze + reboot + fsck, I immediately forced a r/o
> mount on the /home partition and grepped for some text fragment I could
> remember. I found the block where it was in, dumped that block, and trimmed
> it to become the original file again. The data wasn't lost, it was fully
> intact. But not referenced (!) anymore.
>
> > > Unmount the partition, let fsck do its job.
> > > :-)
> >
> > fsck -t ufs -f /dev/ada0p6.eli only reported that everything was clean.
>
> So at _this_ point in time the file system was consistent. Do you maybe have
> background_fsck="YES" in /etc/rc.conf? Set it to ="NO". Always perform file
> system checks _prior_ to accessing a file system r/o or even r/w. This may
> take some time, but you have to find a relation of time vs. data that
> reflects your priorities. :-)

No, I do not have background fsck's - and I never rebooted. I always run
fsck before mounting a file-system. I have my own /etc/rc that is
self-contained with a few lines to bring the system up, and nothing more,
very much minimalist, essentially:

/sbin/geli onetime -d -e 3des -s 4096 /dev/ada0p3
/sbin/swapon /dev/ada0p3.eli
/sbin/fsck -y
/sbin/mount -a
/bin/rm -rf /var/run/* /var/spool/lock/*
umask 0077    # rw- --- ---
/bin/hostname localhost...
/sbin/ldconfig /usr/lib /usr/local/lib /usr/X11/lib
/sbin/ifconfig lo0 127.0.0.1

> > > Copy files to a different disk (or maybe even external storage, such as
> > > USB sticks) temporarily, just to be sure.
> >
> > Yes, I do this of course, with a USB SDRAM device. But I still lose days
> > of work, because I can't back up every minute.
>
> You could automate this - but on the other hand, when a crash appears, this
> might also affect the backup process and its results.
>
> > This should not happen at all.
>
> Yes, it sounds too unusual.
>
> > I have used FreeBSD for 20 years, since 1995, and I never had problems
> > like this before - and I have the same hardware since 2003, which I ran
> > FreeBSD 4.11 on until recently. But only now does this problem occur.
> > Certainly, there is a bug somewhere. My gut feeling is that something is
> > allowing Opera to do things it should not do, or something in the
> > filesystem layers is breaking under the stress of Opera's crash dumps.
>
> I'd think it's somewhere filesystem-related. I have tortured Opera with
> approx.
> 100 tabs open with "Flash" content and JS stuff in it. No crash, it
> just started swapping heavily. Sometimes I can get Opera to crash, but it
> successfully "resumes". However, when my system freezes (due to a faulty
> GPU) and Opera has been running, sometimes the bookmarks are lost. That's
> why I tend to copy them to ~/ from time to time, just to be sure. In a few
> cases, the Opera settings also are reset. A copy of ~/.opera is helpful.
> Maybe it's just program design that got worse, like first reading a file
> into memory, then keeping that file open, maybe modify it, or not, and upon
> program exit, write memory content back to the file. When the normal program
> termination is not reached, a damaged or empty file is left behind. I have
> no idea what makes people write software that way, but it seems to be
> "modern" now...

We're in the 2nd half of the "peak usury-based-civilization" bell curve :)

I was 30 when I started using FreeBSD, now 50. Of all things in life,
FreeBSD was likely the greatest pleasure and best experience. I could
easily enjoy doing much more with it for 100s of years, and it's been one
of the few social circles where I've met people that I admire and respect
for the development of their faculties and all-around
good+honest+intelligent nature.

> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...
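
A note on the recovery approach described earlier in this thread (mounting
read-only, searching the raw partition for a remembered text fragment, then
dumping the block around it): below is a minimal sketch of that search in
Python. It is not the exact procedure used in the thread. The function names
find_fragment and dump_region, the chunk size, and the idea of running it
against an image copy of the partition rather than the live device are all
assumptions made for illustration.

```python
def find_fragment(path, needle, chunk_size=1 << 20):
    """Yield the byte offset of every occurrence of `needle` in `path`.

    The file is read in chunks; the last len(needle) - 1 bytes of each
    round are kept so that a fragment spanning two chunks is still found.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1
    with open(path, "rb") as f:
        base = 0      # file offset of buf[0]
        buf = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            start = 0
            while True:
                idx = buf.find(needle, start)
                if idx < 0:
                    break
                yield base + idx
                start = idx + 1
            # Keep only a tail shorter than the needle, so no match
            # can be reported twice.
            keep = buf[max(0, len(buf) - overlap):]
            base += len(buf) - len(keep)
            buf = keep


def dump_region(path, offset, before, after, out_path):
    """Copy the bytes from `offset - before` to `offset + after` into
    `out_path` and return the offset at which the copy started."""
    start = max(0, offset - before)
    with open(path, "rb") as src, open(out_path, "wb") as dst:
        src.seek(start)
        dst.write(src.read(offset - start + after))
    return start
```

For example, list(find_fragment("home.img", b"some remembered phrase")) would
give candidate byte offsets, and dump_region could then write the bytes
around one of them to a file on a different disk, to be trimmed by hand.
"home.img" is a hypothetical image file name; writing the output anywhere on
the damaged file system would risk overwriting the very data being searched
for.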