Date: Tue, 5 Mar 2013 23:15:13 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <201303060715.r267FDHS015118@gw.catspoiler.org>
In-Reply-To: <612776324.20130301152756@serebryakov.spb.ru>
On 1 Mar, Lev Serebryakov wrote:
> Hello, Ivan.
> You wrote on 28 February 2013, at 21:01:46:
>
>>> Once, Kirk said that delayed writes are OK for SU as long as the
>>> bottom layer doesn't lie about operation completeness. geom_raid5
>>> may delay writes (in the hope that subsequent writes will combine
>>> nicely, avoiding the read-calculate-write cycle), but it never
>>> marks a BIO complete until it really is completed (the layers below
>>> geom_raid5 return completion). So, every BIO in the wait queue is
>>> "in flight" from the GEOM/VFS point of view. Maybe that is fatal
>>> for the journal :(
>
> IV> It shouldn't be - it could be a bug.
>
> I understand that it proves nothing, but I've tried to reproduce the
> "previous crash corrupts the FS in a journal-undetectable way" theory
> by killing a virtual system during massive writes to a
> geom_raid5-based FS (on virtual drives, unfortunately). I've done 15
> tries (as it is manual testing, it takes about 1-1.5 hours total),
> and every time the FS was OK after a double fsck (first with the
> journal, then again without it). Of course, there was MASSIVE loss of
> data, as the timeout and cache size in geom_raid5 were set very high
> (sometimes the FS turned out to be empty after unpacking 50% of an
> SVN mirror seed, a crash, and a check), but the FS was consistent
> every time!

Did you have any power failures that took down the system sometime
before this panic occurred?

By default FreeBSD enables write caching on ATA drives:

    kern.cam.ada.write_cache: 1
    kern.cam.ada.0.write_cache: -1    (-1 => use the system default)

That means that the drive will immediately acknowledge writes and is
free to reorder them as it pleases.

When UFS+SU allocates a new inode, it first clears the available bit
in the bitmap and writes the bitmap block to disk before it writes the
new inode contents to disk. When a file is deleted, the inode is
zeroed on disk before the available bit is set in the bitmap and the
bitmap block is written. That means that if an inode is marked as
available in the bitmap, then it should be zero. The panic that you
experienced happened when the system was attempting to allocate an
inode for a new file: when it peeked at an inode that was marked as
available, it found that the inode was non-zero.

What might have happened is that sometime in the past the system was
in the process of creating a new file when a power failure occurred.
It found an available inode, marked it as unavailable in the bitmap,
and wrote the bitmap block to the drive. Because write caching was
enabled, the bitmap block was cached in the drive's write cache, and
the drive claimed that the write was complete. After getting this
response, UFS+SU wrote the new inode contents to the drive, which the
drive also cached. The drive then committed the inode contents to the
media, and at that point the power failed, losing the contents of the
write cache before the bitmap block had been written out. When the
system was powered up again, fsck just replayed the journal because
you were using SU+J, and it did not detect the inconsistency between
the bitmap and the actual inode contents (detecting that would require
a full fsck). This damage could remain latent for quite some time, and
wouldn't be found until the filesystem tried to allocate the inode in
question.
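For reference, here is a minimal userland sketch for reading the two
sysctls quoted above programmatically. It uses sysctlbyname(3); the
error handling is deliberately bare:

/* wcache.c: print the ATA write-cache sysctls discussed above.
   Build on FreeBSD with: cc -o wcache wcache.c */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

static void
show(const char *name)
{
	int val;
	size_t len = sizeof(val);

	/* sysctlbyname(3) returns 0 on success. */
	if (sysctlbyname(name, &val, &len, NULL, 0) == 0)
		printf("%s: %d\n", name, val);
	else
		printf("%s: not present\n", name);
}

int
main(void)
{
	show("kern.cam.ada.write_cache");	/* system-wide default */
	show("kern.cam.ada.0.write_cache");	/* per drive; -1 => default */
	return (0);
}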
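The ordering rules described above are the key invariant. The
following is a simplified toy model of them, not the real FFS code:
every structure and helper here is a hypothetical stand-in. The point
is that write_sync() must not return until the block is durable, which
is exactly the promise a drive with write caching enabled breaks:

/* Toy model of the soft-updates ordering invariant for inodes.
   All types and helpers are hypothetical stand-ins, not FFS code. */
#include <stdio.h>
#include <string.h>

struct bitmap { unsigned char avail[32]; };	/* 1 bit per inode */
struct dinode { int mode; int size; };		/* toy on-disk inode */

/* Stand-in for a synchronous disk write: it must not return until the
   block is durable.  A drive with write caching enabled acknowledges
   early, silently turning this into an asynchronous write. */
static void
write_sync(const char *what)
{
	printf("write %s to disk, wait for completion\n", what);
}

/* Allocation: the bitmap reaches disk BEFORE the inode contents. */
static void
alloc_inode(struct bitmap *bm, struct dinode *dp, int ino)
{
	if (dp->mode != 0) {
		/* An "available" inode is non-zero: the on-disk state
		   is inconsistent; the real kernel panics here. */
		printf("dup alloc: inode %d\n", ino);
		return;
	}
	bm->avail[ino / 8] &= ~(1 << (ino % 8));  /* mark in use */
	write_sync("bitmap block");		  /* step 1 */
	dp->mode = 0100644;			  /* fill in new inode */
	write_sync("inode block");		  /* step 2 */
}

/* Deletion: the inode is zeroed on disk BEFORE the bitmap frees it. */
static void
free_inode(struct bitmap *bm, struct dinode *dp, int ino)
{
	memset(dp, 0, sizeof(*dp));
	write_sync("inode block");		  /* step 1 */
	bm->avail[ino / 8] |= 1 << (ino % 8);	  /* mark available */
	write_sync("bitmap block");		  /* step 2 */
}

int
main(void)
{
	struct bitmap bm;
	struct dinode di = { 0, 0 };

	memset(bm.avail, 0xff, sizeof(bm.avail)); /* all inodes free */
	alloc_inode(&bm, &di, 5);
	free_inode(&bm, &di, 5);
	return (0);
}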
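If the drive's write cache turns out to be the problem, the usual
workaround is to disable it, at a noticeable performance cost.
Assuming the standard ada(4) tunable, something like this in
/boot/loader.conf should do it; check ada(4) on your release:

# /boot/loader.conf
# Disable the write cache on all ada(4) drives so that writes are
# not acknowledged until they reach stable storage.
kern.cam.ada.write_cache="0"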
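Finally, since journal replay is exactly what let this inconsistency
slip past, it can be worth forcing the traditional full check from
time to time. With SU+J, fsck normally just runs journal recovery; my
understanding is that -f forces the full pass. The device name below
is only an example:

# Force a full, non-journaled check (the filesystem must be unmounted
# or mounted read-only); the device name is an example.
fsck -t ufs -f /dev/ada0p2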