Date:        Sun, 16 Jul 2017 17:52:39 +0200
From:        Andreas Longwitz <longwitz@incore.de>
To:          Kirk McKusick <mckusick@mckusick.com>
Cc:          freebsd-fs@freebsd.org
Subject:     Re: ufs snapshot is sometimes corrupt on gjourneled partition
Message-ID:  <596B8BC7.3030505@incore.de>
In-Reply-To: <594FB9E5.3060806@incore.de>
References:  <201706242203.v5OM3U2Q096283@chez.mckusick.com> <594FB9E5.3060806@incore.de>

Hello,

I would like to discuss the following code snippet from the function
indiracct_ufs2() in the source file ffs_snapshot.c:

	/*
	 * We have to expand bread here since it will deadlock looking
	 * up the block number for any blocks that are not in the cache.
	 */
	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
	bp->b_blkno = fsbtodb(fs, blkno);
	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
		brelse(bp);
		return (error);
	}

In all my tests (all mount types, good or bad result) the flags B_DONE
and B_DELWRI in bp->b_flags were always zero, so the data buffer given
by getblk() at bp->b_data is always overwritten with the data from the
explicit I/O performed by readblock(). Incidentally, the flag B_CACHE
was always set.

I saved the buffer returned by getblk() and compared it with the data
read by readblock(). Without gjournal there is never a difference. On a
gjournaled partition things are more involved. Taking a snapshot in my
example environment needs 678 calls to getblk(). In the "good case"
every I/O performed by readblock() returns the same data that getblk()
already had; in the "bad case" there are differences between the data
in the buffer cache seen by getblk() and the data on disk seen by
readblock(). Mostly two or three of the 678 blocks differ. In those
cases the block read by readblock() is always zero, while the block
read by getblk() holds a lot of BLK_NOCOPY values, sometimes 4096 of
them (a full block).

There is one other flag in bp->b_flags given by getblk() that is of
interest: B_CLUSTEROK. This flag is always in sync with the flag
BV_SCANNED in bp->b_vflags. In the above code snippet I can check this
flag too before going to disk:

	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CLUSTEROK)) == 0 && ...

With this patch the problem is gone in all my tests. The resulting
snapblklist written to the end of the snapshot file is always the same.

I am not familiar with the buffer and memory management of FreeBSD, so
I cannot explain what is going on. But I see that in cgaccount() a lot
of BLK_NOCOPY values are set and that in expunge_ufs2() these values
should be seen again. It seems they are in the buffer cache but
sometimes not on disk. In my test example the patch given above reads
data from the buffer cache instead of from disk 141 times.

All sync operations in ffs_snapshot() use the parameter MNT_WAIT; the
gjournal switcher runs all the time but syncs the same partition with
MNT_NOWAIT (g_journal.c: vfs_msync followed by VFS_SYNC). I do not know
whether this can lead to a race condition between gjournal and
snapshot.

>> Kirk McKusick wrote:
>>> Date: Sat, 24 Jun 2017 18:30:17 +0300
>>> From: Konstantin Belousov <kostikbel@gmail.com>
>>> To: Andreas Longwitz <longwitz@incore.de>
>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>
>>> On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
>>>> I try to understand the cause for the "free inode" problem described in
>>>> https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html
>>>>
>>>> I have set up a test server (FreeBSD 10.3-STABLE #4 r317936) with a
>>>> gjournaled partition for /home:
>>>>   mount -t ufs | grep /home -->
>>>>   /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
>>>>   noatime, gjournal)
>>> As the first thing to try, if you perform your tests on the raw
>>> partition without gjournal, does the problem stay around ?
>> I concur that running your test without gjournal is the next test to try.
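
To make the comparison mentioned above concrete: a check of that kind
can be wired into the quoted snippet roughly as follows (only a sketch
of the idea; the variable name "copy" is made up and this is not meant
as the exact instrumentation I ran):

	char *copy;
	...
	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
	bp->b_blkno = fsbtodb(fs, blkno);
	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0) {
		/* remember what the buffer cache handed back */
		copy = malloc(fs->fs_bsize, M_TEMP, M_WAITOK);
		bcopy(bp->b_data, copy, fs->fs_bsize);
		if ((error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
			free(copy, M_TEMP);
			brelse(bp);
			return (error);
		}
		/* compare cached contents with what is actually on disk */
		if (bcmp(copy, bp->b_data, fs->fs_bsize) != 0)
			printf("indiracct_ufs2: cache/disk mismatch, lbn %jd\n",
			    (intmax_t)lbn);
		free(copy, M_TEMP);
	}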
>> I think that your suspicion that there is a race condition with gjournal
>> is likely correct. And if that is true, then the problem will go away
>> when gjournal is taken out of the stack.
>
> The problem only occurs when gjournal is active (tunefs -J), independent
> of the three possible mount options I have tested: async, noasync, sync.
> For my tests I prefer the async mount option, as recommended in
> gjournal(8). Using soft updates instead of gjournal is ok in all cases.
>
>>>> My test creates one snapshot of /home (it always gets inode 4) and
>>>> removes this snapshot:
>>>>
>>>>   for i in 1 2 3 4 5 6 7 8; do
>>>>     echo "starting snaptest $i" >/dev/console
>>>>     mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>>>     echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>>>     rm -f /home/.snap/fscktest
>>>>   done
>>>>
>>>> I never have more than this one snapshot at work, and during the test
>>>> I never have any other user processes working on /home. A typical
>>>> output looks like this:
>>>>
>>>> Jun 21 15:59:52 root: starting snaptest 1
>>>> Jun 21 15:59:52 root: 4 26592 -r-------- 1 root operator 90762970240
>>>>   21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:53 root: starting snaptest 2
>>>> Jun 21 15:59:53 root: 4 26592 -r-------- 1 root operator 90762970152
>>>>   21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
>>>> Jun 21 15:59:54 root: starting snaptest 3
>>>> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
>>>> Jun 21 15:59:54 root: 4 26592 -r-------- 1 root operator 90762969976
>>>>   21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
>>>> Jun 21 15:59:56 root: starting snaptest 4
>>>> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
>>>> Jun 21 15:59:56 root: 4 26592 -r-------- 1 root operator 90762970240
>>>>   21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:57 root: starting snaptest 5
>>>> Jun 21 15:59:57 root: 4 26592 -r-------- 1 root operator 90762970240
>>>>   21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:58 root: starting snaptest 6
>>>> Jun 21 15:59:58 root: 4 26592 -r-------- 1 root operator 90762970216
>>>>   21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
>>>> Jun 21 15:59:59 root: starting snaptest 7
>>>> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
>>>> Jun 21 15:59:59 root: 4 26592 -r-------- 1 root operator 90762970240
>>>>   21 Jun 16:00 /home/.snap/fscktest
>>>> Jun 21 16:00:00 root: starting snaptest 8
>>>> Jun 21 16:00:00 root: 4 26592 -r-------- 1 root operator 90762970240
>>>>   21 Jun 16:00 /home/.snap/fscktest
>>>>
>>>> The "free inode /home/4 had NNN blocks" message during the run of the
>>>> mount command is output of ffs_valloc(), because ffs_load_inode() has
>>>> loaded the disk inode 4 with a non-zero i_blocks field.
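
For readers without the source at hand: this message comes from a
consistency check in ffs_valloc() that looks roughly like the following
(quoted from memory, so treat it as a sketch rather than the exact code
in ffs_alloc.c):

	/*
	 * A freshly allocated ("free") inode should not account any
	 * blocks; if it does, complain and clear the stale count.
	 */
	if (DIP(ip, i_blocks) && (fs->fs_flags & FS_UNCLEAN) == 0) {
		printf("free inode %s/%lu had %ld blocks\n",
		    fs->fs_fsmnt, (u_long)ino, (long)DIP(ip, i_blocks));
		DIP_SET(ip, i_blocks, 0);
	}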
>>>> The corresponding "freeing inode /home/4 with NNN blocks" message
>>>> during the previous rm command is output of my following diagnostic
>>>> patch in function ffs_truncate():
>>>>
>>>> --- ffs_inode.c.1st	2016-06-08 17:25:21.000000000 +0200
>>>> +++ ffs_inode.c	2017-06-19 10:02:07.145360000 +0200
>>>> @@ -551,6 +551,9 @@
>>>>  		DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>>>>  	else	/* sanity */
>>>>  		DIP_SET(ip, i_blocks, 0);
>>>> +	if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
>>>> +		printf("freeing inode %s/%lu with %ld blocks\n",
>>>> +		    fs->fs_fsmnt, (u_long)ip->i_number, (long)DIP(ip, i_blocks));
>>>>  	ip->i_flag |= IN_CHANGE;
>>>>  #ifdef QUOTA
>>>>  	(void) chkdq(ip, -blocksreleased, NOCRED, 0);
>>>>
>>>> The rm command can only free all blocks of the snapshot file (meaning
>>>> i_blocks for inode 4 ends up zero) if this file has the "correct" size:
>>>>
>>>>   ls -ils /home/.snap/fscktest -->
>>>>   4 53184 -r-------- 1 root operator 90762970240 Jun 17 06:15
>>>>   /home/.snap/fscktest
>>>>
>>>> The size of the /home partition is given by
>>>>   diskinfo /dev/mirror/gmsvt7p10.journal -->
>>>>   /dev/mirror/gmsvt7p10.journal 512 90762954240 177271395
>>>>
>>>> So we have 2769865 full 32kB blocks with size 90762936320. During
>>>> creation of a snapshot a "last block" (32kB) is written at this
>>>> offset, ending at 90762969088. Finally the snapblklist is written
>>>> with VOP_WRITE: "Write out the list of allocated blocks to the end
>>>> of the snapshot". In all my correct tests the table snapblklist is
>>>> 1152 bytes in size, giving the correct size of the snapshot file:
>>>> 90762970240. In this case the table snapblklist has 144 entries of
>>>> length 8: one length entry and 143 logical block numbers recorded in
>>>> mapacct_ufs2():
>>>>
>>>> 	if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>>>> 		*ip->i_snapblklist++ = lblkno;
>>>>
>>>> The console output above shows three error situations with block
>>>> counters 704, 2112 and 192. Dividing these values by 8 gives exactly
>>>> the number of bytes by which the snapblklist at the end of the
>>>> snapshot file is shortened, so in these cases the snapshot file is
>>>> corrupt.
>>> I am not sure what you mean by 'match' there. Could you explicitly
>>> mention what relations between the snapshot size and the leaked
>>> blocks of the free snapshot inode you noted ?
>> I too am confused here. Are you saying for example that 192 / 8 == 24
>> and that the snapblklist is short by 24 blocks? Because from the table
>> below, it appears to be short by only 3 blocks.
>
> In this example the snapblklist is short by 24 bytes (= 3 table
> entries) and 24/8 = 3 calls to ffs_blkfree() are missing.
>
> Ok, I give one more table to explain a little better. The meaning of
> the columns:
>
> #1 number of blocks in the "free inode" kernel message
> #2 blocks from #1 divided by 8 (= fs_frag)
> #3 size of the snapblklist written at the end of the snapshot file
> #4 size of the snapshot file shown by ls: 907629...
> #5 entries in the table snapblklist; we have #5 = #3 / 8, because each
>    entry in the table is of type ufs2_daddr_t
> #6 counter ct_blkfree, that is the number of calls to ffs_blkfree()
> #7 missing table entries in snapblklist and missing calls of ffs_blkfree()
>
> test    #1   #2    #3       #4  #5   #6  #7
> -----------------------------------------------
> 1 ok     0    0  1152  ..70240  144  821   0
> 2 bad  704   88  1064  ..70152  133  810  11
> 3 bad 2112  264   888  ..69976  111  788  33
> 6 bad  192   24  1128  ..70216  141  818   3
>
> The difference between #6 and #5 is explained by the variable acctit,
> which is set only for lblkno != -1.
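
As a cross-check of column #4: the expected size of an intact snapshot
file follows directly from the partition size. A small userland sketch
of the arithmetic (the numbers are taken from my test setup above;
nothing here is specific to the kernel code):

	#include <stdio.h>
	#include <stdint.h>

	int
	main(void)
	{
		uint64_t psize = 90762954240ULL;  /* diskinfo: partition bytes */
		uint64_t bsize = 32768;           /* fs_bsize                  */
		uint64_t nfull = psize / bsize;   /* full 32kB blocks: 2769865 */
		uint64_t last  = nfull * bsize + bsize; /* end of extra last block */
		uint64_t list  = 144 * 8;         /* snapblklist: 144 entries  */

		/* 90762936320 + 32768 + 1152 = 90762970240, the "correct" size */
		printf("full blocks end at %ju\n", (uintmax_t)(nfull * bsize));
		printf("last block ends at %ju\n", (uintmax_t)last);
		printf("expected file size %ju\n", (uintmax_t)(last + list));
		return (0);
	}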
> In all my tests (ok and bad) the function mapacct_ufs2() is called
> 680 times; the first calls look like this (diffblkp = lastblkp - oldblkp):
>
> mapacct_ufs2:entry /home, inum=4, diffblkp=0xc,    lblkno=0
> mapacct_ufs2:entry /home, inum=4, diffblkp=0x3,    lblkno=-1
> mapacct_ufs2:entry /home, inum=4, diffblkp=0x1000, lblkno=12
> mapacct_ufs2:entry /home, inum=4, diffblkp=0x2a4,  lblkno=-1
> mapacct_ufs2:entry /home, inum=4, diffblkp=0x1000, lblkno=4108
> mapacct_ufs2:entry /home, inum=4, diffblkp=0x1000, lblkno=8204
> mapacct_ufs2:entry /home, inum=4, diffblkp=0x1000, lblkno=12300
> ....
>
> There are exactly two calls with lblkno=-1; together they give
> 0x2a7 - 2 = 677 runs of the for loop in which #5 is skipped.
>
>>>> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>>>>
>>>> 	++ct_blkno_all;
>>>> 	if (blkno == 0)
>>>> 		++ct_blkno_0;
>>>> 	if (blkno == BLK_NOCOPY)
>>>> 		++ct_blkno_nocopy;
>>>> 	if (blkno == BLK_SNAP)
>>>> 		++ct_blkno_snap;
>>>> 	if (blkno == 0 || blkno == BLK_NOCOPY)
>>>> 		continue;
>>>> 	if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>>>> 		*ip->i_snapblklist++ = lblkno;
>>>> 		++ct_snapblklist;
>>>> 	}
>>>> 	if (blkno == BLK_SNAP)
>>>> 		blkno = blkstofrags(fs, lblkno);
>>>> 	++ct_blkfree;
>>>> 	ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>>>> 	    vp->v_type, NULL);
>>>>
>>>> and for the 8 test runs shown above I can see these results using
>>>> DTrace at probe expunge_ufs2:return (blkno_snap is always 0):
>>>>
>>>> test  blkno_all  blkno_0  blkno_nocopy  snapblklist  blkfree  cg_nocopy
>>>> -----------------------------------------------------------------------
>>>> 1 ok    2770545   353320       2416404          143      821    2416404
>>>> 2 bad   2770545   587860       2181875          132      810    2416404
>>>> 3 bad   2770545   956582       1813175          110      788    2416393
>>>> 4 ok    2770545   353364       2416360          143      821    2416360
>>>> 5 ok    2770545   353364       2416360          143      821    2416360
>>>> 6 bad   2770545   418376       2351351          140      818    2416360
>>>> 7 ok    2770545   353367       2416357          143      821    2416357
>>>> 8 ok    2770545   353367       2416357          143      821    2416357
>>>>
>>>> For correct tests the sum of blkno_0 and blkno_nocopy is always the
>>>> same (2769724); for bad tests the counter blkno_nocopy in particular
>>>> is significantly lower. In the test table I give one more column,
>>>> cg_nocopy, for a counter I have added in cgaccount() to see how many
>>>> entries are set to BLK_NOCOPY during the copy of the cylinder group
>>>> maps:
>>>>
>>>> 	if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>>> 		++ct_cg_nocopy;
>>>> 		DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>>>> 	}
>>>> 	...
>>>> 	if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>>> 		++ct_cg_nocopy;
>>>> 		((ufs2_daddr_t *)(ibp->b_data))[indiroff] = BLK_NOCOPY;
>>>> 	}
>>>>
>>>> For correct tests all the BLK_NOCOPY's which are set in cgaccount()
>>>> can later be seen in mapacct_ufs2(); for bad tests many of the
>>>> BLK_NOCOPY's have changed to 0.
>>>>
>>>> It looks like the rm command removing the previous snapshot in some
>>>> way runs "in the background" simultaneously with expunge_ufs2() and
>>>> changes some of the BLK_NOCOPY's to zero. So this may be a buffer
>>>> management problem which only exists on gjournaled partitions; maybe
>>>> getblk/readblock as used in indiracct_ufs2() is not compatible with
>>>> gjournal in the special case of creating or removing a snapshot. A
>>>> hint in this direction is the fact that the first test after
>>>> cleaning the partition with
>>>>   umount /home; fsck -y /home; mount /home
>>>> always succeeds.
>>>> The following modified test procedure never fails:
>>>>
>>>>   for i in 1 2 3 4 5 6 7 8; do
>>>>     echo "starting snaptest $i" >/dev/console
>>>>     mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>>>     echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>>>     rm -f /home/.snap/fscktest /home
>>>>     umount /home
>>>>     mount /home
>>>>   done
>>> After the allocations of the required blocks for the snapshot inode
>>> are finished, the filesystem is suspended. You can see the call to
>>> vfs_write_suspend() in ffs_snapshot() where the suspension is
>>> enforced. As part of the suspension, all soft-update workitems are
>>> flushed; this is done by vfs_write_suspend() calling
>>> VFS_SYNC(MNT_SUSPEND).
>>>
>>> UFS classifies writers into primary and secondary. Primary are mostly
>>> the writes initiated by the top-level syscall entries, like direct
>>> calls to write(2) or metadata-changing ops mkdir(), create() and so
>>> on. Secondary are writes performed when the system initiates metadata
>>> updates during inactivation, quota updates, softdep background
>>> processing and similar. Primary modifications are blocked outright on
>>> suspension, while secondary ones are waited on to finish in the
>>> mentioned VFS_SYNC(MNT_SUSPEND) call.
>>>
>>> If you can provide proof that some SU-related activity happens after
>>> the suspension is established, this would be interesting to see.
>>> It might be something different and much simpler, but I do not see
>>> an obvious mistake in the code after reading your observations.
>> The mount information at the top shows:
>>
>> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local, noatime, gjournal)
>>
>> Thus no soft updates are being used. We are running on just a basic
>> UFS filesystem. So, soft update processing has no relevance here.
>>
>>>> Another proof that the snapshot file is corrupt when the snapblklist
>>>> is shortened is the fact that the rm command sporadically panics in
>>>> a kernel routine that is known to be correct:
>>>>
>>>> Unread portion of the kernel message buffer:
>>>> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
>>>> panic: ffs_blkfree_cg: freeing free block
>>>> cpuid = 1
>>>> KDB: stack backtrace:
>>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0857e3b1c0
>>>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
>>>> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
>>>> panic() at panic+0x43/frame 0xfffffe0857e3b310
>>>> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
>>>> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
>>>> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
>>>> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
>>>> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
>>>> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
>>>> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
>>>> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
>>>> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
>>>> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
>>>> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
>>>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
>>>> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a,
>>>> rsp = 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>>>>
>>>> Any hints for solving this problem are welcome.
>> Per the suggestion at the top, I recommend trying your test without
>> gjournal present to see if the bug goes away. If that is true, then
>> we can focus our attention on a possible race during rm in the
>> gjournal code.
>>
>> Kirk McKusick

> I have done as explained above. Some more hints may help for debugging
> this.
>
> 1. Output of dumpfs for the partition:
>
> magic   19540119 (UFS2) time    Sun Jun 25 03:46:09 2017
> superblock location     65536   id      [ 561bcaa3 624b156b ]
> ncg     139     size    22158924        blocks  21459451
> bsize   32768   shift   15      mask    0xffff8000
> fsize   4096    shift   12      mask    0xfffff000
> frag    8       shift   3       fsbtodb 3
> minfree 8%      optim   time    symlinklen 120
> maxbsize 32768  maxbpg  4096    maxcontig 4     contigsumsize 4
> nbfree  2413023 ndir    229     nifree  11153267        nffree  274
> bpg     20035   fpg     160280  ipg     80256   unrefs  0
> nindir  4096    inopb   128     maxfilesize     2252349704110079
> sbsize  4096    cgsize  32768   csaddr  5056    cssize  4096
> sblkno  24      cblkno  32      iblkno  40      dblkno  5056
> cgrotor 123     fmod    0       ronly   0       clean   0
> metaspace 6408  avgfpdir 64     avgfilesize 16384
> flags   gjournal
> fsmnt   /home
> volname         swuid   0       providersize    22158924
>
> 2. I am quite sure that the process g_journal_switcher does not break
> things during the run of "mount -o snapshot". I have several DTrace
> outputs of the problem in which g_journal_switcher sleeps all the time.
>
> 3. The probe expunge_ufs2:return is the first of the several DTrace
> probes I have tried that shows a difference between good and bad runs.
>
> 4. The problem has occurred for many years on several production
> servers and partitions where only one snapshot is taken every day, but
> not on all of them and not very often. On newer servers the number of
> "free inode" messages increases. On my test server I have a "frozen"
> partition /home that shows the problem within fewer than ten runs
> every time. If I run DTrace with a lot of probes, the problem
> sometimes becomes a "Heisenbug".
>
> Dr. Andreas Longwitz