Date: Sat, 24 Jun 2017 15:03:30 -0700
From: Kirk McKusick <mckusick@mckusick.com>
To: Andreas Longwitz <longwitz@incore.de>
Cc: Konstantin Belousov <kostikbel@gmail.com>, freebsd-fs@freebsd.org
Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
Message-ID: <201706242203.v5OM3U2Q096283@chez.mckusick.com>
In-Reply-To: <20170624153017.GP3437@kib.kiev.ua>
> Date: Sat, 24 Jun 2017 18:30:17 +0300
> From: Konstantin Belousov <kostikbel@gmail.com>
> To: Andreas Longwitz <longwitz@incore.de>
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>
> On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
>> I try to understand the cause of the "free inode" problem described in
>> https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html
>>
>> I have set up a test server (FreeBSD 10.3-STABLE #4 r317936) with a
>> gjournaled partition for /home:
>>
>> mount -t ufs | grep /home -->
>> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
>> noatime, gjournal)
>
> As the first thing to try, if you perform your tests on the raw
> partition without gjournal, does the problem stay around?

I concur that running your test without gjournal is the next test to
try. I think that your suspicion that there is a race condition with
gjournal is likely correct. If that is true, then the problem will go
away when gjournal is taken out of the stack.

>> My test creates one snapshot of /home (it always gets inode 4) and
>> removes this snapshot:
>>
>> for i in 1 2 3 4 5 6 7 8; do
>>     echo "starting snaptest $i" >/dev/console
>>     mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>     echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>     rm -f /home/.snap/fscktest
>> done
>>
>> I never have more than this one snapshot at work, and during the test
>> I never have any other user processes working on /home.
>> A typical output looks like this:
>>
>> Jun 21 15:59:52 root: starting snaptest 1
>> Jun 21 15:59:52 root: 4 26592 -r-------- 1 root operator 90762970240
>>                       21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:53 root: starting snaptest 2
>> Jun 21 15:59:53 root: 4 26592 -r-------- 1 root operator 90762970152
>>                       21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
>> Jun 21 15:59:54 root: starting snaptest 3
>> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
>> Jun 21 15:59:54 root: 4 26592 -r-------- 1 root operator 90762969976
>>                       21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
>> Jun 21 15:59:56 root: starting snaptest 4
>> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
>> Jun 21 15:59:56 root: 4 26592 -r-------- 1 root operator 90762970240
>>                       21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:57 root: starting snaptest 5
>> Jun 21 15:59:57 root: 4 26592 -r-------- 1 root operator 90762970240
>>                       21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:58 root: starting snaptest 6
>> Jun 21 15:59:58 root: 4 26592 -r-------- 1 root operator 90762970216
>>                       21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
>> Jun 21 15:59:59 root: starting snaptest 7
>> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
>> Jun 21 15:59:59 root: 4 26592 -r-------- 1 root operator 90762970240
>>                       21 Jun 16:00 /home/.snap/fscktest
>> Jun 21 16:00:00 root: starting snaptest 8
>> Jun 21 16:00:00 root: 4 26592 -r-------- 1 root operator 90762970240
>>                       21 Jun 16:00 /home/.snap/fscktest
>>
>> The "free inode /home/4 had NNN blocks" message during the run of the
>> mount command is output of ffs_valloc(), because ffs_load_inode() has
>> loaded the disk inode 4 with a non-zero i_blocks field.
>> The corresponding "freeing inode /home/4 with NNN blocks" message
>> during the previous rm command is output of my following diagnostic
>> patch in function ffs_truncate():
>>
>> --- ffs_inode.c.1st	2016-06-08 17:25:21.000000000 +0200
>> +++ ffs_inode.c	2017-06-19 10:02:07.145360000 +0200
>> @@ -551,6 +551,9 @@
>>  		DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>>  	else	/* sanity */
>>  		DIP_SET(ip, i_blocks, 0);
>> +	if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
>> +		printf("freeing inode %s/%lu with %ld blocks\n",
>> +		    fs->fs_fsmnt, (u_long)ip->i_number,
>> +		    (long)DIP(ip, i_blocks));
>>  	ip->i_flag |= IN_CHANGE;
>>  #ifdef QUOTA
>>  	(void) chkdq(ip, -blocksreleased, NOCRED, 0);
>>
>> The rm command can only free all the blocks of the snapshot file
>> (i.e., i_blocks for inode 4 ends up zero) if this file has the
>> "correct" size:
>>
>> ls -ils /home/.snap/fscktest -->
>> 4 53184 -r-------- 1 root operator 90762970240 Jun 17 06:15
>> /home/.snap/fscktest
>>
>> The size of the /home partition is given by
>> diskinfo /dev/mirror/gmsvt7p10.journal -->
>> /dev/mirror/gmsvt7p10.journal	512	90762954240	177271395
>>
>> So we have 2769865 full 32kB blocks with size 90762936320. During
>> creation of a snapshot a "last block" (32kB) is written at this
>> offset, ending at 90762969088. Finally the snapblklist is written with
>> VOP_WRITE: "Write out the list of allocated blocks to the end of the
>> snapshot". In all my correct tests the snapblklist table is 1152 bytes
>> in size, giving the correct size of the snapshot file: 90762970240. In
>> this case the snapblklist table has 144 entries of 8 bytes each: one
>> length entry and 143 logical block numbers recorded in mapacct_ufs2():
>>
>> 	if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>> 		*ip->i_snapblklist++ = lblkno;
>>
>> The console output above shows three error situations with block
>> counters 704, 2112 and 192.
>> Dividing these values by 8 gives exactly the reduced size of the
>> snapblklist at the end of the snapshot file, so in these cases the
>> snapshot file is corrupt.
>
> I am not sure what you mean by 'match' there. Could you explicitly
> mention what relation between the snapshot size and the leaked blocks
> of the freed snapshot inode you noted?

I too am confused here. Are you saying, for example, that 192 / 8 == 24
and that the snapblklist is short by 24 blocks? Because from the table
below, it appears to be short by only 3 blocks.

>> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>>
>> 	++ct_blkno_all;
>> 	if (blkno == 0)
>> 		++ct_blkno_0;
>> 	if (blkno == BLK_NOCOPY)
>> 		++ct_blkno_nocopy;
>> 	if (blkno == BLK_SNAP)
>> 		++ct_blkno_snap;
>> 	if (blkno == 0 || blkno == BLK_NOCOPY)
>> 		continue;
>> 	if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>> 		*ip->i_snapblklist++ = lblkno;
>> 		++ct_snapblklist;
>> 	}
>> 	if (blkno == BLK_SNAP)
>> 		blkno = blkstofrags(fs, lblkno);
>> 	++ct_blkfree;
>> 	ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>> 	    vp->v_type, NULL);
>>
>> and for the 8 test runs shown above I can see these results using
>> DTrace at probe expunge_ufs2:return (blkno_snap is always 0):
>>
>> test   blkno_all  blkno_0  blkno_nocopy  snapblklist  blkfree  cg_nocopy
>> -----------------------------------------------------------------------
>> 1 ok     2770545   353320       2416404          143      821    2416404
>> 2 bad    2770545   587860       2181875          132      810    2416404
>> 3 bad    2770545   956582       1813175          110      788    2416393
>> 4 ok     2770545   353364       2416360          143      821    2416360
>> 5 ok     2770545   353364       2416360          143      821    2416360
>> 6 bad    2770545   418376       2351351          140      818    2416360
>> 7 ok     2770545   353367       2416357          143      821    2416357
>> 8 ok     2770545   353367       2416357          143      821    2416357
>>
>> For correct tests the sum of blkno_0 and blkno_nocopy is always the
>> same (2769724); for bad tests the counter for blkno_nocopy in
>> particular is significantly lower.
>> In the test table I give one more column, cg_nocopy, for a counter I
>> have added in cgaccount() to see how many entries are set to
>> BLK_NOCOPY during the copy of the cylinder group maps:
>>
>> 	if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>> 		++ct_cg_nocopy;
>> 		DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>> 	}
>> 	...
>> 	if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>> 		++ct_cg_nocopy;
>> 		((ufs2_daddr_t *)(ibp->b_data))[indiroff] = BLK_NOCOPY;
>> 	}
>>
>> For correct tests all the BLK_NOCOPYs which are set in cgaccount() can
>> later be seen in mapacct_ufs2(); for bad tests many of the BLK_NOCOPYs
>> have changed to 0.
>>
>> It looks like the rm command removing the previous snapshot somehow
>> runs "in the background" simultaneously with expunge_ufs2() and
>> changes some of the BLK_NOCOPYs to zero. So this may be a buffer
>> management problem which only exists on gjournaled partitions; maybe
>> getblk/readblock used in indiracct_ufs2() is not compatible with
>> gjournal in the special case of creating or removing a snapshot. A
>> hint in this direction is the fact that the first test after cleaning
>> the partition with
>> 	umount /home; fsck -y /home; mount /home
>> always succeeds. The following modified test procedure never fails:
>>
>> for i in 1 2 3 4 5 6 7 8; do
>>     echo "starting snaptest $i" >/dev/console
>>     mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>     echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>     rm -f /home/.snap/fscktest
>>     umount /home
>>     mount /home
>> done
>
> After the allocations of required blocks for the snapshot inode
> are finished, the filesystem is suspended. You can see the
> call to vfs_write_suspend() in ffs_snapshot() where the
> suspension is enforced. As part of the suspension, all soft-update
> work items are flushed; this is done by vfs_write_suspend() calling
> VFS_SYNC(MNT_SUSPEND).
>
> UFS classifies writers into primary and secondary.
> Primary are mostly the writes initiated by top-level syscall entries,
> like direct calls to write(2) or metadata-changing ops mkdir(),
> create() and so on. Secondary are writes performed when the system
> initiates metadata updates during inactivation, quota updates, softdep
> background processing and similar. Primary modifications are blocked
> outright on suspension, while secondary ones are waited for in the
> mentioned VFS_SYNC(MNT_SUSPEND) call.
>
> If you can provide proof that some SU-related activity happens after
> the suspension is established, this would be interesting to see.
> It might be something different and much simpler, but I do not see
> an obvious mistake in the code after reading your observations.

The mount information at the top shows:

/dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
noatime, gjournal)

Thus no soft updates are being used. We are running on just a basic UFS
filesystem, so soft-update processing has no relevance here.
>> Another proof that the snapshot file is corrupt when the snapblklist
>> is shortened is the fact that the rm command sporadically panics in a
>> kernel routine that is known to be correct:
>>
>> Unread portion of the kernel message buffer:
>> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
>> panic: ffs_blkfree_cg: freeing free block
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0857e3b1c0
>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
>> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
>> panic() at panic+0x43/frame 0xfffffe0857e3b310
>> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
>> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
>> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
>> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
>> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
>> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
>> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
>> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
>> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
>> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
>> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
>> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a,
>> rsp = 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>>
>> Any hints solving this problem are welcome.

Per the suggestion at the top, I recommend trying your test without
gjournal present to see if the bug goes away. If that is true, then we
can focus our attention on a possible race during rm in the gjournal
code.

	Kirk McKusick