From: Kirk McKusick <mckusick@mckusick.com>
To: Andreas Longwitz
Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
cc: Konstantin Belousov, freebsd-fs@freebsd.org
In-reply-to: <20170624153017.GP3437@kib.kiev.ua>
Date: Sat, 24 Jun 2017 15:03:30 -0700

> Date: Sat, 24 Jun 2017 18:30:17 +0300
> From: Konstantin Belousov
> To: Andreas Longwitz
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>
> On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
>> I am trying to understand the cause of the "free inode" problem
>> described in
>> https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html
>>
>> I have set up a test server (FreeBSD 10.3-STABLE #4 r317936) with a
>> gjournaled partition for /home:
>>
>>     mount -t ufs | grep /home -->
>>     /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
>>     noatime, gjournal)
>
> As the first thing to try, if you perform your tests on the raw
> partition without gjournal, does the problem stay around?

I concur that running your test without gjournal is the next test to
try. I think that your suspicion that there is a race condition with
gjournal is likely correct. And if that is true, then the problem will
go away when gjournal is taken out of the stack.

>> My test creates one snapshot of /home (it always gets inode 4) and
>> removes this snapshot:
>>
>>     for i in 1 2 3 4 5 6 7 8; do
>>         echo "starting snaptest $i" >/dev/console
>>         mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>         echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>         rm -f /home/.snap/fscktest
>>     done
>>
>> I never have more than this one snapshot at work, and during the test
>> I never have any other user processes working on /home.
>> A typical output looks like this:
>>
>> Jun 21 15:59:52 root: starting snaptest 1
>> Jun 21 15:59:52 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:53 root: starting snaptest 2
>> Jun 21 15:59:53 root: 4 26592 -r-------- 1 root operator 90762970152 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
>> Jun 21 15:59:54 root: starting snaptest 3
>> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
>> Jun 21 15:59:54 root: 4 26592 -r-------- 1 root operator 90762969976 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
>> Jun 21 15:59:56 root: starting snaptest 4
>> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
>> Jun 21 15:59:56 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:57 root: starting snaptest 5
>> Jun 21 15:59:57 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:58 root: starting snaptest 6
>> Jun 21 15:59:58 root: 4 26592 -r-------- 1 root operator 90762970216 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
>> Jun 21 15:59:59 root: starting snaptest 7
>> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
>> Jun 21 15:59:59 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 16:00 /home/.snap/fscktest
>> Jun 21 16:00:00 root: starting snaptest 8
>> Jun 21 16:00:00 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 16:00 /home/.snap/fscktest
>>
>> The "free inode /home/4 had NNN blocks" message during the run of the
>> mount command is output from ffs_valloc(), because ffs_load_inode() has
>> loaded the disk inode 4 with a non-zero i_blocks field. The
>> corresponding "freeing inode /home/4 with NNN blocks" message during
>> the previous rm command is output from the following diagnostic patch
>> of mine in ffs_truncate():
>>
>> --- ffs_inode.c.1st	2016-06-08 17:25:21.000000000 +0200
>> +++ ffs_inode.c	2017-06-19 10:02:07.145360000 +0200
>> @@ -551,6 +551,9 @@
>>  		DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>>  	else	/* sanity */
>>  		DIP_SET(ip, i_blocks, 0);
>> +	if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
>> +		printf("freeing inode %s/%lu with %ld blocks\n",
>> +		    fs->fs_fsmnt, (u_long)ip->i_number, (long)DIP(ip, i_blocks));
>>  	ip->i_flag |= IN_CHANGE;
>>  #ifdef QUOTA
>>  	(void) chkdq(ip, -blocksreleased, NOCRED, 0);
>>
>> The rm command can only free all the blocks of the snapshot file (that
>> is, i_blocks for inode 4 ends up zero) if this file has the "correct"
>> size:
>>
>>     ls -ils /home/.snap/fscktest -->
>>     4 53184 -r-------- 1 root operator 90762970240 Jun 17 06:15
>>     /home/.snap/fscktest
>>
>> The size of the /home partition is given by
>>
>>     diskinfo /dev/mirror/gmsvt7p10.journal -->
>>     /dev/mirror/gmsvt7p10.journal	512	90762954240	177271395
>>
>> So we have 2769865 full 32kB blocks covering 90762936320 bytes. While
>> creating a snapshot a "last block" (32kB) is written at this offset,
>> ending at 90762969088. Finally the snapblklist is written with
>> VOP_WRITE ("Write out the list of allocated blocks to the end of the
>> snapshot"). In all my correct tests the table snapblklist is 1152 bytes
>> in size, giving the correct size of the snapshot file: 90762970240.
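The size arithmetic above is easy to re-check. A quick sh sketch, using
only the diskinfo and ls figures quoted in this thread and the stated
32kB block size:

    partsize=90762954240                      # bytes, from diskinfo above
    bsize=32768                               # fs_bsize (32kB)
    full=$((partsize / bsize))                # 2769865 full blocks
    echo $((full * bsize))                    # 90762936320, end of the full blocks
    echo $((full * bsize + bsize))            # 90762969088, end of the extra "last block"
    echo $((full * bsize + bsize + 144 * 8))  # 90762970240, plus the 1152-byte snapblklist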
>> In this case the table snapblklist has 144 entries of 8 bytes each:
>> one entry holding the length and 143 logical block numbers recorded in
>> mapacct_ufs2():
>>
>>     if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>>             *ip->i_snapblklist++ = lblkno;
>>
>> The console output above shows three error situations with block
>> counters 704, 2112 and 192. Dividing these values by 8 gives exactly
>> the reduction in size of the snapblklist at the end of the snapshot
>> file, so in these cases the snapshot file is corrupt.
>
> I am not sure what you mean by 'match' there. Could you explicitly
> state what relation you noted between the snapshot size and the leaked
> blocks of the freed snapshot inode?

I too am confused here. Are you saying for example that 192 / 8 == 24
and that the snapblklist is short by 24 blocks? Because from the table
below, it appears to be short by only 3 blocks.

>> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>>
>>     ++ct_blkno_all;
>>     if (blkno == 0)
>>             ++ct_blkno_0;
>>     if (blkno == BLK_NOCOPY)
>>             ++ct_blkno_nocopy;
>>     if (blkno == BLK_SNAP)
>>             ++ct_blkno_snap;
>>     if (blkno == 0 || blkno == BLK_NOCOPY)
>>             continue;
>>     if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>>             *ip->i_snapblklist++ = lblkno;
>>             ++ct_snapblklist;
>>     }
>>     if (blkno == BLK_SNAP)
>>             blkno = blkstofrags(fs, lblkno);
>>     ++ct_blkfree;
>>     ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>>         vp->v_type, NULL);
>>
>> and for the 8 test runs shown above I can see these results using
>> DTrace at probe expunge_ufs2:return (blkno_snap is always 0):
>>
>>     test   blkno_all  blkno_0  blkno_nocopy  snapblklist  blkfree  cg_nocopy
>>     -------------------------------------------------------------------------
>>     1 ok     2770545   353320       2416404          143      821    2416404
>>     2 bad    2770545   587860       2181875          132      810    2416404
>>     3 bad    2770545   956582       1813175          110      788    2416393
>>     4 ok     2770545   353364       2416360          143      821    2416360
>>     5 ok     2770545   353364       2416360          143      821    2416360
>>     6 bad    2770545   418376       2351351          140      818    2416360
>>     7 ok     2770545   353367       2416357          143      821    2416357
>>     8 ok     2770545   353367       2416357          143      821    2416357
>>
>> For correct tests the sum of blkno_0 and blkno_nocopy is always the
>> same (2769724); for bad tests the blkno_nocopy counter in particular is
>> significantly lower. In the test table I give one more column,
>> cg_nocopy, for a counter I added in cgaccount() to see how many entries
>> are set to BLK_NOCOPY while the cylinder group maps are copied:
>>
>>     if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>             ++ct_cg_nocopy;
>>             DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>>     }
>>     ...
>>     if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>             ++ct_cg_nocopy;
>>             ((ufs2_daddr_t *)(ibp->b_data))[indiroff] = BLK_NOCOPY;
>>     }
>>
>> For correct tests all the BLK_NOCOPYs that are set in cgaccount() can
>> later be seen in mapacct_ufs2(); for bad tests many of the BLK_NOCOPYs
>> have changed to 0.
>>
>> It looks like the rm command removing the previous snapshot somehow
>> runs "in the background" simultaneously with expunge_ufs2() and changes
>> some of the BLK_NOCOPYs to zero. So this may be a buffer management
>> problem that only exists on gjournaled partitions; maybe
>> getblk/readblock as used in indiracct_ufs2() is not compatible with
>> gjournal in the special case of creating or removing a snapshot. A hint
>> in this direction is the fact that the first test after cleaning the
>> partition with
>>
>>     umount /home; fsck -y /home; mount /home
>>
>> always succeeds.
>>
>> The following modified test procedure never fails:
>>
>>     for i in 1 2 3 4 5 6 7 8; do
>>         echo "starting snaptest $i" >/dev/console
>>         mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>         echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>         rm -f /home/.snap/fscktest /home
>>         umount /home
>>         mount /home
>>     done
>
> After the allocations of the required blocks for the snapshot inode are
> finished, the filesystem is suspended. You can see the call to
> vfs_write_suspend() in ffs_snapshot() where the suspension is enforced.
> As part of the suspension, all soft-update work items are flushed; this
> is done by vfs_write_suspend() calling VFS_SYNC(MNT_SUSPEND).
>
> UFS classifies writers into primary and secondary. Primary are mostly
> the writes initiated by the top-level syscall entries, like direct
> calls to write(2) or metadata-changing ops mkdir(), create() and so on.
> Secondary are writes performed when the system initiates metadata
> updates during inactivation, quota updates, softdep background
> processing and similar. Primary modifications are blocked outright on
> suspension, while secondary ones are waited on to finish in the
> mentioned VFS_SYNC(MNT_SUSPEND) call.
>
> If you can provide proof that some SU-related activity happens after
> the suspension is established, that would be interesting to see.
> It might be something different and much simpler, but I do not see an
> obvious mistake in the code after reading your observations.

The mount information at the top shows:

    /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
    noatime, gjournal)

Thus no soft updates are being used. We are running on just a basic UFS
filesystem. So soft-update processing has no relevance here.

>> Another proof that the snapshot file is corrupt when the snapblklist
>> is shortened is the fact that the rm command sporadically panics in a
>> kernel routine that is known to be correct:
>>
>> Unread portion of the kernel message buffer:
>> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
>> panic: ffs_blkfree_cg: freeing free block
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0857e3b1c0
>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
>> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
>> panic() at panic+0x43/frame 0xfffffe0857e3b310
>> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
>> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
>> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
>> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
>> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
>> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
>> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
>> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
>> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
>> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
>> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
>> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a,
>>     rsp = 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>>
>> Any hints toward solving this problem are welcome.

Per the suggestion at the top, I recommend trying your test without
gjournal present to see if the bug goes away.
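For concreteness, such a control run could look like the sketch below.
The loop is just your test from above, pointed at a freshly created
plain UFS filesystem; the device /dev/ada1p1 and mount point /mnt/test
are only placeholders for whatever spare partition is available, and
newfs options are left at their defaults:

    # control test on a plain UFS filesystem, gjournal taken out of the stack
    newfs /dev/ada1p1
    mkdir -p /mnt/test
    mount -o async,noatime /dev/ada1p1 /mnt/test
    mkdir -p /mnt/test/.snap        # normally created by newfs already
    # populate /mnt/test with enough data to make the snapshot non-trivial,
    # then run the same loop as on /home:
    for i in 1 2 3 4 5 6 7 8; do
        echo "starting snaptest $i" >/dev/console
        mount -u -o snapshot -o noatime -o async /mnt/test/.snap/fscktest /mnt/test
        echo $(ls -ils /mnt/test/.snap/fscktest) >/dev/console
        rm -f /mnt/test/.snap/fscktest
    done

Keeping the control run on a separate partition leaves the journaled
/home untouched for further experiments.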
If the bug does indeed go away without gjournal, then we can focus our
attention on a possible race during rm in the gjournal code.

	Kirk McKusick