Date:      Fri, 23 Jun 2017 11:14:43 +0200
From:      Andreas Longwitz <longwitz@incore.de>
To:        freebsd-fs@freebsd.org
Subject:   ufs snapshot is sometimes corrupt on gjourneled partition
Message-ID:  <594CDC03.3010309@incore.de>

I am trying to understand the cause of the "free inode" problem described in
  https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html.

I have set up a test server (FreeBSD 10.3-STABLE #4 r317936) with a
gjournaled partition for /home:
mount -t ufs | grep /home -->
/dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
noatime, gjournal)

My test creates one snapshot of /home (which always gets inode 4) and
then removes this snapshot:

  for i in 1 2 3 4 5 6 7 8; do
     echo "starting snaptest $i" >/dev/console
     mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
     echo $(ls -ils /home/.snap/fscktest) >/dev/console
     rm -f /home/.snap/fscktest /home
  done

I never have more than this one snapshot active, and during the test no
other user processes are working on /home. A typical output looks like
this:

Jun 21 15:59:52 root: starting snaptest 1
Jun 21 15:59:52 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:53 root: starting snaptest 2
Jun 21 15:59:53 root: 4 26592 -r--------  1 root  operator  90762970152
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
Jun 21 15:59:54 root: starting snaptest 3
Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
Jun 21 15:59:54 root: 4 26592 -r--------  1 root  operator  90762969976
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
Jun 21 15:59:56 root: starting snaptest 4
Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
Jun 21 15:59:56 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:57 root: starting snaptest 5
Jun 21 15:59:57 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:58 root: starting snaptest 6
Jun 21 15:59:58 root: 4 26592 -r--------  1 root  operator  90762970216
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
Jun 21 15:59:59 root: starting snaptest 7
Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
Jun 21 15:59:59 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 16:00 /home/.snap/fscktest
Jun 21 16:00:00 root: starting snaptest 8
Jun 21 16:00:00 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 16:00 /home/.snap/fscktest

The "free inode /home/4 had NNN blocks" message printed while the mount
command runs comes from ffs_valloc(), because ffs_load_inode() has
loaded the on-disk inode 4 with a non-zero i_blocks field. The
corresponding "freeing inode /home/4 with NNN blocks" message printed
during the preceding rm command comes from the following diagnostic
patch of mine in ffs_truncate():

--- ffs_inode.c.1st     2016-06-08 17:25:21.000000000 +0200
+++ ffs_inode.c 2017-06-19 10:02:07.145360000 +0200
@@ -551,6 +551,9 @@
                DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
        else    /* sanity */
                DIP_SET(ip, i_blocks, 0);
+       if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
+               printf("freeing inode %s/%lu with %ld blocks\n",
+                       fs->fs_fsmnt, (u_long)ip->i_number,
+                       (long)DIP(ip, i_blocks));
        ip->i_flag |= IN_CHANGE;
 #ifdef QUOTA
        (void) chkdq(ip, -blocksreleased, NOCRED, 0);
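
For reference, the matching "free inode ... had ... blocks" message
comes from a check in ffs_valloc() (sys/ufs/ffs/ffs_alloc.c) that looks
roughly like this -- quoted from memory, so take it as a sketch rather
than exact source:

        if (DIP(ip, i_blocks) && (fs->fs_flags & FS_UNCLEAN) == 0) {
                /* a freshly allocated inode should hold no blocks */
                printf("free inode %s/%lu had %ld blocks\n",
                    fs->fs_fsmnt, (u_long)ino, (long)DIP(ip, i_blocks));
                DIP_SET(ip, i_blocks, 0);
        }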

The rm command can only free all the blocks of the snapshot file
(meaning i_blocks for inode 4 ends up at zero) if this file has the
"correct" size:

ls -ils /home/.snap/fscktest -->
4 53184 -r--------  1 root  operator  90762970240 Jun 17 06:15
/home/.snap/fscktest

The size of the /home partition is given by
diskinfo /dev/mirror/gmsvt7p10.journal -->
/dev/mirror/gmsvt7p10.journal   512     90762954240     177271395
(sector size, media size in bytes, media size in sectors)

So we have 2769865 full 32kB blocks totalling 90762936320 bytes. While
creating a snapshot a "last block" (32kB) is written at this offset,
ending at 90762969088. Finally the snapblklist is written with
VOP_WRITE: "Write out the list of allocated blocks to the end of the
snapshot". In all my correct tests the table snapblklist is 1152 bytes
in size, giving the correct size of the snapshot file: 90762970240. In
this case the table snapblklist has 144 entries of 8 bytes each: one
length entry and 143 logical block numbers recorded in mapacct_ufs2():

          if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
                   *ip->i_snapblklist++ = lblkno;
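
To make the size arithmetic easy to check, here is a small userland
sketch of my own (not kernel code; the numbers are taken from the
diskinfo output above):

#include <stdio.h>

int
main(void)
{
        long long mediasize = 90762954240LL;     /* media size in bytes */
        long long bsize = 32768;                 /* filesystem block size */
        long long nfull = mediasize / bsize;     /* 2769865 full blocks */
        long long lastend = (nfull + 1) * bsize; /* end of the "last block" */
        long long listsize = (1 + 143) * 8;      /* snapblklist: 1152 bytes */

        printf("full blocks: %lld (%lld bytes)\n", nfull, nfull * bsize);
        printf("last block ends at %lld\n", lastend);
        printf("expected snapshot file size %lld\n", lastend + listsize);
        return (0);
}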

The console output above shows three error situations, with leftover
block counts 704, 2112 and 192. Dividing these values by 8 gives
exactly the number of bytes by which the snapblklist at the end of the
snapshot file was shortened, so in these cases the snapshot file is
corrupt; the arithmetic is spelled out in the sketch below.
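
The same arithmetic for the three bad cases (my own check, assuming
di_blocks counts DEV_BSIZE (512-byte) units, so one lost 32kB block
shows up as 64 leftover blocks and as one missing 8-byte snapblklist
entry):

#include <stdio.h>

int
main(void)
{
        long long good = 90762970240LL;   /* size of a correct snapshot */
        long long bad[3] = { 90762970152LL, 90762969976LL, 90762970216LL };
        long long left[3] = { 704, 2112, 192 }; /* leftover i_blocks at rm */
        int i;

        for (i = 0; i < 3; i++)
                printf("deficit %lld bytes = %lld entries, "
                    "leftover/8 = %lld, leftover/64 = %lld blocks\n",
                    good - bad[i], (good - bad[i]) / 8,
                    left[i] / 8, left[i] / 64);
        return (0);
}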

I use a test kernel with some extra counters ct_* in mapacct_ufs2():

           ++ct_blkno_all;
           if (blkno == 0)
                   ++ct_blkno_0;
           if (blkno == BLK_NOCOPY)
                   ++ct_blkno_nocopy;
           if (blkno == BLK_SNAP)
                   ++ct_blkno_snap;
           if (blkno == 0 || blkno == BLK_NOCOPY)
                   continue;
           if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
                   *ip->i_snapblklist++ = lblkno;
                   ++ct_snapblklist;
           }
           if (blkno == BLK_SNAP)
                   blkno = blkstofrags(fs, lblkno);
           ++ct_blkfree;
           ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
               vp->v_type, NULL);

and for the 8 test runs shown above I see the following results using
DTrace at the probe expunge_ufs2:return (blkno_snap is always 0):

 test   blkno_all blkno_0 blkno_nocopy snapblklist blkfree cg_nocopy
 -------------------------------------------------------------------
 1 ok   2770545   353320  2416404      143         821     2416404
 2 bad  2770545   587860  2181875      132         810     2416404
 3 bad  2770545   956582  1813175      110         788     2416393
 4 ok   2770545   353364  2416360      143         821     2416360
 5 ok   2770545   353364  2416360      143         821     2416360
 6 bad  2770545   418376  2351351      140         818     2416360
 7 ok   2770545   353367  2416357      143         821     2416357
 8 ok   2770545   353367  2416357      143         821     2416357
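
As a consistency check of my own on these counters: the loop above
either skips an entry (blkno == 0 or BLK_NOCOPY) or frees it, so
blkfree should equal blkno_all - blkno_0 - blkno_nocopy, and for the
good runs blkno_0 + blkno_nocopy should be the constant mentioned
below; both can be verified directly from the table:

#include <stdio.h>

int
main(void)
{
        /* blkno_all, blkno_0, blkno_nocopy, blkfree from the table above */
        long long t[8][4] = {
                { 2770545, 353320, 2416404, 821 },
                { 2770545, 587860, 2181875, 810 },
                { 2770545, 956582, 1813175, 788 },
                { 2770545, 353364, 2416360, 821 },
                { 2770545, 353364, 2416360, 821 },
                { 2770545, 418376, 2351351, 818 },
                { 2770545, 353367, 2416357, 821 },
                { 2770545, 353367, 2416357, 821 },
        };
        int i;

        for (i = 0; i < 8; i++)
                printf("test %d: all - 0 - nocopy = %lld (blkfree %lld), "
                    "0 + nocopy = %lld\n", i + 1,
                    t[i][0] - t[i][1] - t[i][2], t[i][3],
                    t[i][1] + t[i][2]);
        return (0);
}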

For correct tests the sum of blkno_0 and blkno_nocopy is always the
same (2769724); for bad tests the counter blkno_nocopy in particular is
significantly lower. The test table has one more column, cg_nocopy, for
a counter I added in cgaccount() to see how many entries are set to
BLK_NOCOPY while the cylinder group maps are copied:

                if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
                        ++ct_cg_nocopy;
                        DIP_SET(ip, i_db[loc], BLK_NOCOPY);
                }
                ...
                if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
                        ++ct_cg_nocopy;
                        ((ufs2_daddr_t *)(ibp->b_data))[indiroff] =
                                BLK_NOCOPY;
                }

For correct tests all the BLK_NOCOPYs that are set in cgaccount() can
later be seen in mapacct_ufs2() (cg_nocopy equals blkno_nocopy
exactly); for bad tests many of the BLK_NOCOPYs have changed to 0.

It looks like the rm command removing the previous snapshot somehow
keeps running "in the background" simultaneously with expunge_ufs2()
and changes some of the BLK_NOCOPYs to zero. So this may be a buffer
management problem that only exists on gjournaled partitions; maybe
getblk/readblock as used in indiracct_ufs2() is not compatible with
gjournal in the special case of creating or removing a snapshot. A hint
in this direction is the fact that the first test after cleaning the
partition with
        umount /home; fsck -y /home; mount /home
always succeeds. The following modified test procedure never fails:

  for i in 1 2 3 4 5 6 7 8; do
     echo "starting snaptest $i" >/dev/console
     mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
     echo $(ls -ils /home/.snap/fscktest) >/dev/console
     rm -f /home/.snap/fscktest /home
     umount /home
     mount /home
  done

Another proof that the snapshot file is corrupt when the snapblklist is
shortened is the fact that the rm command sporadically panics in a
kernel routine that is known to be correct:

Unread portion of the kernel message buffer:
dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
panic: ffs_blkfree_cg: freeing free block
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
0xfffffe0857e3b1c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
panic() at panic+0x43/frame 0xfffffe0857e3b310
ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
--- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a, rsp =
0x7fffffffe988, rbp = 0x7fffffffea20 ---

Any hints for solving this problem are welcome.
-- 

Andreas Longwitz



