Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 18 Jul 2013 11:43:54 +0100
From:      Dan Thomas <godders@gmail.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org, Palle Girgensohn <girgen@freebsd.org>, Jeff Roberson <jroberson@jroberson.net>, Julian Akehurst <julian@pingpong.se>
Subject:   Re: leaking lots of unreferenced inodes (pg_xlog files?)
Message-ID:  <CAG8duQ2gB11=bXzJ6hFFzzNUSwofn3WGd4=EEPJuhyNr6UmjwQ@mail.gmail.com>
In-Reply-To: <20130717053431.GN5991@kib.kiev.ua>
References:  <201307151932.r6FJWSxM087108@chez.mckusick.com> <51E5CD7A.2020109@FreeBSD.org> <20130717053431.GN5991@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
After a bit of experimentation, we've managed to nail down a
reasonably short run that exhibits this leak. Postgres' verbose log
output is linked below - whatever is causing the leak is in there
somewhere, but alas I lack the necessary understanding of Postgres'
internals to be able to pin it down any further.

https://dl.dropboxusercontent.com/u/13916028/pg_leak_log.txt

I've also got a 2.4M ktrace of this run, which is still pretty big,
I'll admit. Unfortunately it's got some data in it that I'd rather not
publish, but I'm happy to send it directly to anyone who might find it
useful.

Thanks,

Dan

On 17 July 2013 06:34, Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Wed, Jul 17, 2013 at 12:47:22AM +0200, Palle Girgensohn wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Kirk McKusick skrev:
>> >> Date: Mon, 15 Jul 2013 10:51:10 +0100 Subject: Re: leaking lots of
>> >> unreferenced inodes (pg_xlog files?) From: Dan Thomas
>> >> <godders@gmail.com> To: Kirk McKusick <mckusick@mckusick.com> Cc:
>> >> Palle Girgensohn <girgen@freebsd.org>, freebsd-fs@freebsd.org, Jeff
>> >> Roberson <jroberson@jroberson.net>, Julian Akehurst
>> >> <julian@pingpong.se> X-ASK-Info: Message Queued (2013/07/15
>> >> 02:51:22) X-ASK-Info: Confirmed by User (2013/07/15 02:55:04)
>> >>
>> >> On 11 June 2013 01:17, Kirk McKusick <mckusick@mckusick.com>
>> >> wrote:
>> >>> OK, good to have it narrowed down. I will look to devise some
>> >>> additional diagnostics that hopefully will help tease out the
>> >>> bug. I'll hopefully get back to you soon.
>> >> Hi,
>> >>
>> >> Is there any news on this issue? We're still running several
>> >> servers that are exhibiting this problem (most recently, one that
>> >> seems to be leaking around 10gb/hour), and it's getting to the
>> >> point where we're looking at moving to a different OS until it's
>> >> resolved.
>> >>
>> >> We have access to several production systems with this problem and
>> >> (at least from time to time) will have systems with a significant
>> >> leak on them that we can experiment with. Is there any way we can
>> >> assist with tracking this down? Any diagnostics or testing that
>> >> would be useful?
>> >>
>> >> Thanks, Dan
>> >
>> > Hi Dan (and Palle),
>> >
>> > Sorry for the long delay with no help / news. I have gotten
>> > side-tracked on several projects and have had little time to try and
>> > devise some tests that would help find the cause of the lost space.
>> > It almost certainly is a one-line fix (a missing vput or vrele
>> > probably in some error path), but finding where it goes is the hard
>> > part :-)
>> >
>> > I have had little success in inserting code that tracks reference
>> > counts (too many false positives). So, I am going to need some help
>> > from you to narrow it down. My belief is that there is some set of
>> > filesystem operations (system calls) that are leading to the
>> > problem. Notably, a file is being created, data put into it, then the
>> > file is deleted (either before or after being closed).  Somehow a
>> > reference to that file is persisting despite there being no valid
>> > reference to it. Hence the filesystem thinks it is still live and is
>> > not deleting it. When you do the forcible unmount, these files get
>> > cleared and the space shows back up.
>> >
>> > What I need to devise is a small test program doing the set of system
>> > calls that cause this to happen. The way that I would like to try and
>> > get it is to have you `ktrace -i' your application and then run your
>> > application just long enough to create at least one of these lost
>> > files. The goal is to minimize the amount of ktrace data through
>> > which we need to sift.
>> >
>> > In preparation for doing this test you need to have a kernel compiled
>> > with `option DIAGNOSTIC' or if you prefer, just add `#define
>> > DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you
>> > have at least one offending file when you try to unmount the affected
>> > filesystem and find it busy. Before doing the `umount -f', enable
>> > busy printing using `sysctl debug.busyprt=1'. Then capture the
>> > console output which will show the details of all the vnodes that had
>> > to be forcibly flushed. Hopefully we will then be able to correlate
>> > them back to the files (NAMI in the ktrace output) with which they
>> > were associated. We may need to augment the NAMI data with the inode
>> > number of the associated file to make the association with the
>> > busyprt output. Anyway, once we have that, we can look at all the
>> > system calls done on those files and create a small test program that
>> > exhibits the problem. Given a small test program, Jeff or I can track
>> > down the offending system call path and nail this pernicious bug once
>> > and for all.
>> >
>> > Kirk McKusick
>>
>> Hi,
>>
>> I have run ktrace -i on pg_ctl (which forks off all the postgresql
>> processes) and I got two "busy" files that where "lost" after a few
>> hours. dmesg reveals this:
>>
>> vflush: busy vnode
>> 0xfffffe067cdde960: tag ufs, type VREG
>>     usecount 1, writecount 0, refcount 2 mountedhere 0
>>     flags (VI(0x200))
>>  VI_LOCKed    v_object 0xfffffe0335922000 ref 0 pages 0
>>     lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723)
>>       ino 11047146, on dev da2s1d
>> vflush: busy vnode
>> 0xfffffe039f35bb40: tag ufs, type VREG
>>     usecount 1, writecount 0, refcount 3 mountedhere 0
>>     flags (VI(0x200))
>>  VI_LOCKed    v_object 0xfffffe03352701d0 ref 0 pages 0
>>     lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723)
>>       ino 11045961, on dev da2s1d
>>
>>
>> I had to umount -f, so they where "lost".
>>
>> So, now I have 55 GB ktrace output... ;)  Is there anything I can do to
>> filter it, or shall I compress it and put it on a web server for you to
>> fetch as it is?
>
> I think that 55GB of ktrace is obviously useless.  The Kirk' idea was to
> have an isolated test case that would only create the situation triggering
> the leak, without irrelevant activity.  This indeed requires drilling down
> and isolating the file activities to get to the core of problem.
>
> FWIW, I and Peter Holm used the following alternative approach quite
> successfully when tracking down other vnode reference leaks.  The approach
> still requires some understanding of the specifics of the problematic
> files to be useful, but not as much as isolated test.
>
> Basically, you take the patch below, and set the VV_DEBUGVREF flag for
> the vnode that has characteristics as much specific for the leaked vnode
> as possible.  The patch has example of setting the flag for all new NFS
> vnodes.  You would probably want to do the same in vfs_vgetf(),
> checking  e.g. for the partition where your leaks happen.  The limiting
> of the vnodes for which the vref traces are accumulated is needed to
> save the kernel memory.
>
> Then after the leak was observed, you just print the vnode with ddb
> command 'show vnode addr' and send the output to developer.
>
> Index: sys/sys/vnode.h
> ===================================================================
> --- sys/sys/vnode.h     (revision 248723)
> +++ sys/sys/vnode.h     (working copy)
> @@ -94,6 +94,13 @@ struct vpollinfo {
>
>  #if defined(_KERNEL) || defined(_KVM_VNODE)
>
> +struct debug_ref {
> +       TAILQ_ENTRY(debug_ref) link;
> +       int val;
> +       const char *op;
> +       struct stack stack;
> +};
> +
>  struct vnode {
>         /*
>          * Fields which define the identity of the vnode.  These fields are
> @@ -169,6 +176,7 @@ struct vnode {
>         int     v_writecount;                   /* v ref count of writers */
>         u_int   v_hash;
>         enum    vtype v_type;                   /* u vnode type */
> +       TAILQ_HEAD(, debug_ref) v_debug_ref;
>  };
>
>  #endif /* defined(_KERNEL) || defined(_KVM_VNODE) */
> @@ -253,6 +261,7 @@ struct xvnode {
>  #define        VV_DELETED      0x0400  /* should be removed */
>  #define        VV_MD           0x0800  /* vnode backs the md device */
>  #define        VV_FORCEINSMQ   0x1000  /* force the insmntque to succeed */
> +#define        VV_DEBUGVREF    0x2000
>
>  /*
>   * Vnode attributes.  A field value of VNOVAL represents a field whose value
> Index: sys/kern/vfs_subr.c
> ===================================================================
> --- sys/kern/vfs_subr.c (revision 248723)
> +++ sys/kern/vfs_subr.c (working copy)
> @@ -71,6 +71,7 @@ __FBSDID("$FreeBSD$");
>  #include <sys/sched.h>
>  #include <sys/sleepqueue.h>
>  #include <sys/smp.h>
> +#include <sys/stack.h>
>  #include <sys/stat.h>
>  #include <sys/sysctl.h>
>  #include <sys/syslog.h>
> @@ -871,6 +872,23 @@ static struct kproc_desc vnlru_kp = {
>  };
>  SYSINIT(vnlru, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start,
>      &vnlru_kp);
> +
> +MALLOC_DEFINE(M_RECORD_REF, "recordref", "recordref");
> +static void
> +v_record_ref(struct vnode *vp, int val, const char *op)
> +{
> +       struct debug_ref *r;
> +
> +       if ((vp->v_type != VREG && vp->v_type != VBAD) ||
> +           (vp->v_vflag & VV_DEBUGVREF) == 0)
> +               return;
> +       r = malloc(sizeof(struct debug_ref), M_RECORD_REF, M_NOWAIT |
> +           M_USE_RESERVE);
> +       r->val = val;
> +       r->op = op;
> +       stack_save(&r->stack);
> +       TAILQ_INSERT_TAIL(&vp->v_debug_ref, r, link);
> +}
>
>  /*
>   * Routines having to do with the management of the vnode table.
> @@ -1073,6 +1091,7 @@ alloc:
>                         vp->v_vflag |= VV_NOKNOTE;
>         }
>         rangelock_init(&vp->v_rl);
> +       TAILQ_INIT(&vp->v_debug_ref);
>
>         /*
>          * For the filesystems which do not use vfs_hash_insert(),
> @@ -1082,6 +1101,7 @@ alloc:
>          */
>         vp->v_hash = (uintptr_t)vp >> vnsz2log;
>
> +       TAILQ_INIT(&vp->v_debug_ref);
>         *vpp = vp;
>         return (0);
>  }
> @@ -2197,6 +2217,7 @@ vget(struct vnode *vp, int flags, struct thread *t
>                         vinactive(vp, td);
>                 vp->v_iflag &= ~VI_OWEINACT;
>         }
> +       v_record_ref(vp, 1, "vget");
>         VI_UNLOCK(vp);
>         return (0);
>  }
> @@ -2211,6 +2232,7 @@ vref(struct vnode *vp)
>         CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
>         VI_LOCK(vp);
>         v_incr_usecount(vp);
> +       v_record_ref(vp, 1, "vref");
>         VI_UNLOCK(vp);
>  }
>
> @@ -2253,6 +2275,7 @@ vputx(struct vnode *vp, int func)
>                 KASSERT(func == VPUTX_VRELE, ("vputx: wrong func"));
>         CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
>         VI_LOCK(vp);
> +       v_record_ref(vp, -1, "vputx");
>
>         /* Skip this v_writecount check if we're going to panic below. */
>         VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < 1, vp,
> @@ -2409,6 +2432,7 @@ void
>  vdropl(struct vnode *vp)
>  {
>         struct bufobj *bo;
> +       struct debug_ref *r, *r1;
>         struct mount *mp;
>         int active;
>
> @@ -2489,6 +2513,9 @@ vdropl(struct vnode *vp)
>         lockdestroy(vp->v_vnlock);
>         mtx_destroy(&vp->v_interlock);
>         mtx_destroy(BO_MTX(bo));
> +       TAILQ_FOREACH_SAFE(r, &vp->v_debug_ref, link, r1) {
> +               free(r, M_RECORD_REF);
> +       }
>         uma_zfree(vnode_zone, vp);
>  }
>
> @@ -2888,6 +2915,8 @@ vn_printf(struct vnode *vp, const char *fmt, ...)
>         va_list ap;
>         char buf[256], buf2[16];
>         u_long flags;
> +       int ref;
> +       struct debug_ref *r;
>
>         va_start(ap, fmt);
>         vprintf(fmt, ap);
> @@ -2960,8 +2989,21 @@ vn_printf(struct vnode *vp, const char *fmt, ...)
>                     vp->v_object->resident_page_count);
>         printf("    ");
>         lockmgr_printinfo(vp->v_vnlock);
> -       if (vp->v_data != NULL)
> -               VOP_PRINT(vp);
> +#if DDB
> +       if (kdb_active) {
> +               if (vp->v_data != NULL)
> +                       VOP_PRINT(vp);
> +       }
> +#endif
> +
> +       /* Getnewvnode() initial reference is not recorded due to VNON */
> +       ref = 1;
> +       TAILQ_FOREACH(r, &vp->v_debug_ref, link) {
> +               ref += r->val;
> +               printf("REF %d %s\n", ref, r->op);
> +               stack_print(&r->stack);
> +       }
> +
>  }
>
>  #ifdef DDB
> Index: sys/fs/nfsclient/nfs_clport.c
> ===================================================================
> --- sys/fs/nfsclient/nfs_clport.c       (revision 248723)
> +++ sys/fs/nfsclient/nfs_clport.c       (working copy)
> @@ -273,6 +273,7 @@ nfscl_nget(struct mount *mntp, struct vnode *dvp,
>                 /* vfs_hash_insert() vput()'s the losing vnode */
>                 return (0);
>         }
> +       vp->v_vflag |= VV_DEBUGVREF;
>         *npp = np;
>
>         return (0);
> Index: sys/fs/nfsclient/nfs_clnode.c
> ===================================================================
> --- sys/fs/nfsclient/nfs_clnode.c       (revision 248723)
> +++ sys/fs/nfsclient/nfs_clnode.c       (working copy)
> @@ -179,6 +179,7 @@ ncl_nget(struct mount *mntp, u_int8_t *fhp, int fh
>                 /* vfs_hash_insert() vput()'s the losing vnode */
>                 return (0);
>         }
> +       vp->v_vflag |= VV_DEBUGVREF;
>         *npp = np;
>
>         return (0);



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAG8duQ2gB11=bXzJ6hFFzzNUSwofn3WGd4=EEPJuhyNr6UmjwQ>