Date: Thu, 18 Jul 2013 14:23:25 +0300 From: Konstantin Belousov <kostikbel@gmail.com> To: Dan Thomas <godders@gmail.com> Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org, Palle Girgensohn <girgen@freebsd.org>, Jeff Roberson <jroberson@jroberson.net>, Julian Akehurst <julian@pingpong.se> Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?) Message-ID: <20130718112325.GZ5991@kib.kiev.ua> In-Reply-To: <CAG8duQ2gB11=bXzJ6hFFzzNUSwofn3WGd4=EEPJuhyNr6UmjwQ@mail.gmail.com> References: <201307151932.r6FJWSxM087108@chez.mckusick.com> <51E5CD7A.2020109@FreeBSD.org> <20130717053431.GN5991@kib.kiev.ua> <CAG8duQ2gB11=bXzJ6hFFzzNUSwofn3WGd4=EEPJuhyNr6UmjwQ@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
--MrbiU6dcJfOZ616B Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jul 18, 2013 at 11:43:54AM +0100, Dan Thomas wrote: > After a bit of experimentation, we've managed to nail down a > reasonably short run that exhibits this leak. Postgres' verbose log > output is linked below - whatever is causing the leak is in there > somewhere, but alas I lack the necessary understanding of Postgres' > internals to be able to pin it down any further. >=20 > https://dl.dropboxusercontent.com/u/13916028/pg_leak_log.txt This is of no use, at least for me. >=20 > I've also got a 2.4M ktrace of this run, which is still pretty big, > I'll admit. Unfortunately it's got some data in it that I'd rather not > publish, but I'm happy to send it directly to anyone who might find it > useful. Such big ktrace is also unusable. If you want me to look at the leak, use the patch which I sent earlier, and add the flag to the vnodes which are likely to be leaked. Then, after 'show vnode', I would be able to see what is going on, I hope. >=20 > Thanks, >=20 > Dan >=20 > On 17 July 2013 06:34, Konstantin Belousov <kostikbel@gmail.com> wrote: > > On Wed, Jul 17, 2013 at 12:47:22AM +0200, Palle Girgensohn wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- > >> Hash: SHA1 > >> > >> Kirk McKusick skrev: > >> >> Date: Mon, 15 Jul 2013 10:51:10 +0100 Subject: Re: leaking lots of > >> >> unreferenced inodes (pg_xlog files?) From: Dan Thomas > >> >> <godders@gmail.com> To: Kirk McKusick <mckusick@mckusick.com> Cc: > >> >> Palle Girgensohn <girgen@freebsd.org>, freebsd-fs@freebsd.org, Jeff > >> >> Roberson <jroberson@jroberson.net>, Julian Akehurst > >> >> <julian@pingpong.se> X-ASK-Info: Message Queued (2013/07/15 > >> >> 02:51:22) X-ASK-Info: Confirmed by User (2013/07/15 02:55:04) > >> >> > >> >> On 11 June 2013 01:17, Kirk McKusick <mckusick@mckusick.com> > >> >> wrote: > >> >>> OK, good to have it narrowed down. I will look to devise some > >> >>> additional diagnostics that hopefully will help tease out the > >> >>> bug. I'll hopefully get back to you soon. > >> >> Hi, > >> >> > >> >> Is there any news on this issue? We're still running several > >> >> servers that are exhibiting this problem (most recently, one that > >> >> seems to be leaking around 10gb/hour), and it's getting to the > >> >> point where we're looking at moving to a different OS until it's > >> >> resolved. > >> >> > >> >> We have access to several production systems with this problem and > >> >> (at least from time to time) will have systems with a significant > >> >> leak on them that we can experiment with. Is there any way we can > >> >> assist with tracking this down? Any diagnostics or testing that > >> >> would be useful? > >> >> > >> >> Thanks, Dan > >> > > >> > Hi Dan (and Palle), > >> > > >> > Sorry for the long delay with no help / news. I have gotten > >> > side-tracked on several projects and have had little time to try and > >> > devise some tests that would help find the cause of the lost space. > >> > It almost certainly is a one-line fix (a missing vput or vrele > >> > probably in some error path), but finding where it goes is the hard > >> > part :-) > >> > > >> > I have had little success in inserting code that tracks reference > >> > counts (too many false positives). So, I am going to need some help > >> > from you to narrow it down. My belief is that there is some set of > >> > filesystem operations (system calls) that are leading to the > >> > problem. Notably, a file is being created, data put into it, then the > >> > file is deleted (either before or after being closed). Somehow a > >> > reference to that file is persisting despite there being no valid > >> > reference to it. Hence the filesystem thinks it is still live and is > >> > not deleting it. When you do the forcible unmount, these files get > >> > cleared and the space shows back up. > >> > > >> > What I need to devise is a small test program doing the set of system > >> > calls that cause this to happen. The way that I would like to try and > >> > get it is to have you `ktrace -i' your application and then run your > >> > application just long enough to create at least one of these lost > >> > files. The goal is to minimize the amount of ktrace data through > >> > which we need to sift. > >> > > >> > In preparation for doing this test you need to have a kernel compiled > >> > with `option DIAGNOSTIC' or if you prefer, just add `#define > >> > DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you > >> > have at least one offending file when you try to unmount the affected > >> > filesystem and find it busy. Before doing the `umount -f', enable > >> > busy printing using `sysctl debug.busyprt=3D1'. Then capture the > >> > console output which will show the details of all the vnodes that had > >> > to be forcibly flushed. Hopefully we will then be able to correlate > >> > them back to the files (NAMI in the ktrace output) with which they > >> > were associated. We may need to augment the NAMI data with the inode > >> > number of the associated file to make the association with the > >> > busyprt output. Anyway, once we have that, we can look at all the > >> > system calls done on those files and create a small test program that > >> > exhibits the problem. Given a small test program, Jeff or I can track > >> > down the offending system call path and nail this pernicious bug once > >> > and for all. > >> > > >> > Kirk McKusick > >> > >> Hi, > >> > >> I have run ktrace -i on pg_ctl (which forks off all the postgresql > >> processes) and I got two "busy" files that where "lost" after a few > >> hours. dmesg reveals this: > >> > >> vflush: busy vnode > >> 0xfffffe067cdde960: tag ufs, type VREG > >> usecount 1, writecount 0, refcount 2 mountedhere 0 > >> flags (VI(0x200)) > >> VI_LOCKed v_object 0xfffffe0335922000 ref 0 pages 0 > >> lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) > >> ino 11047146, on dev da2s1d > >> vflush: busy vnode > >> 0xfffffe039f35bb40: tag ufs, type VREG > >> usecount 1, writecount 0, refcount 3 mountedhere 0 > >> flags (VI(0x200)) > >> VI_LOCKed v_object 0xfffffe03352701d0 ref 0 pages 0 > >> lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) > >> ino 11045961, on dev da2s1d > >> > >> > >> I had to umount -f, so they where "lost". > >> > >> So, now I have 55 GB ktrace output... ;) Is there anything I can do to > >> filter it, or shall I compress it and put it on a web server for you to > >> fetch as it is? > > > > I think that 55GB of ktrace is obviously useless. The Kirk' idea was to > > have an isolated test case that would only create the situation trigger= ing > > the leak, without irrelevant activity. This indeed requires drilling d= own > > and isolating the file activities to get to the core of problem. > > > > FWIW, I and Peter Holm used the following alternative approach quite > > successfully when tracking down other vnode reference leaks. The appro= ach > > still requires some understanding of the specifics of the problematic > > files to be useful, but not as much as isolated test. > > > > Basically, you take the patch below, and set the VV_DEBUGVREF flag for > > the vnode that has characteristics as much specific for the leaked vnode > > as possible. The patch has example of setting the flag for all new NFS > > vnodes. You would probably want to do the same in vfs_vgetf(), > > checking e.g. for the partition where your leaks happen. The limiting > > of the vnodes for which the vref traces are accumulated is needed to > > save the kernel memory. > > > > Then after the leak was observed, you just print the vnode with ddb > > command 'show vnode addr' and send the output to developer. > > > > Index: sys/sys/vnode.h > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > --- sys/sys/vnode.h (revision 248723) > > +++ sys/sys/vnode.h (working copy) > > @@ -94,6 +94,13 @@ struct vpollinfo { > > > > #if defined(_KERNEL) || defined(_KVM_VNODE) > > > > +struct debug_ref { > > + TAILQ_ENTRY(debug_ref) link; > > + int val; > > + const char *op; > > + struct stack stack; > > +}; > > + > > struct vnode { > > /* > > * Fields which define the identity of the vnode. These fields= are > > @@ -169,6 +176,7 @@ struct vnode { > > int v_writecount; /* v ref count of write= rs */ > > u_int v_hash; > > enum vtype v_type; /* u vnode type */ > > + TAILQ_HEAD(, debug_ref) v_debug_ref; > > }; > > > > #endif /* defined(_KERNEL) || defined(_KVM_VNODE) */ > > @@ -253,6 +261,7 @@ struct xvnode { > > #define VV_DELETED 0x0400 /* should be removed */ > > #define VV_MD 0x0800 /* vnode backs the md device */ > > #define VV_FORCEINSMQ 0x1000 /* force the insmntque to succe= ed */ > > +#define VV_DEBUGVREF 0x2000 > > > > /* > > * Vnode attributes. A field value of VNOVAL represents a field whose= value > > Index: sys/kern/vfs_subr.c > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > --- sys/kern/vfs_subr.c (revision 248723) > > +++ sys/kern/vfs_subr.c (working copy) > > @@ -71,6 +71,7 @@ __FBSDID("$FreeBSD$"); > > #include <sys/sched.h> > > #include <sys/sleepqueue.h> > > #include <sys/smp.h> > > +#include <sys/stack.h> > > #include <sys/stat.h> > > #include <sys/sysctl.h> > > #include <sys/syslog.h> > > @@ -871,6 +872,23 @@ static struct kproc_desc vnlru_kp =3D { > > }; > > SYSINIT(vnlru, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start, > > &vnlru_kp); > > + > > +MALLOC_DEFINE(M_RECORD_REF, "recordref", "recordref"); > > +static void > > +v_record_ref(struct vnode *vp, int val, const char *op) > > +{ > > + struct debug_ref *r; > > + > > + if ((vp->v_type !=3D VREG && vp->v_type !=3D VBAD) || > > + (vp->v_vflag & VV_DEBUGVREF) =3D=3D 0) > > + return; > > + r =3D malloc(sizeof(struct debug_ref), M_RECORD_REF, M_NOWAIT | > > + M_USE_RESERVE); > > + r->val =3D val; > > + r->op =3D op; > > + stack_save(&r->stack); > > + TAILQ_INSERT_TAIL(&vp->v_debug_ref, r, link); > > +} > > > > /* > > * Routines having to do with the management of the vnode table. > > @@ -1073,6 +1091,7 @@ alloc: > > vp->v_vflag |=3D VV_NOKNOTE; > > } > > rangelock_init(&vp->v_rl); > > + TAILQ_INIT(&vp->v_debug_ref); > > > > /* > > * For the filesystems which do not use vfs_hash_insert(), > > @@ -1082,6 +1101,7 @@ alloc: > > */ > > vp->v_hash =3D (uintptr_t)vp >> vnsz2log; > > > > + TAILQ_INIT(&vp->v_debug_ref); > > *vpp =3D vp; > > return (0); > > } > > @@ -2197,6 +2217,7 @@ vget(struct vnode *vp, int flags, struct thread *t > > vinactive(vp, td); > > vp->v_iflag &=3D ~VI_OWEINACT; > > } > > + v_record_ref(vp, 1, "vget"); > > VI_UNLOCK(vp); > > return (0); > > } > > @@ -2211,6 +2232,7 @@ vref(struct vnode *vp) > > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > > VI_LOCK(vp); > > v_incr_usecount(vp); > > + v_record_ref(vp, 1, "vref"); > > VI_UNLOCK(vp); > > } > > > > @@ -2253,6 +2275,7 @@ vputx(struct vnode *vp, int func) > > KASSERT(func =3D=3D VPUTX_VRELE, ("vputx: wrong func")); > > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > > VI_LOCK(vp); > > + v_record_ref(vp, -1, "vputx"); > > > > /* Skip this v_writecount check if we're going to panic below. = */ > > VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < = 1, vp, > > @@ -2409,6 +2432,7 @@ void > > vdropl(struct vnode *vp) > > { > > struct bufobj *bo; > > + struct debug_ref *r, *r1; > > struct mount *mp; > > int active; > > > > @@ -2489,6 +2513,9 @@ vdropl(struct vnode *vp) > > lockdestroy(vp->v_vnlock); > > mtx_destroy(&vp->v_interlock); > > mtx_destroy(BO_MTX(bo)); > > + TAILQ_FOREACH_SAFE(r, &vp->v_debug_ref, link, r1) { > > + free(r, M_RECORD_REF); > > + } > > uma_zfree(vnode_zone, vp); > > } > > > > @@ -2888,6 +2915,8 @@ vn_printf(struct vnode *vp, const char *fmt, ...) > > va_list ap; > > char buf[256], buf2[16]; > > u_long flags; > > + int ref; > > + struct debug_ref *r; > > > > va_start(ap, fmt); > > vprintf(fmt, ap); > > @@ -2960,8 +2989,21 @@ vn_printf(struct vnode *vp, const char *fmt, ...) > > vp->v_object->resident_page_count); > > printf(" "); > > lockmgr_printinfo(vp->v_vnlock); > > - if (vp->v_data !=3D NULL) > > - VOP_PRINT(vp); > > +#if DDB > > + if (kdb_active) { > > + if (vp->v_data !=3D NULL) > > + VOP_PRINT(vp); > > + } > > +#endif > > + > > + /* Getnewvnode() initial reference is not recorded due to VNON = */ > > + ref =3D 1; > > + TAILQ_FOREACH(r, &vp->v_debug_ref, link) { > > + ref +=3D r->val; > > + printf("REF %d %s\n", ref, r->op); > > + stack_print(&r->stack); > > + } > > + > > } > > > > #ifdef DDB > > Index: sys/fs/nfsclient/nfs_clport.c > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > --- sys/fs/nfsclient/nfs_clport.c (revision 248723) > > +++ sys/fs/nfsclient/nfs_clport.c (working copy) > > @@ -273,6 +273,7 @@ nfscl_nget(struct mount *mntp, struct vnode *dvp, > > /* vfs_hash_insert() vput()'s the losing vnode */ > > return (0); > > } > > + vp->v_vflag |=3D VV_DEBUGVREF; > > *npp =3D np; > > > > return (0); > > Index: sys/fs/nfsclient/nfs_clnode.c > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > --- sys/fs/nfsclient/nfs_clnode.c (revision 248723) > > +++ sys/fs/nfsclient/nfs_clnode.c (working copy) > > @@ -179,6 +179,7 @@ ncl_nget(struct mount *mntp, u_int8_t *fhp, int fh > > /* vfs_hash_insert() vput()'s the losing vnode */ > > return (0); > > } > > + vp->v_vflag |=3D VV_DEBUGVREF; > > *npp =3D np; > > > > return (0); --MrbiU6dcJfOZ616B Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR59AsAAoJEJDCuSvBvK1BYsMQAJZj+18pwjeJQfyBoaK4Rif7 FK32mFljb85RxdGs9nX4Pmq91/00B6lv+8HmdlydYbOw4qiRm8x/hNverAr2GHSf 5ArGEJeTCwfXheG+kulivTo+sMrapeyR6XN5THIHjglBjrSBu8nUrAyNzyjaOFRq 2tLDn/NdibMJeBUKkVWMV3L7cmrIw3snF+kJc6f/1iDnBahOKPADxAHo/N1Exg2s AC5wUuG+d5lrb/jFYaSoND1eDnWIVVu588GQuXrIo9N9GM9D+UVk1OM2pLEOISLM 7hL5mKagLD4wHpzW9FW6nlQjDcGQqJfkvYp+PqsVO6KGaVCk740N3rItDk9WGTfn WNMGl0i9rDopvuaOBX/BwLrwdN/TQaXTHPdVdOHjDWyjqlmaG2r37AYZ2OOVfvhr EH2UOAj1bcRfucGOrczxjSRFEI7honiOuw48RYZYNd4WUujSaA61vINdDXIhrYky /+kTwvGpoBxNvE7pMmUg1fI0Ww/Kp1QsaObh/Kb9KmbOKDSNc8luVfLNbU+EnQaS bfS9eUUkPvk86lgMLVoXRoXV707IF0r7SBosRgrc9IVc9xZjj4fNqDIvUiZrSoow hQOPX6X9fZ1DSCngp9DuL4GZocv3levC/8wirpAydi5dTHs2XoP66P8Opn+SBzct TRwTCYGSbZslIgTHuvMd =NmSV -----END PGP SIGNATURE----- --MrbiU6dcJfOZ616B--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130718112325.GZ5991>