Date: Fri, 18 Jan 2008 21:23:41 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Jan Harkes
Cc: freebsd-fs@freebsd.org
Subject: Re: Coda on FreeBSD problem reports?
Message-ID: <20080118211556.T46437@fledge.watson.org>
In-Reply-To: <20080118210621.GF7898@cs.cmu.edu>

On Fri, 18 Jan 2008, Jan Harkes wrote:

> On Fri, Jan 18, 2008 at 11:10:26AM +0000, Robert Watson wrote:
>> This is likely a VM interaction involving either an improperly managed
>> or unmanaged VM object for Coda vnodes.
>
> That sounds right.  I haven't looked at vobjects and how they are
> managed; I didn't even know FreeBSD had these.
>
> It sounds a lot like the i_mapping/address_space in the Linux kernel,
> and if these are even slightly similar, we would want to share the
> vobject between the Coda vnode and the cache/container vnode.

In FreeBSD, as in Mach from which the FreeBSD VM was derived, a VM object
is what holds the cached pages for a file.  VM objects are managed by a
pager; in the case of vnode-backed VM objects, this is the vnode pager
(src/sys/vm/vnode_pager.c).  When a memory mapping is created, the VM
object is referenced.  Whenever a page needs filling, the vnode pager
loads it using VOP_READ(), and when it gets bored (i.e., msync(), memory
pressure), it writes dirty pages back out using VOP_WRITE().  Due to the
magic of the merged VM/buffer cache, these are the same pages used by the
buffer cache, so if you do write() on the file, the change is visible
through mmap(), and vice versa for a store via the memory mapping (a
userland sketch of this follows below).

Vnodes can float around without VM objects, but they can't be mapped
without one, so normally we set up a VM object on open(), and then don't
GC the VM object until the vnode reference count hits zero and the vnode
falls out of memory.

>> I'd loosely guess the former if cache vnodes are reused between Coda
>> vnodes.
>
> Cache vnodes are reused, but only under very specific conditions, and
> for other reasons we are going to switch to unlinking / recreating
> them.

This sounds like a generally good and safe idea.
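As promised above, here is a small userland sketch of the merged
VM/buffer cache at work.  Untested, and nothing Coda-specific about it;
the file name and sizes are arbitrary:

/*
 * Untested sketch: with a merged VM/buffer cache, write(2) through a
 * descriptor and stores through a MAP_SHARED mapping of the same file
 * observe each other immediately, because both go through the same
 * pages.  The file name "demo.dat" is arbitrary.
 */
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[16];
	int fd = open("demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return (1);
	if (ftruncate(fd, 4096) != 0)		/* give the file one page */
		return (1);

	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
	    fd, 0);
	if (map == MAP_FAILED)
		return (1);

	/* Write via the descriptor; read it back through the mapping. */
	(void)pwrite(fd, "hello", 5, 0);
	printf("via mmap: %.5s\n", map);	/* prints "hello" */

	/* Store via the mapping; read it back via the descriptor. */
	memcpy(map, "world", 5);
	(void)pread(fd, buf, 5, 0);
	printf("via read: %.5s\n", buf);	/* prints "world" */

	munmap(map, 4096);
	close(fd);
	return (0);
}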
>> However, sharing makes more sense in other ways, as it means there
>> won't be data cache coherency problems between the Coda and cache VM
>> objects if both are written to simultaneously (or even not
>> simultaneously, given that when there's little memory pressure, pages
>> hang around for a long time).
>
> We never write simultaneously because of the session semantics plus
> whole-file caching.  When we get an open, the application using Coda is
> blocked until we know that all data has been copied to the cache file,
> and only then do we hand the reference to the cache file back to the
> kernel.  But we don't actually sync the dirty pages to disk, so if the
> Coda vnode uses its own vobject it would miss the few dirty pages that
> are still associated with the cache vnode's vobject.  It is also a huge
> performance benefit for a lot of short-lived files that are unlinked
> before their dirty pages have even hit the disk.
>
> So sharing these definitely seems like the cleaner solution.

Two things to be aware of:

(1) If the VM object is the one of the cache vnode, then when a page is
read from or written to disk, the I/O will bypass the Coda VOPs and go
directly to the cache VOPs, since the cache vnode's VM object uses the
cache vnode's vnode operation vector.

(2) Memory mappings can persist beyond close() -- i.e., you can open() a
file, mmap() it, and then close() it.  This means that writes can happen
"later", and since they hit the cache vnode operations rather than the
Coda ones, you won't get an explicit notification.  (A userland sketch of
this sequence follows the patch below.)

I've not tested it, but the attached patch may do something like what you
want.  I have some reservations about this approach, though, due to the
above concerns.

Robert N M Watson
Computer Laboratory
University of Cambridge

Index: coda_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/fs/coda/coda_vnops.c,v
retrieving revision 1.78
diff -u -r1.78 coda_vnops.c
--- coda_vnops.c	13 Jan 2008 14:44:02 -0000	1.78
+++ coda_vnops.c	17 Jan 2008 15:22:12 -0000
@@ -244,6 +244,8 @@
 	if (error) {
 		printf("coda_open: VOP_OPEN on container failed %d\n", error);
 		return (error);
+	} else {
+		(*vpp)->v_object = vp->v_object;
 	}
 	/* grab (above) does this when it calls newvnode unless it's in the cache*/
@@ -747,6 +749,8 @@
 	CODADEBUG(CODA_INACTIVE, myprintf(("in inactive, %s, vfsp %p\n",
 	    coda_f2s(&cp->c_fid), vp->v_mount));)
+
+	vp->v_object = NULL;

 	/* If an array has been allocated to hold the symlink, deallocate it */
 	if ((coda_symlink_cache) && (VALID_SYMLINK(cp))) {
@@ -1552,7 +1556,7 @@
 	cache_purge(vp);
 	coda_free(VTOC(vp));
 	vp->v_data = NULL;
-	vnode_destroy_vobject(vp);
+	vp->v_object = NULL;
 	return (0);
 }
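And the promised sketch of the sequence in (2): an untested userland
illustration (file name arbitrary) in which the mapping outlives the
descriptor, so the store and the writeback happen after close():

/*
 * Untested sketch of caveat (2): a MAP_SHARED mapping survives close(2),
 * so the store and msync() below reach the file long after the
 * descriptor is gone, with no open file reference left for Coda to hang
 * a notification on.
 */
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("demo.dat", O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return (1);
	if (ftruncate(fd, 4096) != 0)
		return (1);

	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
	    fd, 0);
	if (map == MAP_FAILED)
		return (1);

	close(fd);			/* descriptor is gone... */

	memcpy(map, "late write", 10);	/* ...but the mapping still works */
	msync(map, 4096, MS_SYNC);	/* dirty page written back "later" */
	munmap(map, 4096);
	return (0);
}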