From: Matthew Dillon <dillon@apollo.backplane.com>
Date: Tue, 15 Mar 2005 11:11:32 -0800 (PST)
Message-Id: <200503151911.j2FJBWpd055485@apollo.backplane.com>
To: Jeff Roberson
Cc: arch@freebsd.org
Subject: Re: Freeing vnodes.
List-Id: Discussion related to FreeBSD architecture (freebsd-arch)

:> I think you did not intend this.  Didn't you just want to destroy
:> enough vnodes to have 'wantfreevnodes' worth of slop so getnewvnode()
:> could allocate new vnodes?  In that case the calculation would be:
:
:On my system wantfreevnodes is at 2500.  Let's say I have 4500 free
:vnodes.  4500 - 2500 = 2000.  Divide by 2 gives you 1000.  I don't think
:you read the whole patch.

    I'm not trying to be confrontational here, Jeff.  Please remember that
    I'm the one who has done most of the algorithmic work on these
    subsystems.  I designed the whole 'trigger' mechanism, for example.
    The wantfreevnodes calculation is:  minvnodes / 10.  That's a very
    small number.  The 'freevnodes' value is typically much larger,
    especially if a program is running through the filesystem stat()ing
    things.  It is possible to have tens of thousands of free vnodes.
    This makes your current count calculation effectively
    'freevnodes / 2'.  I really don't think you want to destroy half the
    current freevnodes on each pass, do you?

:> can be a HUGE load on getnewvnode() (think of cvsupd and find, or
:> a cvs update, etc...).  This load can easily outstrip vnlru_proc()'s
:> new ability to free vnodes and potentially cause a lot of unnecessary
:> blockages.
:
:We have one buf daemon, one page daemon, one syncer, one vnlru proc, etc.
:In all these cases it would be nice if they gained new contexts when they
:had a lot of work to do, but they don't, and it doesn't seem to be a huge
:problem today.  On my system one vnlruproc easily keeps up with the job of

    That's because they are carefully written (mostly by me) to not be
    subject to pure cpu loads.

    buf_daemon:  Is primarily responsible only for flushing DIRTY
    buffers.  The buffer allocation code will happily reuse clean
    buffers in-line.  Dirty buffers are subject to the I/O limitations
    of the system (and they are flushed asynchronously for the most
    part), which means that one daemon should have no trouble handling
    the buffer load on an MP system.  Since a system naturally has many
    more clean buffers than dirty buffers (even without algorithmic
    limitations), except in certain large-write cases which are handled
    elsewhere, the buf_daemon usually has very little effect on the
    buffer cache's ability to allocate a new buffer.

    page_daemon:  Same deal.  The page daemon is primarily responsible
    for flushing out dirty pages and for rebalancing the lists if they
    get really out of whack.
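    To make the arithmetic concrete, here is a minimal userspace sketch
    of the count calculation being debated.  It is illustrative only, not
    the actual kernel code; the function and variable names are mine:

```c
#include <assert.h>

/*
 * Illustrative sketch (not the real kernel code): with
 * wantfreevnodes = minvnodes / 10, the patch under discussion would
 * try to destroy (freevnodes - wantfreevnodes) / 2 vnodes per pass.
 */
static long
trim_count(long freevnodes, long minvnodes)
{
	long wantfreevnodes = minvnodes / 10;

	if (freevnodes <= wantfreevnodes)
		return (0);
	return ((freevnodes - wantfreevnodes) / 2);
}
```

    With Jeff's numbers (wantfreevnodes 2500, freevnodes 4500) that
    yields 1000, but with tens of thousands of free vnodes, say
    freevnodes = 50000 against the same wantfreevnodes, it yields 23750,
    i.e. essentially freevnodes / 2 per pass.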
    Pages in the VM page cache (PQ_CACHE) can be reused on the fly, and
    there are several *natural* ways for a page to go directly to the VM
    page cache without having to pass through the page daemon.  In fact,
    MOST of the pages that get onto the PQ_CACHE or PQ_FREE queues are
    placed there directly by mechanisms unrelated to the page daemon.

    syncer:  I've always wanted to rewrite the syncer to be per-mount or
    per-physical-device so it could sync out to multiple physical
    devices simultaneously.

    vnlru_proc:  Prior to your patch, vnlru_proc was only responsible
    for rebalancing the freevnode list.  Typically the ONLY case where a
    vnode needs to be forcefully put on the freevnode list is if there
    are a lot of vnodes whose VM objects still have just one or two VM
    pages associated with them, because otherwise a vnode either gets
    put on the freevnode list directly by the vnode release code, or it
    has enough associated pages for us to not want to recycle it anyway
    (which is what the trigger code handles).  The mechanism that leads
    to the creation of such vnodes also typically requires a lot of
    random I/O, which makes vnlru_proc() immune to cpu load.

    This means that vnlru_proc is only PARTIALLY responsible for
    maintaining the freevnode list, and the part it IS responsible for
    tends to be unrelated to pure cpu loads.  There are a ton of ways
    for a vnode to make it to that list WITHOUT passing through
    vnlru_proc, which means that prior to your patch getnewvnode()
    typically only had to wait for vnlru_proc() in the most extreme
    situations.

    By my read, the changes you are currently contemplating for
    vnlru_proc change its characteristics such that it is now COMPLETELY
    responsible for freeing up vnodes for getnewvnode().  This was not
    the case before.  I can only repeat that getnewvnode() has a massive
    dynamic loading range, one that is not necessarily dependent on or
    limited by I/O.
    For example, when you are stat()ing a lot of files over and over
    again there is a good chance that the related inodes are cached in
    the VM object representing the backing store for the filesystem.
    This means that getnewvnode() can cycle very quickly, on the order
    of tens of thousands of vnodes per second in certain situations.  By
    my read, you are forcing *ALL* the vnode recycling activity to run
    through vnlru_proc() now.  The only way now for getnewvnode() to get
    a new vnode is by allocating it out of the zone.  This was not the
    case before.

:freeing free vnodes.  Remember these vnodes have no pages associated with
:them, so at most you're freeing an inode for a deleted file, and in the
:common case the whole operation runs on memory without blocking for io.
:...
:We presently single thread the most critical case, where we have no free
:vnodes and are not allowed to allocate any more while we wait for
:vnlru_proc() to do io on vnodes with cached pages to reclaim some.  I'm
:not convinced this is a real problem.

    Which means that systems with a large amount of memory (a large VM
    page cache) doing certain operations (such as stat()ing a large
    number of files, e.g. a find or cvsupd), where the file set is
    larger than the number of vnodes available, will now have to cycle
    all of those vnodes through a single thread in order to reuse them.
    The current pre-patch case is very different.

    With your patch, in addition to the issues already mentioned, the
    inode synchronization is now being single-threaded, and while the
    writes are asynchronous, the reads are not (if the inode happens to
    no longer be in the VM page cache because it's been cached so long
    the system has decided to throw away the page to accommodate other
    cached data).  In the current pre-patch case, that read load was
    distributed over ALL processes trying to do a getnewvnode(), i.e. it
    was a parallel read load that actually scaled fairly well to load.
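    As an aside, the 'trigger' behavior I referred to earlier can be
    modeled very roughly in userspace.  The structure and names below
    are invented for illustration and are not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rough userspace model of the 'trigger' idea: when walking the free
 * list, only vnodes whose VM object retains a mere handful of pages
 * are worth forcefully reclaiming; anything with a richer cache is
 * left alone so its cached data survives.
 */
struct fake_vnode {
	int			resident_pages;	/* pages cached in the VM object */
	struct fake_vnode	*next;
};

static int
scan_freelist(struct fake_vnode *vp, int trigger)
{
	int reclaimed = 0;

	for (; vp != NULL; vp = vp->next) {
		if (vp->resident_pages <= trigger)
			reclaimed++;	/* vgone-style teardown would go here */
	}
	return (reclaimed);
}
```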
:> I love the idea of being able to free vnodes in vnlru_proc() rather
:> than free-and-reuse them in allocvnode(), but I cannot figure out how
:> vnlru_proc() could possibly adapt to the huge load range that
:> getnewvnode() has to deal with.  Plus keep in mind that the vnodes
:> being reused at that point are basically already dead except for
:> the vgonel().
:>
:> This brings up the true crux of the problem, where the true overhead
:> of reusing a vnode inline with the getnewvnode() call is... and that
:> is that vgonel() potentially has to update the related inode and could
:> cause an unrelated process to block inside getnewvnode().  But even
:
:Yes, this is kind of gross, and would cause lock order problems except
:that we LK_NOWAIT on the vn lock in vtryrecycle().  It'd be better if we
:didn't try doing io on unrelated vnodes while this deep in the stack.

    I agree.  It is gross, though I will note that the fact that the
    vnode is ON the free list tends to mean that it isn't being
    referenced by anyone, so there should not be any significant lock
    ordering issues.

    I haven't 'fixed' this in DragonFly because I haven't been able to
    figure out how to distribute the recycling load and deal with the
    huge dynamic loading range that getnewvnode() has.  I've been
    working on the buffer cache code since, what, 1998?  These are real
    issues.  It's always very easy to design algorithms that work for
    specific machine configurations; the trick is to make them work
    across the board.

    One thing I LIKE about your code is the concept of being able to
    reuse a vnode (or in your case allocate a new vnode) without having
    to perform any I/O.  The re-use case in the old code always has the
    potential to block an unrelated process if it has to do I/O
    recycling the vnode it wants to reuse.
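    The LK_NOWAIT trick Jeff mentions can be sketched in userspace with
    a trylock.  Again, the names here are illustrative, not the
    kernel's; the point is only that the scan skips a busy vnode rather
    than sleeping on it deep inside an allocation path:

```c
#include <assert.h>
#include <pthread.h>

/*
 * Userspace sketch of the LK_NOWAIT pattern: only recycle a vnode
 * whose lock can be acquired without sleeping, so the free-list scan
 * never blocks on an unrelated, busy vnode.
 */
struct fake_vnode {
	pthread_mutex_t	v_lock;
	int		recycled;
};

static int
vtryrecycle_sketch(struct fake_vnode *vp)
{
	if (pthread_mutex_trylock(&vp->v_lock) != 0)
		return (-1);		/* busy: skip it, try the next one */
	vp->recycled = 1;		/* vgone-style teardown would go here */
	pthread_mutex_unlock(&vp->v_lock);
	return (0);
}
```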
    But this is a very easy effect to accomplish simply by leaving the
    recycling code in getnewvnode() intact, but STILL adding new code to
    vnlru_proc() to ensure that a minimum number of vnodes are truly
    reusable without having to perform any I/O.  This would enhance
    light-load (light getnewvnode() load, that is) performance.  It
    would have virtually no effect under heavier loads, which is why the
    vnode re-use code in getnewvnode() would have to stay, but the
    light-load benefit is undeniable.

					-Matt
					Matthew Dillon