Date:      Tue, 15 Mar 2005 20:11:05 -0500 (EST)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        arch@freebsd.org
Subject:   Re: Freeing vnodes.
Message-ID:  <20050315195525.F20708@mail.chesapeake.net>
In-Reply-To: <200503151911.j2FJBWpd055485@apollo.backplane.com>
References:  <20050314213038.V20708@mail.chesapeake.net> <20050315035032.T20708@mail.chesapeake.net> <200503151911.j2FJBWpd055485@apollo.backplane.com>

On Tue, 15 Mar 2005, Matthew Dillon wrote:

> :>     I think you did not intend this.  Didn't you just want to destroy
> :>     enough vnodes to have 'wantfreevnodes' worth of slop so getnewvnode()
> :>     could allocate new vnodes?  In that case the calculation would be:
> :
> :On my system wantfreevnodes is at 2500.  Let's say I have 4500 free
> :vnodes.  4500 - 2500 = 2000.  Divide by 2 gives you 1000.  I don't think
> :you read the whole patch.
>
>     I'm not trying to be confrontational here, Jeff.  Please remember that
>     I'm the one who has done most of the algorithmic work on these
>     subsystems.  I designed the whole 'trigger' mechanism, for example.

I wasn't being confrontational, I gave you a real example, that's all.
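
For reference, the count calculation in the patch boils down to roughly
this (a sketch; the local names are approximate):

	int count;

	/* Only trim when there are more free vnodes than we want to keep. */
	if (freevnodes > wantfreevnodes)
		count = (freevnodes - wantfreevnodes) / 2;
	else
		count = 0;
	/* e.g. freevnodes = 4500, wantfreevnodes = 2500 -> count = 1000 */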

>
>     The wantfreevnodes calculation is:  minvnodes / 10.  That's a very small
>     number.  The 'freevnodes' value is typically a much larger value,
>     especially if a program is running through stat()ing things.  It is possible
>     to have tens of thousands of free vnodes.  This makes your current
>     count calculation effectively 'freevnodes / 2'.  I really don't think
>     you want to destroy half the current freevnodes on each pass, do you?

I haven't seen my machine even get to 10,000 free vnodes with this patch
running.  Even while doing a 'find . -exec stat {} \; >> /dev/null'.  With
the new mechanism all of these vnodes would go away after a second.  It
might be nice to keep them around for longer if resources permit, but I do
think a timeout-based approach is the correct one.  Please note that with
the old code you wouldn't even keep them around for a second if we were
above minvnodes.  They would be recycled on the next call to
getnewvnode().
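
The timeout part is nothing fancier than this in the vnlru scan (a
sketch; the per-vnode timestamp field here is hypothetical):

	/*
	 * Leave recently-used free vnodes alone; anything that has sat
	 * unreferenced for a second or more is a candidate to recycle.
	 */
	if (time_second - vp->v_freetime < 1)
		continue;		/* still fresh, keep it cached */
	/* otherwise recycle it via the usual vtryrecycle()/vgonel() path */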

>
> :>     can be a HUGE load on getnewvnode() (think of cvsupd and find, or
> :>     a cvs update, etc...).  This load can easily outstrip vnlru_proc()'s
> :>     new ability to free vnodes and potentially cause a lot of unnecessary
> :>     blockages.
> :
> :We have one buf daemon, one page daemon, one syncer, one vnlru proc, etc.
> :In all these cases it would be nice if they gained new contexts when they
> :had a lot of work to do, but they don't, and it doesn't seem to be a huge
> :problem today.  On my system one vnlruproc easily keeps up with the job of
>
>     That's because they are carefully written (mostly by me) to not be
>     subject to pure cpu loads.
>
>     buf_daemon:	  Is primarily only responsible for flushing DIRTY buffers.
> 		  The buffer allocation code will happily reuse clean buffers
> 		  in-line.  Dirty buffers are subject to the I/O limitations
> 		  of the system (and they are flushed asynchronously for the
> 		  most part), which means that one daemon should have no
> 		  trouble handling the buffer load on an MP system.  Since a
> 		  system naturally has many more clean buffers than dirty
> 		  buffers (even without algorithmic limitations), except in
> 		  certain particular large-write cases which are handled
> 		  elsewhere, the buf_daemon usually has very little effect
> 		  on the buffer cache's ability to allocate a new buffer.
>
>     page_daemon:  Same deal.  The page daemon is primarily responsible for
> 		  flushing out dirty pages and for rebalancing the lists
> 		  if they get really out of whack.  Pages in the VM page
> 		  cache (PQ_CACHE) can be reused on the fly and there are
> 		  several *natural* ways for a page to go directly to the
> 		  VM page cache without having to pass through the page
> 		  daemon.  In fact, MOST of the pages that get onto the
> 		  PQ_CACHE or PQ_FREE queues are placed there directly by
> 		  mechanisms unrelated to the page daemon.
>
>     syncer:	  I've always wanted to rewrite the syncer to be per-mount
> 		  or per-physical-device so it could sync out to multiple
> 		  physical devices simultaneously.

The syncer is kind of bogus anyway, because it mostly just defeats the
buf daemon's delayed writes by forcing them all out at once.  Most of its
work is redundant; updating inodes is all it really needs to do.
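
If it were reduced to that, a per-mount (or per-device) version would
basically just be one kthread per mount point; very roughly (a sketch
only, none of this is real code):

	static void
	sync_mount_thread(void *arg)
	{
		struct mount *mp = arg;

		for (;;) {
			tsleep(mp, PVFS, "syncm", hz);	/* about once a second */
			/*
			 * Push this mount's dirty inodes/metadata here, e.g.
			 * an MNT_LAZY VFS_SYNC() (exact argument list depends
			 * on the branch), so mounts on different physical
			 * devices can sync in parallel.
			 */
		}
	}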

>
>     vnlru_proc:	  Prior to your patch, vnlru_proc was only responsible for
> 		  rebalancing the freevnode list.  Typically the ONLY case
> 		  where a vnode needs to be forcefully put on the freevnode
> 		  list is if there are a lot of vnodes which have VM objects
> 		  which still have just one or two VM pages associated with
> 		  them, because otherwise a vnode either gets put on the
> 		  freevnode list directly by the vnode release code, or it
> 		  has enough associated pages for us to not want to recycle
> 		  it anyway (which is what the trigger code handles).  The
> 		  mechanism that leads to the creation of such vnodes also
> 		  typically requires a lot of random I/O, which makes
> 		  vnlru_proc() immune to cpu load.  This means that
> 		  vnlru_proc is only PARTIALLY responsible for maintaining
> 		  the freevnode list and the part it is responsible for
> 		  tends to be unrelated to pure cpu loads.
>
> 		  There are a ton of ways for a vnode to make it to that
> 		  list WITHOUT passing through vnlru_proc, which means that
> 		  prior to your patch getnewvnode() typically only has to
> 		  wait for vnlru_proc() in the most extreme situations.

I still haven't seen any code wait on vnlru_proc.  I've tested it on a big
fast system with lots of disks.  I'll test it on a little slow system with
one disk too.  I understand what you're arguing; I'm just not sure it's
a real problem in practice.
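
To be concrete, the wait I keep looking for is the "vlruwk" sleep in
getnewvnode(), which from memory is roughly:

	mtx_lock(&vnode_free_list_mtx);
	if (numvnodes - freevnodes > desiredvnodes) {
		if (vnlruproc_sig == 0) {
			vnlruproc_sig = 1;	/* avoid unnecessary wakeups */
			wakeup(vnlruproc);
		}
		msleep(&vnlruproc_sig, &vnode_free_list_mtx, PVFS,
		    "vlruwk", hz);
	}

and I simply haven't caught anything sitting in that sleep.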

>
>     By my read, the changes you are currently contemplating for vnlru_proc
>     changes its characteristics such that it is now COMPLETELY responsible
>     for freeing up vnodes for getnewvnode().  This was not the case before.
>
>     I can only repeat that getnewvnode() has a massive dynamic loading range,
>     one that is not necessarily dependent on or limited by I/O.  For example,
>     when you are stat()ing a lot of files over and over again there is a good
>     chance that the related inodes are cached in the VM object representing
>     the backing store for the filesystem.  This means that getnewvnode() can
>     cycle very quickly, on the order of tens of thousands of vnodes per
>     second in certain situations.  By my read, you are forcing *ALL* the
>     vnode recycling activity to run through vnlru_proc() now.  The only way
>     now for getnewvnode() to get a new vnode is by allocating it out of
>     the zone.  This was not the case before.
>
> :freeing free vnodes.  Remember these vnodes have no pages associated with
> :them, so at most you're freeing an inode for a deleted file, and in the
> :common case the whole operation runs on memory without blocking for io.
> :...
> :We presently single thread the most critical case, where we have no free
> :vnodes and are not allowed to allocate any more while we wait for
> :vnlru_proc() to do io on vnodes with cached pages to reclaim some.  I'm
> :not convinced this is a real problem.
>
>     Which means that in systems with a large amount of memory (large VM page
>     cache) doing certain operations (such as stat()ing a large number
>     of files, e.g. a find or cvsupd), where the file set is larger than
>     the number of vnodes available, will now have to cycle all of those
>     vnodes through a single thread in order to reuse them.

Doesn't seem to be a problem with any workload I've been able to devise.
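
For what it's worth, with the patch the allocation side of getnewvnode()
is basically just a zone allocation plus initialization; roughly (a
sketch, not the patch verbatim):

	vp = uma_zalloc(vnode_zone, M_WAITOK);
	/* ...initialize the new vnode; no recycling and no I/O on this path... */

so the process that is cycling vnodes never has to do the I/O itself.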

>
>     The current pre-patch case is very different.  With your patch,
>     in addition to the issues already mentioned, the inode synchronization
>     is now being single-threaded and while the writes are asynchronous,
>     the reads are not (if the inode happens to not be in the VM page cache
>     any more because it's been cached so long the system has decided to
>     throw away the page to accommodate other cached data).

If it really ends up being a problem we can spin up one thread per proc.
As it doesn't seem to be a problem at all with a single thread, I'm sure
that would be sufficient to handle any extreme cases.

>
>     In the current pre-patch case, that read load was distributed over ALL
>     processes trying to do a getnewvnode().  i.e. it was a parallel read
>     load that actually scaled fairly well to load.
>
> :>     I love the idea of being able to free vnodes in vnlru_proc() rather
> :>     than free-and-reuse them in allocvnode(), but I cannot figure out how
> :>     vnlru_proc() could possibly adapt to the huge load range that
> :>     getnewvnode() has to deal with.  Plus keep in mind that the vnodes
> :>     being reused at that point are basically already dead except for
> :>     the vgonel().
> :>
> :>     This brings up the true crux of the problem, where the true overhead
> :>     of reusing a vnode inline with the getnewvnode() call is... and that
> :>     is that vgonel() potentially has to update the related inode and could
> :>     cause an unrelated process to block inside getnewvnode().  But even
> :
> :Yes, this is kind of gross, and would cause lock order problems except
> :that we LK_NOWAIT on the vn lock in vtryrecycle().  It'd be better if we
> :didn't try doing io on unrelated vnodes while this deep in the stack.
>
>     I agree.  It is gross, though I will note that the fact that the vnode
>     is ON the free list tends to mean that it isn't being referenced by
>     anyone so there should not be any significant lock ordering issues.
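
Right, and to be clear about what I meant: the non-blocking attempt in
vtryrecycle() is just something like (paraphrasing, not the exact code):

	/* Bail out rather than sleeping on a vnode somebody else holds. */
	if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT, td) != 0)
		return (EBUSY);

so we never actually wait on an unrelated vnode's lock there, we just
skip it.
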
>
>     I haven't 'fixed' this in DragonFly because I haven't been able to
>     figure out how to distribute the recycling load and deal with the
>     huge dynamic loading range that getnewvnode() has.
>
>     I've been working on the buffer cache code since, what, 1998?  These
>     are real issues.   It's always very easy to design algorithms that
>     work for specific machine configurations, the trick is to make them
>     work across the board.
>
>     One thing I LIKE about your code is the concept of being able to reuse
>     a vnode (or in your case allocate a new vnode) without having to perform
>     any I/O.  The re-use case in the old code always has the potential to
>     block an unrelated process if it has to do I/O recycling the vnode it
>     wants to reuse.  But this is a very easy effect to accomplish simply by
>     leaving the recycling code in getnewvnode() intact but STILL adding new
>     code to vnlru_proc() to ensure that a minimum number of vnodes are
> 		  truly reusable without having to perform any I/O.  This would enhance
>     light-load (light getnewvnode() load that is) performance.  It would
>     have virtually no effect under heavier loads, which is why the vnode
>     re-use code in getnewvnode() would have to stay, but the light-load
>     benefit is undeniable.
>
> 					-Matt
> 					Matthew Dillon
> 					<dillon@backplane.com>
>
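
For reference, the hybrid you're suggesting - keep the recycling in
getnewvnode() but have vnlru_proc() keep a minimum pool of I/O-free
vnodes topped up - would look roughly like this in vnlru_proc()'s loop
(a sketch only; the threshold and helpers here are made up):

	/* Called with vnode_free_list_mtx held. */
	struct vnode *vp;
	int clean = 0;

	TAILQ_FOREACH(vp, &vnode_free_list, v_freelist) {
		if (clean >= vnlru_clean_min)
			break;
		if (!vnode_is_io_free(vp))	/* hypothetical check */
			vnode_flush_async(vp);	/* hypothetical helper */
		clean++;	/* clean now, or will be once the write completes */
	}

The point being that any I/O happens in the daemon, not in getnewvnode().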


