Date:      Tue, 1 Apr 1997 11:19:28 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        pmchen@eecs.umich.edu (Peter M. Chen)
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: question on buffer cache and VMIO
Message-ID:  <199704011819.LAA11677@phaeton.artisoft.com>
In-Reply-To: <199704011439.JAA26225@life.eecs.umich.edu> from "Peter M. Chen" at Apr 1, 97 09:39:42 am

> Hi,
>   I'm starting to hack on FreeBSD and had some questions.
> 
> Background:
>     My goal is to make the file cache reliable without writing to
> disk under normal operation (see Rio paper in ASPLOS 1996).  We
> protect against power loss by using a battery; we protect against
> kernel crashes by write-protecting the file cache.  The main
> benefit is you get reliability equivalent to mounting the file
> system sync, yet with the performance of async (actually even
> better, since FreeBSD's async option still does substantial
> amounts of I/O).  This works really well for mmap'ed files, which
> makes it possible to have VERY fast transactions.  It should also
> extend the battery life for portables.
> See http://www.eecs.umich.edu/~pmchen/ for more info.

This is hardware... is it going to be generally available/purchasable?

One big issue: I'm not sure how this will interact with the "soft
updates" facility described in the Ganger/Patt paper.  This facility
is a planned future feature on nearly everyone's whiteboards.  If you
have read the paper, I can discuss the idea of a generic implementation
that is not FS dependent (like the paper's Appendix A UFS example is)
and the ramifications that would have on FS/cache interaction.


> Questions re: buffer cache and VMIO
>     1) What's the relationship between the buffer cache and VM cache (I'm
> 	not sure this is the proper terminology)?  The buffer cache seems to
> 	hold file data and metadata, while the VM cache holds only file data.
> 	Do buffer headers point to the VM cache data?  Can the VM cache hold
> 	file data that is not in the buffer cache?

The VM cache and the buffer cache have been unified.  File buffers are chains
of pages hung off of the in-core vnodes for the inodes backing them, and a
buffer header simply maps a range of those VM pages rather than holding a
private copy of the data.  RAM is for instantiating virtual pages, and for
allocating kernel resources.
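
To make that concrete, here is a toy user-space model of the arrangement
(the struct names mirror the kernel's, but the fields are simplified
stand-ins for illustration, not the real declarations): the vnode owns a
VM object full of file pages, and a buffer header is just a window onto a
run of those pages, not a separate copy.

	#include <stdio.h>
	#include <stdlib.h>

	#define PGSIZE	4096
	#define NPAGES	16		/* toy object: 16 pages of file data */

	/* toy stand-in for a VM page of file data */
	struct page {
		long	offset;		/* byte offset within the file */
		char	data[PGSIZE];
	};

	/* toy stand-in for the per-vnode VM object */
	struct object {
		struct page *pages[NPAGES];	/* resident pages, by page number */
	};

	/* toy stand-in for an in-core vnode */
	struct vnode {
		struct object *v_object;	/* cache of this file's data */
	};

	/* toy stand-in for a buffer header: a window onto object pages */
	struct buf {
		struct vnode *b_vp;	/* file this buffer belongs to */
		long	b_offset;	/* starting byte offset */
		int	b_npages;	/* pages covered by this buffer */
		struct page *b_pages[4];	/* pointers INTO the object */
	};

	/* map a buffer over pages already resident in the vnode's object */
	static void
	buf_map(struct buf *bp, struct vnode *vp, long offset, int npages)
	{
		int i, pg = offset / PGSIZE;

		bp->b_vp = vp;
		bp->b_offset = offset;
		bp->b_npages = npages;
		for (i = 0; i < npages; i++)
			bp->b_pages[i] = vp->v_object->pages[pg + i];
	}

	int
	main(void)
	{
		struct object obj;
		struct vnode vp;
		struct buf bp;
		int i;

		for (i = 0; i < NPAGES; i++) {
			obj.pages[i] = calloc(1, sizeof(struct page));
			obj.pages[i]->offset = (long)i * PGSIZE;
		}
		vp.v_object = &obj;

		/* a "buffer" covering four pages starting at offset 8192 */
		buf_map(&bp, &vp, 2 * PGSIZE, 4);
		printf("buffer maps file offsets %ld..%ld with no data copy\n",
		    bp.b_pages[0]->offset,
		    bp.b_pages[3]->offset + PGSIZE - 1);
		return (0);
	}

The point is that buf_map() copies page pointers, not page contents; that
sharing is what "unified" buys you.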

John Dyson and David Greenman are the people to contact about documenting
this in detail.


>     2) The buffer cache seems small relative to the physical memory.  E.g.
> 	on a 64 MB machine (51 MB available), maxbufspace defaults to only
> 	6.3 MB.  For working sets larger than this, there appears to be
> 	significant overhead in frequently moving data between the VM
> 	cache and buffer cache.  Would it make sense to set NBUF larger
> 	(e.g. enough to have the buffer cache fill memory)?

This is a different issue; it has to do with the ratio of in-core pages
backed by files to those not backed by files.


>     3) Dirty data gets written to disk when it leaves the buffer cache, even
> 	if it is also in the VM cache.  This makes sense normally (since
> 	Unix traditionally bawrote data to disk as soon as it filled a block),
> 	but this prevents my keeping lots of dirty file data around.

This issue boils down to what happens to a vnode when it is about to be
reused.  Effectively, there can be "good" cache data in core but no vnode
which references it.  When the vnode reference is being destroyed, dirty
pages are forced out, and the data will then need to be re-read.

The main benefit of a vnode/extent based caching mechanism over a
device/extent based one is that you effectively remove the device and FS
size limitations inherent in the processor architecture's limits on the VM
system: the limit applies on a per-vnode basis instead of on a per-device
basis.  For example, with 32-bit pager offsets a device/extent cache would
cap each device (and therefore each FS) at 4GB, while a vnode/extent cache
only caps individual files at that size.

The main drawback is that pages in core which contain valid data, but
without vnode references, must be reloaded from disk in order to be
used.

Part of this problem is that vnodes are referenced as entities that
are separate from the per-FS on-disk data which backs their instantiation;
that is, there is a global vnode pool, with a discrete data pointer from
the vnode to the per-FS data, rather than the vnodes being allocated
as part of the in-core FS data object... i.e.:

	/* in core vnode*/
	struct vnode {
		void *v_data;	/* fs specific data*/
	};

	/* in core inode*/
	struct inode {
		struct	vnode *i_vnode;	/* vnode for this inode*/
		struct	dinode i_din;	/* on disk inode*/
	};

Instead of the more desirable:

	/* in core inode*/
	struct inode {
		struct	vnode i_vnode;	/* vnode instance for this inode*/
		struct	dinode i_din;	/* on disk inode*/
	};

Part of the reasoning behind this becomes clear if you follow the use
of "struct fileops" in the kernel, something which should probably
go away as quickly as possible (the vnode v_un pointer union is an
artifact of this same patchwork glue).

>     4) What happens to mmap'ed data?  Does it reside in the VM cache?  Are
> 	there buffer headers for mmap'ed data?

All mmap'ed data is treated as file data; unlike System V shared memory,
it does not need to remain in core at all times.  (SYSV SHM does, because
it is established in the kernel map rather than in an independent file
map; there is no concept of backing store other than a file, and SYSV SHM
was not, last time I looked, swappable.)
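
For comparison with the SHM case, here is a minimal user-space sketch of
the kind of file-backed mapping being discussed (the path /tmp/rio.dat and
the one-page size are made up for illustration): the dirty pages live in
the file's VM object and can be written back, which is the sort of data
the Rio scheme cares about.

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int
	main(void)
	{
		int fd;
		char *p;
		size_t len = 4096;

		/* "/tmp/rio.dat" is just an example path */
		fd = open("/tmp/rio.dat", O_RDWR | O_CREAT, 0644);
		if (fd < 0 || ftruncate(fd, len) < 0) {
			perror("open/ftruncate");
			return (1);
		}

		/* file-backed pages: dirty data can be paged to the file */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return (1);
		}

		strcpy(p, "dirty data living in the file's VM object");
		msync(p, len, MS_ASYNC);	/* or rely on the syncer/pageout */

		munmap(p, len);
		close(fd);
		return (0);
	}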


>     5) I came across a strange phenomenon when trying to get rid of all
> 	disk writes in ufs_remove.  Even if i_nlink goes to 0, the file
> 	is still fsync'ed.  The call graph is: ufs_remove -> vput -> vrele ->
> 	vm_object_deallocate -> vm_object_terminate -> vinvalbuf (with V_SAVE)
> 	-> VOP_FSYNC.  The system fails when I have vm_object_terminate
> 	check for i_nlink and call vinvalbuf without V_SAVE.  Can someone
> 	explain why a deleted file needs to get fsynced (note that this
> 	isn't the directory, but the actual file)?

It probably does not.  8-(.  I expect the problem is where the directory
vs. open-instance reference counting occurs.


> General kernel questions:
>     1) I'd like the ability to read and write kernel global variables (without
> 	going to ddb).  I tried kvm, but that only works for variables in
> 	i386/i386/symbols.raw.  kgdb only works for off-line core dumps.
> 	I finally used nm to get the symbol address and directly read and
> 	wrote to /dev/kmem.  This works fine, but I was wondering if there's
> 	a better solution that exists already.

Are these experimentation tunables?  If they are, you should probably
compile with debug and use SYSCTL_INT() to install debugging variables
into the sysctl tree.  There is a good example of this in the kernel
source file /sys/ufs/ffs/ffs_alloc.c.

Alternately, if these are permanent tunables, there is a good example
of doing this for VFS in /sys/kern/vfs_init.c.

To add your own hierarchy node, SYSCTL_NODE() in /sys/kern/subr_prof.c
is a good example.

If you are running a "cleaner" or similar process, you may need a
process parameter.  SYSCTL_PROC in /sys/kern/kern_proc.c is the
example I'd use for that.
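
As a concrete (if hypothetical) sketch of the SYSCTL_INT() route: the
variable name "rio_debug" and the choice of the existing "debug" node are
made up for illustration, and I'm assuming the seven-argument form of the
macro (parent, oid, name, flags, pointer, default, description); copy the
exact macro arguments and includes from ffs_alloc.c for the kernel version
you are running.

	/* somewhere in your kernel source file */
	#include <sys/param.h>
	#include <sys/kernel.h>
	#include <sys/sysctl.h>

	static int rio_debug = 0;	/* hypothetical Rio debugging knob */

	/* export it read/write under the existing "debug" sysctl node */
	SYSCTL_INT(_debug, OID_AUTO, rio_debug, CTLFLAG_RW,
	    &rio_debug, 0, "");

You can then read and set it from userland with "sysctl debug.rio_debug"
and "sysctl -w debug.rio_debug=1" instead of poking /dev/kmem with
addresses dug out of nm(1) output.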


>     By the way, I've been SUPER impressed with FreeBSD.  The code is
> well-written, especially compared to other operating systems I've
> worked on; Compiling the kernel is VERY fast; the system is fast and
> small; the installation process is easy; the small number of packages
> and ports I've tried work right away; the boot manager understands
> the file system.  It has the feel of a solid, well-put-together system.

Heh.  And with Rio added it should be 4-22 times faster, right?  8-) 8-).

I really look forward to seeing how you do... and when you get done,
you will probably find a ready market for the hardware you will be
using.

I'll help on any question I can, but it sure looks like you will need
John Dyson's and David Greenman's help on some of these things which
are not well documented...


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


