Date:      Tue, 17 Sep 1996 11:16:39 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        dg@root.com
Cc:        terry@lambert.org, bde@zeta.org.au, proff@suburbia.net, freebsd-hackers@FreeBSD.org
Subject:   Re: attribute/inode caching
Message-ID:  <199609171816.LAA04481@phaeton.artisoft.com>
In-Reply-To: <199609170512.WAA08889@root.com> from "David Greenman" at Sep 16, 96 10:12:10 pm

> >I find this hard to believe.  This would imply a limitation of the device
> >size to the file size, since the addressable extent for a vnode is smaller
> >than the addressable extent for a device.
> 
>    Huh? In FreeBSD, the device is referred to via the device vnode. How do
> you think FFS does the I/O for the inode block? It uses the block device-
> special vnode.

It dereferences function pointers from the struct fileops, of course
(which is itself An Abomination Which Must Be Destroyed).

The same routines are not invoked to page from a device as are invoked
to page from a file: that arrangement would be recursive when it went
to page a file from a device.
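
(As a rough sketch of the sort of dispatch being described, with invented
names rather than the real struct fileops layout: the file path can call
down into the device path, never the reverse, which is what avoids the
recursion.)

	#include <stdio.h>

	struct pager_ops {
		int (*getpages)(void *obj, long off, int npages);
	};

	static int dev_getpages(void *obj, long off, int npages)
	{
		(void)obj;
		/* read the pages straight from the underlying device */
		printf("device pager: %d page(s) at offset %ld\n", npages, off);
		return 0;
	}

	static int file_getpages(void *obj, long off, int npages)
	{
		printf("file pager: %d page(s) at offset %ld\n", npages, off);
		/* translate the file offset to device blocks, then call down
		 * into the device path; the device path never calls back up */
		return dev_getpages(obj, off, npages);
	}

	int main(void)
	{
		const struct pager_ops file_ops = { file_getpages };
		const struct pager_ops dev_ops  = { dev_getpages };

		/* callers only ever dereference the function pointer */
		file_ops.getpages(NULL, 8192, 1);
		dev_ops.getpages(NULL, 8192, 1);
		return 0;
	}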

> As for any implied size limitation, vnodes don't have any
> "size" associated with them. Anything (except the VM system) that deals with
> file offsets deals in 64bit quad_t's, and it doesn't matter if it's a file
> or a device or whatever. Depending on which version of the merged VM/buffer
> cache we're talking about, metadata may or may not be stored in VM pages. In
> all versions, however, it is cached in buffers (buffers can point to either
> malloced memory or VM pages).

Ask John Dyson.  It has to do with the mapping of the file as a virtual
address space.  Last time I looked in /a/src-fs/sys/ufs/ffs/ffs_vfsops.c,
I saw the following:


	static int
	ffs_oldfscompat(fs)
		struct fs *fs;
	{
		...
			fs->fs_maxfilesize = (u_quad_t) 1LL << 39;
		...

A clearly intentional limitation of 39 bits, based on the mappable virtual
address space.
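
For reference, that constant works out as follows (a standalone check,
nothing FreeBSD-specific):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long maxfilesize = 1ULL << 39;

		/* 2^39 bytes = 512 * 2^30 bytes = 512 GB */
		printf("%llu bytes (%llu GB)\n", maxfilesize, maxfilesize >> 30);
		return 0;
	}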


Yes, this is a VM limitation, but since that's how you do file I/O
(via faulting of pages mapped in a virtual address space), I can't
see how anything could be more relevant.
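
(As a user-level illustration of "file I/O via faulting of pages mapped
in a virtual address space", a minimal mmap() sketch; the in-kernel path
differs in detail but leans on the same mechanism:)

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		int fd;
		struct stat st;
		char *p;
		unsigned long sum = 0;
		off_t i;

		if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
			return 1;
		if (fstat(fd, &st) < 0 || st.st_size == 0)
			return 1;

		/* Map the file into the address space; no data is read yet. */
		p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Touching the mapping faults the pages in from the file. */
		for (i = 0; i < st.st_size; i++)
			sum += (unsigned char)p[i];
		printf("checksum %lu over %lld bytes\n", sum, (long long)st.st_size);

		munmap(p, st.st_size);
		close(fd);
		return 0;
	}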

1)	There *IS* a limitation of 2^39 bytes on individual file size.
2)	There *ISN'T* a limitation of 2^39 bytes on device size (ask
	Satoshi on that, if you don't believe me).
3)	Files are mapped as VM objects (ie: have cache on their vnodes).
4)	Devices are not mapped as VM objects.


I believe devices *should* be mapped as VM objects, and the quad arithmetic
overhead should be eaten (or the object references and manipulation should
be abstracted, duplicated for 64 and 32 bits, and flagged for reference
in the object itself).
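
(A toy sketch of that second alternative, with invented names: the
offset width is flagged in the object, and only objects that need it
pay for the quad arithmetic:)

	#include <stdint.h>
	#include <stdio.h>

	/* Invented illustration, not FreeBSD code. */
	struct cache_obj {
		int	wide;		/* 0: 32-bit offsets, 1: 64-bit (quad) offsets */
		union {
			uint32_t off32;
			uint64_t off64;
		} size;
	};

	/* One entry point; the quad path is only taken when flagged. */
	static uint64_t obj_size(const struct cache_obj *o)
	{
		return o->wide ? o->size.off64 : (uint64_t)o->size.off32;
	}

	int main(void)
	{
		struct cache_obj file = { 0, { .off32 = 1U << 30 } };	/* 1 GB file */
		struct cache_obj dev  = { 1, { .off64 = 1ULL << 41 } };	/* 2 TB device */

		printf("file: %llu bytes, device: %llu bytes\n",
		    (unsigned long long)obj_size(&file),
		    (unsigned long long)obj_size(&dev));
		return 0;
	}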


> >Which value are you claiming is in error?  It seems to me that if inode
> >blocks are hung off the device vnode (so why have the ihash?!?), then it
> >is an error to not limit the device size to the max file size.
> 
>    I think you're really starting to confuse things. The maximum file size
> is not a function of vnodes. We do have a problem with representing file
> offsets in the VM system beyond 31bits worth of pages (43bits total == 8TB),
> but this is hardly a concern. John may correct me on this, but I believe in
> the current scheme we do cache inode blocks in VM pages in -current. In 2.1.5,
> we couldn't because of the vm_page offset limitation. So for 2.1.5, we only
> cache inode blocks in malloced memory that is attached to buffers. Offsets
> in buffers are 40 bits large (31bits for signed long to hold the block number
> which is in units of 512 bytes (9bits)), this effectively limits all operations
> that involve struct buf's to 1TB, thus neither a device nor a file may be
> larger than this. We've had no compelling reason to fix this as it is more
> difficult than just changing the size of a daddr_t, and no one that I know of
> is using a 1TB filesystem.

Hopefully Satoshi will get his CCD that large, and then you will no
longer be able to ignore the issue.  8-).
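
(The 1TB figure quoted above is just the arithmetic; as a standalone
check:)

	#include <stdio.h>

	int main(void)
	{
		/* 31 bits of signed-long block number, 512-byte (2^9) blocks */
		unsigned long long max_bytes = (1ULL << 31) * 512;	/* = 2^40 */

		printf("%llu bytes = %llu GB (1 TB)\n", max_bytes, max_bytes >> 30);
		return 0;
	}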

Personally, I'd like to use page anonymity based protections to establish
Chorus-like access privilege domains for IPC; specifically, for stacks
capable of being grown by fault for use by threads.  I think the POSIX
model is broken: I should not be required to preallocate stack for a
thread just because SVR4 and Solaris have bogus architectures (actually,
the SVR4 VM does *not* impose this limitation: it is a limitation of
the threading code alone.  Steve Baumel, the author of the SVR4 VM, and
I discussed this at some length when discussing context sharing models
that would be useful for the NetWare for UNIX product).

The vnodes are, among other things, container objects for cached pages;
you could argue that (assuming that most of the other vnode cruft is
useless, which it is) their sole purpose in life is to establish address
space mappings for FS objects.  The problem with this scenario, which I
don't seem to be communicating effectively, is that only objects that
are contained in the FS are then capable of being cached in the unified
cache.
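
(A conceptual sketch of that container relationship, with invented
field names rather than the real struct vnode:)

	#include <stdio.h>

	/* Invented illustration: the vnode's essential job, per the argument
	 * above, is to hang a VM object, and thus its cached pages, off an
	 * FS object, giving it an address space to be cached under. */
	struct page {
		long		offset;		/* byte offset of the page in the object */
		struct page	*next;
	};

	struct vm_obj {
		struct page	*resident;	/* cached pages, looked up by offset */
	};

	struct vnode_sketch {
		struct vm_obj	object;		/* the mapping for this FS object */
		void		*fs_private;	/* e.g. the in-core inode */
	};

	/* NULL here means "not resident: fault it in". */
	static struct page *lookup(struct vnode_sketch *vp, long offset)
	{
		struct page *p;

		for (p = vp->object.resident; p != NULL; p = p->next)
			if (p->offset == offset)
				return p;
		return NULL;
	}

	int main(void)
	{
		struct page pg = { 8192, NULL };
		struct vnode_sketch vp = { { &pg }, NULL };

		printf("offset 8192 %s resident\n", lookup(&vp, 8192) ? "is" : "is not");
		printf("offset 4096 %s resident\n", lookup(&vp, 4096) ? "is" : "is not");
		return 0;
	}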

I guess it boils down to whether you trust locality of reference or you
don't.  Clearly, an object can be referenced without existing in the
unified cache, as long as there is a page assigned as backing for it.
That is what the ihash reference and the vnode reference to the in-core
inode are, and those are not managed as domains on the device at a page
boundary resolution.  This implementation fails to "trust" the locality
of reference model inherent in all caching systems.


> >The fact that the device size was allowed to be larger than the max file
> >size was one of the justifications John Dyson gave for not using caching
> >based on device/extent instead of (in addition to) vnode/extent in order
> >to keep the buffer cache unification of the vnode/extent mapping, but
> >resolve a lot of other issues.  For instance, if the device vnode is
> >in fact a device/extent cache, then there is no need for the ihash, since
> >the inodes are deterministically laid out and thus indexable by fault.  In
> >addition, the ability to address device blocks by fault on the device vnode
> >means that vclean is totally unnecessary.
> 
>    I can't parse this.

You need ihash because you can't page inodes into the cache: they
aren't in the buffers hung off the device vnode the way you claim they are.

If they were in the buffers hung off the device vnode, like you claim,
then they would be, by definition, in the cache, since this mapping
is what constitutes the VM/buffer cache unification.  As such, ihash
would be unnecessary; you could simply directly reference the pages
and they would be faulted in; if they were already in core, they would
be looked up off the device vnode (just like file pages currently are).
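
(To make the indexing argument concrete: a toy sketch with an invented,
simplified inode layout; real FFS scatters inodes across cylinder groups
via ino_to_fsba()/ino_to_fsbo(), which this does not model.  Because an
inode's device offset is computable from its number, a device-vnode
cache could be probed by offset and a miss simply faulted in, with no
ihash anywhere in the picture:)

	#include <stdint.h>
	#include <stdio.h>

	#define INOSZ		128		/* bytes per on-disk inode (assumed) */
	#define ITAB_OFF	(64 * 1024)	/* start of the inode table (assumed) */
	#define PAGESZ		4096

	/* Device-relative byte offset of inode `ino`: pure arithmetic, no hash. */
	static uint64_t inode_dev_offset(uint32_t ino)
	{
		return ITAB_OFF + (uint64_t)ino * INOSZ;
	}

	int main(void)
	{
		uint32_t ino = 1234;
		uint64_t off = inode_dev_offset(ino);

		/* This page-aligned offset is all a device-vnode cache needs:
		 * look the page up if resident, fault it in if not. */
		printf("inode %u: device offset %llu (page %llu, in-page %llu)\n",
		    ino, (unsigned long long)off,
		    (unsigned long long)(off / PAGESZ),
		    (unsigned long long)(off % PAGESZ));
		return 0;
	}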


I would like to see this happen, but it damn well has not, and the 39
bit limitation (which is not a 4.2BSD FS compatibility hack, despite its
location) is a limitation of *file* size and does not apply to *device*
size.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


