Date:      Tue, 12 Aug 1997 01:51:18 -0700 (PDT)
From:      Simon Shapiro <Shimon@i-Connect.Net>
To:        FreeBSD-Hackers@FreeBSD.org
Subject:   DBFS - A new Filesystem for FreeBSD - Proposal
Message-ID:  <XFMail.970812015118.Shimon@i-Connect.Net>

Hi Y'all,

I am building a special filesystem, purpose built for RDBMS service.  In
case someone else wants to use it, I will briefly discuss it below.  In any
case, in the end there is a question I would like to get your opinion on.

Known Features:

*  No subdirectories
*  No (or very simplistic) permissions
*  Very simple and linear block allocation
*  Able to span devices
*  Able to use resources (devices) on other, remote systems
*  Able to share the filesystem with another processor across the physical
   medium
*  Very fast directory and resource search
*  Able to perform either buffered or raw I/O on the same file.
*  Able to preallocate storage to a file, so files are contiguous on a
   device
*  Able to pre-specify the number and size of extents a file can acquire
   beyond initial allocation
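
To make the last two features concrete, here is a minimal sketch in C of
how a per-file allocation policy might be tracked.  The struct, its field
names, and the helper functions are all hypothetical illustrations, not
actual DBFS structures; sizes are assumed to be in filesystem blocks, and
overflow checking is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file allocation policy; all sizes are in fs blocks. */
struct dbfs_alloc {
    size_t initial_blocks;  /* contiguous blocks reserved at creation */
    size_t max_extents;     /* extents allowed beyond the initial allocation */
    size_t extent_blocks;   /* size of each additional extent */
    size_t used_extents;    /* extents acquired so far */
};

/* Largest size the file may ever reach under its policy. */
static size_t dbfs_max_blocks(const struct dbfs_alloc *a)
{
    return a->initial_blocks + a->max_extents * a->extent_blocks;
}

/* Blocks currently backing the file. */
static size_t dbfs_cur_blocks(const struct dbfs_alloc *a)
{
    return a->initial_blocks + a->used_extents * a->extent_blocks;
}

/* Grow the file to hold at least `want` blocks; false if policy forbids it. */
static bool dbfs_grow(struct dbfs_alloc *a, size_t want)
{
    if (want > dbfs_max_blocks(a))
        return false;               /* would exceed the pre-specified limit */
    while (dbfs_cur_blocks(a) < want)
        a->used_extents++;          /* acquire one more whole extent */
    return true;
}
```

With an initial allocation of 1000 blocks and up to 4 extents of 250
blocks each, such a file could never exceed 2000 blocks, and a request to
grow to 1300 blocks would acquire two extents.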

If you have opinions so far, let me know.  Remember, this filesystem is
NOT designed to compete with, replace, augment, nor complement existing
filesystems.  It is designed specifically as a storage manager for RDBMS
engines.

Questions:

a.  How do I provide the proper semantics for file creation in the Unix
    context?

The Unix semantics for filesystems are that files are a stream of bytes,
managed via a buffered I/O subsystem, etc.  There is no place to specify
size upon creation, extent policy, etc.
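
For illustration, here is roughly what creation plus preallocation looks
like with standard calls, assuming a system that provides the POSIX
posix_fallocate() call.  The helper name create_preallocated() is
hypothetical.  Note that it takes two separate steps, gives no guarantee
of contiguity, and has nowhere to express an extent policy.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Create `path` and reserve `bytes` of storage; returns an fd or -1. */
static int create_preallocated(const char *path, off_t bytes)
{
    /* Step 1: create the file; open() has no argument for a size. */
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* Step 2: reserve space in a second call.  posix_fallocate()
     * returns an error number instead of setting errno. */
    if (posix_fallocate(fd, 0, bytes) != 0) {
        close(fd);
        unlink(path);   /* do not leave a half-created file behind */
        return -1;
    }
    return fd;
}
```

Between the two steps the file exists with zero length, which is exactly
the kind of gap a single creation call carrying the size would close.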

b.  How do I provide for buffered and non-buffered (raw) I/O on the same
    file?

We have considered several options:

a.  Put the whole thing in a userspace library.  This is what virtually
    every RDBMS vendor has done.  It has the advantage of being easily
    portable and very easy to accomplish all the above as the semantics can
    be arbitrarily specified in the API without any consideration to
    existing Unix semantics.  We do not like this option as it interferes
    with the distributed and shared aspects and is very costly in the area
    of locking and synchronization (too many syscalls, too many semaphores,
    etc.).

b.  Put the whole thing in a filesystem and forget any feature that violates
    Unix file I/O semantics.  We like this one the least as this is a 
    ``Sodom Bed'' (Genesis, shorten the legs of the too tall and stretch 
    he who is too short).

c.  Put the whole thing in the kernel and do all access from a library that
    does all the I/O via ioctl syscalls to the device.  This is a hack that
    might work but will force lots of copyin and copyout, where raw I/O 
    would have done much better.

d.  Extend existing system calls to accept the extra arguments we need.
    This is dangerous in our humble opinion as it will tamper with unbroken
    things and violate more standards than we have fingers to count them
    on.

e.  Add new syscalls to take care of what we need, exactly the way we need
    it.  Without too many details, as an invitation to discuss, one can
    envision these:

    int dbopen(const char *path, size_t initial_size, size_t extents,
               size_t extent_size);

    We really do not need permissions and modes, but could have them if
    necessary.  Sizes are expressed in blocks, native to that fs instance.

    ssize_t dbread(int fd, void *buff, ssize_t offset, size_t blocks,
                   int flags);

    Flags can be DBIO_RAW which forces a read from disk, rather than 
    buffered read.  Yes, we are aware of double jeopardy in reading
    unbuffered where buffered copies exist.  We either have NO buffered 
    I/O at all, or do our own buffering in user space, or you come up with
    a better solution :-)
    Also notice that we eliminate lseek in favor of specifying the seek as
    number of blocks from beginning of file.

    ssize_t dbwrite(int fd, const void *buff, ssize_t offset, 
                    size_t blocks, int flags);

    Again, flags can be DBIO_RAW which forces flushed/synchronous write.

We are currently implementing this code, and an alpha version should be
running in the next week or so.  We would like it to be as acceptable as
possible to you gals and guys, so let me know what you think.

Simon