From: Aram Havarneanu <aram.h@mgk.ro>
To: freebsd-hackers@freebsd.org
Date: Mon, 30 Mar 2009 01:29:36 +0300
Subject: Shared Disk/Transactional/Distributed file system (GSoC Proposal)

I have been giving some thought lately to ideas I would like to pursue for Google Summer of Code. I haven't posted my application yet, as I hope to get some feedback first.

I want to make an OpenVMS-inspired file system. The key elements would be record-oriented I/O, transaction processing and asynchronous I/O. Ideally, the file system would have redundancy features (for high availability) implemented through clustering. The file system should be a shared-disk file system, usable in a SAN environment with multiple clients that use the exported block devices simultaneously.
The first design issue is whether spreading the file system over a number of machines on the network is still a relevant feature today. OpenVMS did that, and it provided redundancy (you could mirror data between nodes) and performance (multiple machines could serve you data at the same time). These days people tend to centralize storage in a SAN. The SAN provides its own redundancy, and so far performance is not an issue, as SANs seem to handle scalability extremely well. Even though spreading the file system across the network has some theoretical performance advantages, current network throughput is a bottleneck for exploiting them: a current hard drive is faster than Gigabit Ethernet, and 10-Gigabit Ethernet, which still has a prohibitive price today, is easily saturated by a small cluster of only a few machines. Other network technologies are, again, prohibitively priced. In the past you could not put much storage on one machine, so the ability to spread storage across multiple machines was important, but today storage is almost free and can usually be scaled far enough within a SAN.

Another question is whether to make it a pure record-oriented I/O file system, or to also implement traditional I/O. A pure record-oriented file system would make the distributed lock manager's job much simpler, as there is a simpler mapping between the raw bits on the block device and the resources (files/records/fields) the DLM manages. Of course, the VFS interface would be just a convenience for such a system, as the abstraction it provides would add no value; for such a file system to be really useful, it needs to be used through the transactional, record-oriented I/O API anyway. The other option would be to make it a mixed file system, like Files-11 in OpenVMS, with both traditional and record-oriented I/O.
In that case, I would probably use the UFS on-disk structure, so it would be more of an addition to UFS than a new file system. Then again, FreeBSD has ZFS, which has a really nice layered architecture, so I could use the lower ZFS layers that deal with block devices and provide things like redundancy, and implement a file system on top of them. FreeBSD also has GEOM, and with some clever programming I could use that as well. There are many ways to do this.

Another thing I would like to address is asynchronous I/O. In a way it is just fancy buffered I/O, but from the perspective of the programmer using the API it is much more than that. There are many cases where you want to make lots of unrelated commits to the resource pool, and these I/O operations rarely fail. Or you make a big commit that takes time and, concurrently, smaller commits that are more urgent to finish, and you don't want the big commit to block the smaller ones. Of course, you can solve these issues in multiple ways, but async I/O makes the programmer's job much easier: you just make transactions and install handlers for Asynchronous System Traps (ASTs) that deal with aborted transactions, finished transactions and so on. The AST mechanism works somewhat like UNIX signals, but ASTs don't interrupt system calls and can be queued.

There is also the question of how to solve cache coherency between different nodes. With async I/O, caching write operations is not that important, but I think caching read operations is. This must be implemented in the distributed lock manager. Simply put, whatever is not locked by anybody should be current in the local cache and can be accessed from there, once a request is made to the DLM and the DLM grants the read lock.
The distributed lock manager maintains a directory of requested resources, held in concurrent read, concurrent write, protected read, protected write or exclusive mode. When transactions are made, it is the responsibility of the DLM to invalidate caches. This implementation is fairly expensive in terms of the round-trip time to the DLM, but I think a request can take less than 0.75 ms on a LAN, while disk access time is on the order of 5 ms even for fast disks, so that should not be an issue.

From a high-level programmer's perspective, things work like this:

1) You make a request for a lock to the DLM. Requests are queued. Requests can be for concurrent read (a desire to read that doesn't stop others from updating), concurrent write (non-blocking read-write), protected read (locks the resource globally in read-only mode, preventing others from modifying it), protected write (locks the resource globally so that only you can update it) or exclusive mode, where only you can hold a lock. Locks can cover whole files, records or even fields, allowing for flexibility and granularity.

2) Eventually the DLM grants you the lock and you can run transactions against the resource. Transactions are asynchronous by default, but can be made synchronous if needed. You can install AST handlers to perform various tasks when certain events occur.

3) You release the lock.

There is a lot that can be done, and it can be done in various ways, which is basically why I posted this to the list -- for discussion and suggestions. My ideas may seem vague at the moment, because with these simple building blocks you can implement many different things. Hopefully, with your help, we will come up with something that is interesting, usable and feasible in such a short time, at least at a prototype level. In any case, if I do this, I don't plan to stop working on it after GSoC. I will work on it as long as necessary.
Any feedback is greatly appreciated. I would also appreciate any hints toward general FreeBSD kernel programming. I have read the developer docs on the website; I have (and have mostly read) "The Design and Implementation of the FreeBSD Operating System" by Marshall Kirk McKusick and George V. Neville-Neil (I also read the 4.4BSD version); and I have read "Designing BSD Rootkits: An Introduction to Kernel Hacking" by Joseph Kong.

Thanks,

--
Aram Hăvărneanu