From: Aram Havarneanu <aram.h@mgk.ro>
To: freebsd-hackers@freebsd.org
Date: Mon, 30 Mar 2009 01:29:36 +0300
Subject: Shared Disk/Transactional/Distributed file system (GSoC Proposal)

I have been giving some thought lately to ideas I would like to pursue for Google Summer of Code. I haven't posted my application yet, as I hope to get some feedback first.

I want to make an OpenVMS-inspired file system. The key elements would be record-oriented I/O, transaction processing and asynchronous I/O. Ideally, the file system would have redundancy features (for high availability) implemented through clustering. The file system should be a shared-disk file system, usable in a SAN environment with multiple clients that use the exported block devices simultaneously.
The first design issue is whether spreading the file system over a number of machines on the network is still a relevant feature today. OpenVMS did that, and it provided redundancy (you could mirror data between nodes) and performance (multiple machines could serve you data at the same time). These days people tend to centralize storage in a SAN. The SAN provides its own redundancy, and so far performance is not an issue, as SANs seem to handle scalability extremely well. Even though spreading the file system across the network has some theoretical performance advantages, current network throughput is a bottleneck for exploiting them: a current hard drive is faster than Gigabit Ethernet, and 10-Gigabit Ethernet, which still has a prohibitive price today, is easily saturated by a small cluster of only a few machines. Other network technologies are, again, prohibitively priced. In the past you could not put much storage on one machine, so the ability to spread storage across multiple machines was important, but today storage is almost free and can usually be scaled far enough within a SAN.

Another question is whether to make it a pure record-oriented I/O file system, or to also implement traditional I/O. A pure record-oriented file system would make the distributed lock manager's job much simpler, as there is a simpler mapping between the raw bits on the block device and the resources (files/records/fields) the DLM manages. Of course, the VFS interface would be just a convenience for such a system, as the abstraction it provides would add no value; for such a file system to be really useful, it needs to be used through the transactional, record-oriented I/O API anyway. The other option would be to make it a mixed file system, like Files-11 in OpenVMS, with both traditional and record-oriented I/O.
In that case, I would probably use the UFS on-disk structure, so it would be more of an addition to UFS than a new file system. Then again, FreeBSD has ZFS, which has a really nice layered architecture, so I could use the lower ZFS layers that deal with block devices and provide things like redundancy, and implement a file system on top of them. FreeBSD also has GEOM, and with some clever programming I could use that as well. There are many ways to do this.

Another thing I would like to address is asynchronous I/O. In a way it is just fancy buffered I/O, but from the perspective of the programmer using the API it is much more than that. There are many cases where you want to make lots of unrelated commits to the resource pool, and these I/O operations rarely fail. Or you make a big commit that takes time and, concurrently, smaller commits that are more urgent to finish, and you don't want the big commit to block the smaller ones. Of course, you can solve these issues in multiple ways, but async I/O makes the programmer's job much easier: you just make transactions and install handlers for Asynchronous System Traps (ASTs) that deal with aborted transactions, finished transactions and so on. The AST mechanism works somewhat like UNIX signals, but ASTs don't interrupt system calls and can be queued.

There is also the question of how to solve cache coherency between different nodes. With async I/O, caching write operations is not that important, but I think caching read operations is. This must be implemented in the distributed lock manager. Simply put, whatever is not locked by anybody should be current in the local cache and can be accessed from there, once a request is made to the DLM and the DLM grants the read lock.
The distributed lock manager maintains a directory of requested resources, held in concurrent read, concurrent write, protected read, protected write or exclusive mode. When transactions are made, it is the responsibility of the DLM to invalidate caches. This implementation is fairly expensive in terms of the round-trip time to the DLM, but I think a request can take less than 0.75 ms on a LAN, while disk access time is on the order of 5 ms even for fast disks, so that should not be an issue.

From a high-level programmer's perspective, things work like this:

1) You make a request for a lock to the DLM. Requests are queued. Requests can be for concurrent read (a desire to read that doesn't stop others from updating), concurrent write (non-blocking read-write), protected read (locks the resource globally in read-only mode, preventing others from modifying it), protected write (locks the resource globally so that only you can update it) or exclusive mode, where only you can hold a lock. Locks can cover whole files, records or even fields, allowing for flexibility and granularity.

2) Eventually the DLM grants you the lock and you can run transactions against the resource. Transactions are asynchronous by default, but can be made synchronous if needed. You can install AST handlers to perform various tasks when certain events occur.

3) You release the lock.

There is a lot that can be done, and it can be done in various ways, which is basically why I posted this to the list -- for discussion and suggestions. My ideas may seem vague at the moment, because with these simple building blocks you can implement many different things. Hopefully, with your help, we will come up with something that is interesting, usable and feasible in such a short time, at least at a prototype level. In any case, if I do this, I don't plan to stop working on it after GSoC. I will work on it as long as necessary.
Any feedback is greatly appreciated. I would also appreciate any hints toward general FreeBSD kernel programming. I have read the developer docs on the website; I have (and have mostly read) "The Design and Implementation of the FreeBSD Operating System" by Marshall Kirk McKusick and George V. Neville-Neil (I also read the 4.4BSD version); and I have read "Designing BSD Rootkits: An Introduction to Kernel Hacking" by Joseph Kong.

Thanks,

--
Aram Hăvărneanu