Date: Mon, 19 Jun 2006 15:11:01 +0200
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: freebsd-current@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Journaling UFS with gjournal.
Message-ID: <20060619131101.GD1130@garage.freebsd.pl>

Hello.

For the last few months I have been working on the gjournal project.
To avoid confusion right away, I want to note that this project is not
related to the gjournal project Ivan Voras worked on during the last
SoC (2005).

The lack of a journaled file system in FreeBSD has been an Achilles'
heel for many years. We do have many file systems, but none with
journaling:

- ext2fs (journaling is in ext3fs),
- XFS (read-only),
- ReiserFS (read-only),
- HFS+ (read-write, but without journaling),
- NTFS (read-only).

GJournal was designed to journal GEOM providers, so it actually works
below the file system layer, but it has hooks which allow it to work
with file systems. In other words, gjournal is not file
system-dependent; it can probably work with any file system with
minimal knowledge about it. I implemented only UFS support.

The patches are here:

	http://people.freebsd.org/~pjd/patches/gjournal.patch (for HEAD)
	http://people.freebsd.org/~pjd/patches/gjournal6.patch (for RELENG_6)

To patch your sources you need to:

	# cd /usr/src
	# mkdir sbin/geom/class/journal sys/geom/journal sys/modules/geom/geom_journal
	# patch < /path/to/gjournal.patch

Add 'options UFS_GJOURNAL' to your kernel configuration file and
recompile kernel and world.

How it works (in short). You may define one or two providers which
gjournal will use. If one provider is given, it will be used for both
data and journal. If two providers are given, one will be used for data
and one for the journal.
Every few seconds (you may define how many) the journal is terminated
and marked as consistent, and gjournal starts to copy data from it to
the data provider. At the same time, new data is stored in a new
journal.
Let's call the moment at which the journal is terminated a "journal
switch".
Journal switch looks as follows:

1. Start journal switch if we have timeout or if we run out of cache.
   Don't perform the journal switch if there were no write requests.
2. If we have a file system, synchronize it.
3. Mark the file system as clean.
4. Block all write requests to the file system.
5. Terminate the journal.
6. If needed, wait until copying of the previous journal is finished.
7. Send a BIO_FLUSH request (if the given provider supports it).
8. Mark the new journal position on the journal provider.
9. Unblock write requests.
10. Start copying data from the terminated journal to the data provider.

There were a few things I needed to implement outside gjournal to make
it work reliably:

- The BIO_FLUSH request. Currently we have three I/O requests: BIO_READ,
  BIO_WRITE and BIO_DELETE. I added BIO_FLUSH, which means "flush your
  write cache". The request is always sent with the largest possible
  bio_offset (the mediasize of the destination provider), so it works
  properly with bioq_disksort(). The caller needs to hold off further
  I/O requests until BIO_FLUSH returns, so we avoid a starvation effect.
  The hard part is that it has to be implemented in every disk driver,
  because flushing the cache is a driver-dependent operation.
  I implemented it for ata(4) disks and amr(4). The good news is that
  it's easy.
  GJournal can also work with providers that don't support BIO_FLUSH,
  and in my power-failure tests it worked well (no problems), but that
  depends on the gjournal cache being bigger than the controller cache,
  so it is hard to call it reliable.

  The documentation of many journaled file systems tells you to turn
  off the write cache if you want to use them. This is not the case
  for gjournal (especially when your disk driver does support
  BIO_FLUSH).

- The 'gjournal' mount option. To implement gjournal support in UFS
  I needed to change the way deleted but still open objects are
  handled. Currently, when a file or directory is open and we delete
  the last name which references it, it is still usable by those who
  keep it open.
  When the last consumer closes it, the inode and blocks are freed.
  On journal switch I cannot leave such objects around, because after
  a crash fsck(8) is not used to check the file system, so the inode
  and blocks would never be freed. When a file system is mounted with
  the 'gjournal' mount option, such objects are not removed while they
  are open. When the last name is deleted, the file/directory is moved
  to the .deleted/ directory and removed from there on last close.
  This way, I can just clean the .deleted/ directory after a crash at
  mount time.

Quick start:

	# gjournal label /dev/ad0
	# gjournal load
	# newfs /dev/ad0.journal
	# mount -o async,gjournal /dev/ad0.journal /mnt

(yes, with gjournal 'async' is safe)

Now, after a power failure or system crash no fsck is needed (yay!).

There are two hacks in the current implementation which I'd like to
reimplement. The first is how the 'gjournal' mount option is
implemented. There is a garbage collector thread which is responsible
for deleting objects from the .deleted/ directory, and it uses full
paths. Because of this, when your mount point is /foo/bar/baz and you
rename 'bar' to something else, it will not work. This is not something
that is done often, but it definitely should be fixed and I'm working
on it.
The second hack is related to communication between gjournal and the
file system. GJournal decides when to make the switch and has to find
the file system which is mounted on it. Looking for this file system is
not nice and should be reimplemented.

There are some additional goodies which come with gjournal. For
example, if gjournal is configured on top of gmirror or graid3, there
is no need to synchronize the mirror/raid3 device even after a power
failure or system crash, because the data will be consistent.

I spent a lot of time working on gjournal optimization. Because I have
a few seconds before the data hits the data provider, I can do things
like combining smaller write requests into larger ones, ignoring data
written twice to the same place, etc.
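
The write-combining idea can be illustrated with a small sketch. This
is not the gjournal code, only a hedged illustration of the technique:
it keeps a sorted table of byte ranges that still have to be copied to
the data provider and merges every new write with the ranges it
overlaps or touches. All names here (struct extent, struct pending,
pending_add(), pending_bytes(), MAX_EXTENTS) are made up for the
example, and it tracks ranges only; real code would also have to keep
the newest data for each range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MAX_EXTENTS	64	/* hypothetical table size */

/* One byte range that still has to be copied to the data provider. */
struct extent {
	uint64_t start;		/* first byte offset */
	uint64_t end;		/* one past the last byte */
};

/* Sorted table of ranges that neither overlap nor touch each other. */
struct pending {
	struct extent ext[MAX_EXTENTS];
	size_t count;
};

/*
 * Record a write of 'len' bytes at offset 'off' (off + len is assumed
 * not to overflow). Returns 0 on success, -1 if the table is full.
 */
int
pending_add(struct pending *p, uint64_t off, uint64_t len)
{
	uint64_t start = off, end = off + len;
	size_t i, first, last;

	assert(p->count <= MAX_EXTENTS);
	if (len == 0)
		return (0);

	/* Skip ranges that end strictly before the new write starts. */
	first = 0;
	while (first < p->count && p->ext[first].end < start)
		first++;

	/* Absorb every range that overlaps or touches [start, end). */
	last = first;
	while (last < p->count && p->ext[last].start <= end) {
		if (p->ext[last].start < start)
			start = p->ext[last].start;
		if (p->ext[last].end > end)
			end = p->ext[last].end;
		last++;
	}

	if (first == last) {
		/* Nothing to merge with: make room for one new entry. */
		if (p->count == MAX_EXTENTS)
			return (-1);
		for (i = p->count; i > first; i--)
			p->ext[i] = p->ext[i - 1];
		p->count++;
	} else if (last - first > 1) {
		/* Several entries collapsed into one: close the gap. */
		for (i = last; i < p->count; i++)
			p->ext[first + 1 + (i - last)] = p->ext[i];
		p->count -= last - first - 1;
	}
	p->ext[first].start = start;
	p->ext[first].end = end;
	return (0);
}

/* Number of bytes that would have to be copied at the next switch. */
uint64_t
pending_bytes(const struct pending *p)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < p->count; i++)
		total += p->ext[i].end - p->ext[i].start;
	return (total);
}
```

With this table, two adjacent 512-byte writes end up as a single
1024-byte range, and writing the same block twice does not increase
the amount of data to copy.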

Because of these optimizations, operations on small files are quite
fast. On the other hand, operations on large files are slower, because
I need to write the data twice and there is no room for optimization.
Here are some numbers.

gjournal(1) - the data provider and the journal provider on the same disk
gjournal(2) - the data provider and the journal provider on separate disks

Copying one large file:

	UFS:            8s
	UFS+SU:         8s
	gjournal(1):    16s
	gjournal(2):    14s

Copying eight large files in parallel:

	UFS:            120s
	UFS+SU:         120s
	gjournal(1):    184s
	gjournal(2):    165s

Untarring eight src.tgz in parallel:

	UFS:            791s
	UFS+SU:         650s
	gjournal(1):    333s
	gjournal(2):    309s

Reading. grep -r on two src/ directories in parallel:

	UFS:            84s
	UFS+SU:         138s
	gjournal(1):    102s
	gjournal(2):    89s

As you can see, even on one disk, untarring eight src.tgz is about two
times faster than UFS+SU. I have no idea why gjournal is faster than
UFS+SU in reading.

There are a bunch of sysctls to tune gjournal (the kern.geom.journal
tree).

When only one provider is given for both data and journal, the journal
part is placed at the end of the provider, so one can still use the
file system without journaling. If you use such a configuration (one
disk), it is better for performance to place the journal before the
data, so you may want to create two partitions (e.g. 2GB for ad0a and
the rest for ad0d) and create gjournal this way:

	# gjournal label ad0d ad0a

Enjoy!

The work was sponsored by home.pl (http://home.pl).

The work was done by Wheel LTD (http://www.wheel.pl).
The work was tested in the netperf cluster.

I want to thank Alexander Kabaev (kan@) for the help with VFS and
Mike Tancsa for test hardware.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!