Date: Tue, 15 Jul 2008 10:52:41 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200807151752.m6FHqf7E007803@apollo.backplane.com>
To: freebsd-stable@freebsd.org, sven@dmv.com
References: <200807151523.m6FFNIRJ044047@lurza.secnetix.de> <487CC919.2010203@FreeBSD.org>
Subject: Re: Multi-machine mirroring choices

:Oliver Fromme wrote:
:
:> Yet another way would be to use DragoFly's "Hammer" file
:> system which is part of DragonFly BSD 2.0 which will be
:> released in a few days.  It supports remote mirroring,
:> i.e. mirror source and mirror target can run on different
:> machines.  Of course it is still very new and experimental
:> (however, ZFS is marked experimental, too), so you probably
:> don't want to use it on critical production machines.
:
:Let's not get carried away here :)
:
:Kris

Heh.
I think it's safe to say that a *NATIVE* uninterrupted and fully
cache coherent fail-over feature is not something any of us in BSDland
have yet.  It's a damn difficult problem that is frankly best solved
above the filesystem layer, but with filesystem support for bulk
mirroring operations.

HAMMER's native mirroring was the last major feature to go into it
before the upcoming release, so it will definitely be more experimental
than the rest of HAMMER.  This is mainly because it implements a
full-blown queue-less incremental snapshot and mirroring algorithm,
single-master-to-multi-slave.  It does it at a very low level, by
optimally scanning HAMMER's B-Tree.  In other words, the kitchen sink.

The B-Tree propagates the highest transaction id up to the root to
support incremental mirroring, and that's the bit that is highly
experimental and not well tested yet.  It's fairly complex because
even destroyed B-Tree records and collapses must propagate a
transaction id up the tree (so the mirroring code knows what it needs
to send to the other end to do comparative deletions on the target).
(Transaction ids are bundled together in larger flushes, so the actual
B-Tree overhead is minimal.)

The rest of HAMMER is shaping up very well for the release.  It's
phenomenal when it comes to storing backups.  Post-release I'll be
moving more of our production systems to HAMMER.  The only sticky
issue we have is filesystem-full handling, but it is more a matter of
fine-tuning than anything else.

--

Someone mentioned atime and mtime.  For something like ZFS or HAMMER,
these fields represent a real problem (atime more than mtime).  I'm
kinda interested in knowing, does ZFS do block replacement for atime
updates?

For HAMMER I don't roll new B-Tree records for atime or mtime updates.
I update the fields in-place in the current version of the inode, and
all snapshot accesses will lock them (in getattr) to ctime in order to
guarantee a consistent result.
That way (tar | md5) can be used to validate snapshot integrity.

At the moment, in this first release, the mirroring code does not
propagate atime or mtime.  I plan to do it, though.  Even though I
don't roll new B-Tree records for atime/mtime updates, I can still
propagate a new transaction id up the B-Tree to make the changes
visible to the mirroring code.  I'll definitely be doing that for
mtime and will have the option to do it for atime as well.

But atime still represents a big expense in actual mirroring
bandwidth.  If someone reads a million files on the master then a
million inode records (sans file contents) would end up in the
mirroring stream just for the atime update.  Ick.

-Matt
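
Editor's sketch: the idea of propagating the highest transaction id up the tree, so that an incremental mirror scan can skip unchanged subtrees, can be illustrated with a toy model. This is not HAMMER's code or on-disk format. The `Node` class, the `write` and `scan_since` functions, and the use of a `None` value as a deletion marker are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, int, Optional[str]]  # (key, tid, value or None if deleted)


@dataclass(eq=False)
class Node:
    """Toy tree node. Assumption: a stand-in, NOT HAMMER's B-Tree layout."""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    # key -> (tid, value); a value of None marks a deleted record
    records: Dict[str, Tuple[int, Optional[str]]] = field(default_factory=dict)
    max_tid: int = 0  # highest transaction id anywhere in this subtree


def add_child(parent: Node) -> Node:
    child = Node(parent=parent)
    parent.children.append(child)
    return child


def write(node: Node, key: str, value: Optional[str], tid: int) -> None:
    """Store a record (or a deletion marker) and push its tid toward the root."""
    node.records[key] = (tid, value)
    cur: Optional[Node] = node
    # Stop early once an ancestor already carries a tid at least this high.
    while cur is not None and cur.max_tid < tid:
        cur.max_tid = tid
        cur = cur.parent


def scan_since(node: Node, since_tid: int, visited: List[Node]) -> List[Change]:
    """Collect records newer than since_tid, skipping unchanged subtrees."""
    if node.max_tid <= since_tid:
        return []  # nothing below this node changed since the last sync
    visited.append(node)
    out: List[Change] = [
        (key, tid, value)
        for key, (tid, value) in sorted(node.records.items())
        if tid > since_tid
    ]
    for child in node.children:
        out.extend(scan_since(child, since_tid, visited))
    return out


root = Node()
a, b = add_child(root), add_child(root)
write(a, "file1", "v1", tid=1)
write(b, "file2", "v1", tid=2)   # assume the mirror is in sync up to tid 2
write(b, "file2", None, tid=3)   # a deletion still raises max_tid up the tree

visited: List[Node] = []
changes = scan_since(root, since_tid=2, visited=visited)
```

In this toy run the scan never descends into `a`, because its `max_tid` is not newer than the last synced tid, and the deletion of `file2` shows up as a change the target could apply. A real implementation would also have to deal with node splits and collapses and with batching of flushes, which this sketch ignores.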