Date: Mon, 24 Oct 2011 14:33:17 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
To: Alfred Bartsch <bartsch@dssgmbh.de>
Cc: freebsd-geom@freebsd.org
Subject: Re: disk partitioning with gmirror + gpt + gjournal (RFC)
Message-ID: <4EA55B0D.4080508@quip.cz>
In-Reply-To: <4E9FF09B.4030204@dssgmbh.de>
References: <4E69A152.6090408@rdtc.ru> <4E69EB15.50808@rdtc.ru> <4E9D2117.4090203@dssgmbh.de> <4E9D3B56.50300@quip.cz> <4E9D99F3.40104@dssgmbh.de> <4E9ED3C8.7030607@quip.cz> <4E9FF09B.4030204@dssgmbh.de>
Alfred Bartsch wrote:
> On 19.10.2011 15:42, Miroslav Lachman wrote:
[...]
>> UEFI will replace the old BIOS sooner or later, so what will you do
>> then? Then you will need to rework your servers and change your
>> setup routine. And I think it is better to avoid a known possible
>> problem than to hope "it will not bite me". You can't avoid Murphy's
>> law ;)
>
> From my present point of view there are two alternatives: Hardware
> RAID and (matured) ZFS.
>
> If I were a GEOM guru, I would try to enhance the compatibility
> between upcoming UEFI and GMIRROR / GRAID3 etc. Just guessing:
> What about adding a flag named "-gpt", "-efi" or just "-offset" to
> these GEOMs to reserve enough space (at least 33 sectors) behind the
> metadata sector at the end of the disk provider to hold whatever
> secondary GPT table is needed to satisfy UEFI.

In an ideal world that would be "the right way", but I guess it will
never happen in our real FreeBSD world. There is nobody with the time,
skills and courage to rework "all" GEOM classes. It is not such an easy
task (backward compatibility, compatibility with other OSes / tools,
etc.)

>>>> I am using gjournal on a few of our servers, but we are slowly
>>>> removing it from our setups. Data writes to gjournaled disks
>>>> are too slow and sometimes gjournal is not playing nice.
>>>
>>> I'm heavily interested in more details.
>>
>> When I did some tests in the past, gjournal could not be used in
>> combination with iSCSI and I was not able to stop gjournal tasting
>> providers (I was not able to remove / disable gjournal on a device)
>> until I stopped all of them and unloaded the gjournal kernel
>> module. I don't know the current state.
>
> Up to now I'm not using any iSCSI equipment. Good to know about some
> weaknesses in advance.
>
>>>> Maybe ZFS or UFS+SUJ is a better option.
>>>
>>> Yes, maybe. ZFS is mainly for future use. Do you use the second
>>> option on large filesystems?
>>
>> ZFS has been there for "a long time". I feel safe using it in
>> production on a few of our servers. I didn't test UFS+SUJ because
>> it is released in the forthcoming 9.0 and we are not deploying
>> CURRENT on our servers.
>
> Compared to UFS, ZFS lifetime is relatively short. From my point of
> view ZFS in its present state is too immature to hold mission
> critical data, YMMV.

UFS2 or UFS2+SU (Soft Updates) has been around for a longer time than
ZFS, but UFS2+SUJ (journaled soft updates) has been there only for a
short time and is not much tested in production. Even UFS2+gjournal is
not widely deployed / tested.

> On the other hand ZFS needs a lot of redundant disk space and memory
> to work as expected, not to forget CPU cycles. IMHO, ZFS is not
> 32-bit capable, so there is no way to use it on older and small
> hardware.

Yes, you are right, ZFS cannot be used in some scenarios. But in some
other scenarios ZFS is the best possible choice, e.g. for large
flexible storage I will use ZFS, for a database server I will use
UFS2+SU without gjournal.

[...]

> Did you perform any benchmarks (UFS+Softupdates vs. UFS+Gjournal)?
> If yes, did you compare async mounts + write cache enabled
> (gjournal) to sync mounts + write cache disabled (softupdates)?

I don't have fancy graphs or tables from benchmarking software, I just
have real workload experience where the write cache was enabled in
both cases.

> If I understand you right, you prefer write speed to data
> consistency in these cases. This may be the right choice in your
> environment.
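If somebody wants to repeat such a comparison on a test box, the knobs
to flip are the drive write cache tunable and the newfs / mount
options. Just a rough sketch of the two configurations, nothing more -
hw.ata.wc applies to the old ata(4) driver, and the device name and
mount point are only examples (I reuse the ones from the log later in
this mail):

# A) softupdates, drive write cache disabled, default (sync metadata) mount
echo 'hw.ata.wc="0"' >> /boot/loader.conf   # loader tunable, needs a reboot
newfs -U /dev/mirror/gm0s2e
mount /dev/mirror/gm0s2e /vol0

# B) gjournal, write cache left enabled, async mount
gjournal load
gjournal label /dev/mirror/gm0s2e           # journal kept on the same provider
newfs -J /dev/mirror/gm0s2e.journal
mount -o async /dev/mirror/gm0s2e.journal /vol0

# run the same (production-like) workload against /vol0 in both cases

Numbers from synthetic benchmarks and from a real workload can differ
a lot, which is why I only talk about our production experience here.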
> From my point of view, I am happy to find all bits restored in /var
> after an unclean shutdown for error analysis and recovery purposes,
> and I hate the vision of having to restore databases from backup,
> even after power failures. Furthermore I am glad to only have to
> wait for gmirror synchronizing to regain redundancy after replacing
> a failed disk.

I am not sure you can rely on data consistency with today's HDDs when
the cache is enabled, even if you use gjournal. You always lose the
content of the device cache, as rotating disks (and some flash
devices - SSDs - too) are known to lie to the OS about "data is
written", so you end up with lost or damaged data after an unclean
shutdown. Database engines handle this in their own way with their own
journal log etc., because some of them can be installed on raw
partitions without an underlying FS. (MySQL can do it too.)

I remember one case (about 3 years ago) where a server remained
unbootable after a kernel panic and I spent a couple of hours playing
with disabling gjournal, doing a full fsck on the given partition,
etc. It is a rare case, but it can happen.

>>> with fdisk + bsdlabel there are not enough partitions in one
>>> slice to hold all the journals, and as I already mentioned I
>>> really want to minimize recovery time. With gmirror + gjournal
>>> I'm able to activate the disk write cache without losing data
>>> consistency, which improves performance significantly.
>>
>> According to the following commit message, bsdlabel was extended to
>> 26 partitions 3 years ago.
>> http://lists.freebsd.org/pipermail/cvs-all/2007-December/239719.html
>> (I didn't test it yet, because I don't need it - we are using two
>> slices on our servers)
>
> I didn't know this, thanks for revealing it. I'm not sure if all BSD
> utilities can deal with this.
>
>>>> I see what you are trying to do and it would be nice if "all
>>>> works as one can expect", but the reality is different. So I
>>>> don't think it is a good idea to do it as you described.
>>>>
>>> I'm not yet fully convinced that my idea of disk partitioning is
>>> a bad one, so please let me take part in your negative
>>> experiences with gjournal. Thanks in advance.
>>
>> I am not saying that your idea is bad. It just contains some
>> things which I would rather avoid.
>
> To summarize some of the pros and cons of this method of disk
> partitioning:
> pros:
> - IMHO easy to configure
> - easy to recover from a failed disk (just replace it with a
>   suitable one and resync with gmirror, needs no preparation of the
>   new disk)
> - minimal downtime after unclean shutdowns (gjournal is responsible
>   for this, no painfully long fsck on large file systems)
> - disk write cache can and should be enabled (enhanced performance)
> - all disk / partition sizes are supported (even > 2TB)
> - 32 bit version of FreeBSD (i386) is sufficient (small and old
>   hardware remains usable)
>
> cons:
> - danger of overwriting gmirror metadata by an "unfriendly"
>   UEFI BIOS

- somewhat complex initial setup or future changes in partitioning
  (you must have prepared the right number of partitions for the
  journals, so adding more partitions is not so easy - with UFS2+SUJ
  or ZFS, you just add another partition)

> - to be continued ...
>
> Feel free to add some topics here which I am missing.
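For the archives, a minimal sketch of the kind of layout we are
discussing could look roughly like the commands below. This is only an
illustration - the device names, sizes and the GPT-on-top-of-the-mirror
layout are my assumptions, and boot / root partitions are left out:

gmirror load
gjournal load

# mirror the whole disks; gmirror keeps its metadata in the last sector
gmirror label -v -b round-robin gm0 /dev/ad0 /dev/ad1

# partition the mirror provider, not the raw disks - this is why the
# backup GPT ends up one sector short of the physical end of the disk
# (the UEFI concern above)
gpart create -s GPT mirror/gm0
gpart add -t freebsd-ufs -s 500G mirror/gm0    # data    -> mirror/gm0p1
gpart add -t freebsd-ufs -s 10G  mirror/gm0    # journal -> mirror/gm0p2

# pair the data partition with its journal partition and create the FS
gjournal label mirror/gm0p1 mirror/gm0p2
newfs -J /dev/mirror/gm0p1.journal
mount -o async /dev/mirror/gm0p1.journal /vol0

Replacing a failed disk is then just "gmirror forget gm0" followed by
"gmirror insert gm0 /dev/ad1" and waiting for the resynchronization,
which is the recovery pro listed above.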
One thing on my mind is a longstanding problem with gjournal on
heavily loaded servers:

Aug 16 01:48:28 praha kernel: fsync: giving up on dirty
Aug 16 01:48:30 praha kernel: 0xc44ba9b4: tag devfs, type VCHR
Aug 16 01:48:30 praha kernel: usecount 1, writecount 0, refcount 6941 mountedhere 0xc445b700
Aug 16 01:48:30 praha kernel: flags ()
Aug 16 01:48:30 praha kernel: v_object 0xc1548c00 ref 0 pages 192023
Aug 16 01:48:30 praha kernel: lock type devfs: EXCL (count 1) by thread 0xc42a7240 (pid 45)
Aug 16 01:48:30 praha kernel: dev mirror/gm0s2e.journal
Aug 16 01:48:30 praha kernel: GEOM_JOURNAL: Cannot suspend file system /vol0 (error=35).
Aug 16 02:32:34 praha kernel: fsync: giving up on dirty
Aug 16 02:32:34 praha kernel: 0xc44ba9b4: tag devfs, type VCHR
Aug 16 02:32:34 praha kernel: usecount 1, writecount 0, refcount 1418 mountedhere 0xc445b700
Aug 16 02:32:34 praha kernel: flags ()
Aug 16 02:32:34 praha kernel: v_object 0xc1548c00 ref 0 pages 128123
Aug 16 02:32:34 praha kernel: lock type devfs: EXCL (count 1) by thread 0xc42a7240 (pid 45)
Aug 16 02:32:34 praha kernel: dev mirror/gm0s2e.journal
Aug 16 02:32:34 praha kernel: GEOM_JOURNAL: Cannot suspend file system /vol0 (error=35).

These error messages are seen on them almost every second day, and
nobody has given me a suitable explanation of what they really mean /
what causes them. The only answer I got was something like "it is not
harmful"... then why is it logged at all?

So today I removed gjournal from the next older server. I will try
UFS2+SUJ with 9.0 as one of the possible ways for future setups where
ZFS cannot be used.

Miroslav Lachman