From owner-freebsd-fs@FreeBSD.ORG Tue Jan 31 11:39:28 2012
From: Dennis Glatting <freebsd@penx.com>
To: Peter Maloney
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS sync / ZIL clarification
Date: Tue, 31 Jan 2012 03:39:20 -0800
Message-ID: <1328009960.24125.26.camel@btw.pki2.com>
In-Reply-To: <4F27A1B0.2060303@brockmann-consult.de>
References: <4F264B27.6060502@brockmann-consult.de> <1327955423.22960.0.camel@btw.pki2.com> <4F27A1B0.2060303@brockmann-consult.de>

On Tue, 2012-01-31 at 09:09 +0100, Peter Maloney wrote:
> On 01/30/2012 09:30 PM, Dennis Glatting wrote:
> > On Mon, 2012-01-30 at 08:47 +0100, Peter Maloney wrote:
> >> On 01/30/2012 05:30 AM, Mark Felder wrote:
> >>> I believe I was told something misleading a few weeks ago and I'd
> >>> like to have this officially clarified.
> >>>
> >>> NFS on ZFS is horrible unless you have sync=disabled.
> >> With ESXi = true
> >> with others = depends on your definition of horrible
> >>
> >>> I was told this was effectively disabling the ZIL, which is of course
> >>> naughty. Now I stumbled upon this tonight:
> >>>
> >> True only for the specific dataset you specify, e.g.
> >>   zfs set sync=disabled tank/esxi
> >>
> >>>> Just for the archives... sync=disabled won't disable the ZIL;
> >>>> it'll disable waiting for a disk flush on fsync etc.
> >> Same thing... "waiting for a disk flush" is the only time the ZIL is
> >> used, from what I understand.
> >>
> >>>> With a battery-backed controller cache, those flushes should go to
> >>>> cache, and be pretty much free. You end up tossing away something
> >>>> for nothing.
> >> False, I guess. It would be nice, but how do you battery-back your
> >> RAM, which ZFS uses as a write cache? (If you know something I don't
> >> know, please share.)
> >>
> >>> Is this accurate?
> >> sync=disabled caused data corruption for me. So you need to have
> >> battery-backed cache... unfortunately, the cache we are talking about
> >> is in RAM, not your I/O controller. So put a UPS on there, and you are
> >> safe except when you get a kernel panic (which is what happened to
> >> cause my corruption). But if you get something like the Gigabyte iRAM
> >> or the Acard ANS-9010, set it as your ZIL, and leave sync=standard,
> >> you should be safer. (I don't know if the iRAM works in FreeBSD, but
> >> someone told me he uses the ANS-9010.)
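For the archives, the two knobs being discussed here are a per-dataset
property and a pool-level log device. A rough sketch (pool and device
names are made up, and I have not run these exact lines on this box):

    # See which datasets honor sync writes (standard), skip the wait
    # entirely (disabled), or push everything through the ZIL (always).
    zfs get -r sync tank

    # Keep sync=standard but give the pool a dedicated log (SLOG)
    # device, e.g. a battery- or flash-backed unit, so synchronous
    # writes land on that device instead of the main vdevs.
    zpool add tank log da6
    zpool status tank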
> >> And NFS with ZFS is not horrible, except with ESXi's built-in NFS
> >> client it uses for datastores. (The same someone who said he uses the
> >> ANS-9010 also provides a 'patch' for the FreeBSD NFS server that
> >> disables ESXi's stupid behavior without disabling sync entirely, but
> >> it also possibly disables it for others that use it responsibly [a
> >> database, perhaps].)
> >>
> >> Here is a fantastic study about NFS; dunno if this study resulted in
> >> patches now in use or not, or how old it is [newest reference is 2002,
> >> so at most 10 years old]. In my experience, the write caching in use
> >> today still sucks. If I run async with sync=disabled, I can still see
> >> a huge improvement (20% on large files, up to 100% for smaller files
> >> <200MB) using an ESXi virtual disk (with ext4 doing write caching)
> >> compared to NFS directly.
> >>
> >> Here begins the rant about ESXi, which may be off topic:
> >>
> > ESXi 3.5, 4.0, 4.1, 5.0, or all of the above?
> >
> I didn't know 5.0.0 was available for free. Thanks for the notice.
>
I downloaded ESXi 5.0 when it was a free eval but have since licensed it.

> My testing has been with 4.1.0 build 348481, but if you look around on
> the net, you will find no official sensible workarounds/fixes/etc. They
> don't even acknowledge the issue is in the ESXi NFS client... even
> though it is obvious. So I doubt the problem will be fixed any time
> soon. Even using the "sync" option is discouraged, and they actually go
> and do the absolute worst thing and send O_SYNC with every write (even
> when saving the state of a VM; I turn off sync in ZFS when I do this).
> Some groups have solutions that mitigate but do not eliminate the
> problem. The issue also exists with other file systems and platforms,
> but it seems worst on ZFS. I couldn't find anything equivalent to those
> solutions that works on FreeBSD and ZFS. The closest is the patch I
> mentioned above
> (http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html),
> which possibly would result in data corruption for non-ESXi connections
> to your NFS server that responsibly use the O_SYNC flag. I didn't test
> that patch, because I would rather just throw away ESXi. I hate how much
> it limits you (no software RAID, no file system choice, no rsync, no
> firewall, top, iostat, etc.). And it handles network interruptions
> terribly... in some cases you need to reboot to get it to find all the
> .vmx files again. In other cases hacks work to reconnect to the NFS
> mounts.
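For the record, the middle ground that keeps coming up (neither patching
the NFS server nor disabling sync pool-wide) is to give the ESXi
datastore its own dataset and only relax sync there. A rough sketch on
FreeBSD; the dataset name and network are placeholders and I have not
run these exact lines:

    # Dataset used only by the ESXi datastore.
    zfs create tank/esxi
    zfs set sync=disabled tank/esxi

    # The rest of the pool keeps honest sync semantics (the default).
    zfs get -r sync tank

    # Export just that dataset, in exports(5) syntax, and poke mountd.
    printf '/tank/esxi -maproot=root -network 192.168.10.0 -mask 255.255.255.0\n' >> /etc/exports
    service mountd reload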
> But many just simply switch to iSCSI. And from what I've heard, iSCSI
> also sucks on ESXi with the default settings, but a single setting
> fixes most of the problem. I'm not sure if this applies to FreeBSD or
> ZFS (I didn't test it yet). Here are some pages from the StarWind forum
> (where we can assume their servers are Windows based):
>
A buddy does iSCSI by default. I can't say he ever tried NFS. He
mentioned performance questions but didn't have recent data.

My server, presently, is a PoS in need of a rebuild (it started out as
an ESXi 5.0 eval but then became useful) -- obtaining disks and other
priorities are the present impediment to a rebuild. I need to include
shares and I /think/ remote disks (I also want to do some analysis of
combining disparate remote disks). I've been working with big data
(<35TB) and want to assign an instance (FreeBSD) as one of my engines.
About 80% of my ESXi usage is prototyping and product eval.

> Here they say "doing Write-Back Cache helps but not completely"
> (Windows specific):
> http://www.starwindsoftware.com/forums/starwind-f5/esxi-iscsi-initiator-write-speed-t2398-15.html
>
> And here is something (Windows specific) about changing the ACK timing:
> http://www.starwindsoftware.com/forums/starwind-f5/esxi-iscsi-initiator-write-speed-t2398.html
>
> And here is some other page that ended up in my bookmarks:
> http://www.starwindsoftware.com/forums/starwind-f5/recommended-settings-for-esx-iscsi-initiator-t2296.html
>
> Somewhere on those three, or linked from them (I can't find it now),
> there are instructions to turn off "Delayed ACK" in ESXi:
>
> In ESXi, click the host.
> Click the "Configuration" tab.
> Click "Storage Adapters".
> Find and select the "iSCSI Software Adapter".
> Click "Properties" (a blue link on the right, in the "Details" section).
> Click "Advanced" (must be enabled or this button is greyed out).
> Look for the "Delayed ACK" option in there somewhere (at the end in my
> list) and uncheck the box.
>
> And this is said to improve things considerably, but I didn't test
> iSCSI at all on ESXi or ZFS.
>
> I wanted to test iSCSI on ZFS, but I found zvols to be buggy... so I
> decided to avoid them. So I am not very motivated to try again.
>
> I guess I can work around buggy zvols by using a loop device for a file
> instead of a zvol... but I am always too busy. Give it a few months.
>
When I looked into iSCSI/zvol, ZFS was v15 under FreeBSD and the
limitations were many. I haven't looked at v28.

I can't say I find ESXi the most wonderful thing in the world, but if I
started to rant this text would go on for pages.

Thanks for the info.
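On the zvol-versus-file question: the "loop device" style workaround is
just a flat file on a dataset handed to the iSCSI target instead of a
zvol. Roughly, with istgt from ports (path, size, and names are made up,
and I have not verified this exact setup):

    # File-backed LUN instead of a zvol:
    zfs create tank/iscsi
    truncate -s 100G /tank/iscsi/lun0.img

    # The zvol route, for comparison:
    zfs create -V 100G tank/lun0

The file (or the zvol, which shows up as /dev/zvol/tank/lun0) then gets
pointed at from the LogicalUnit section of /usr/local/etc/istgt/istgt.conf
with something like "LUN0 Storage /tank/iscsi/lun0.img 100GB"; the portal
and initiator group sections are assumed to be set up already.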
> >> ESXi goes 7 MB/s with an SSD ZIL at 100% load, and 80 MB/s with a
> >> ramdisk ZIL at 100% load (pathetic!). Something I can't reproduce
> >> (I thought it was just a normal Linux client with "-o sync" over
> >> 10 Gbps Ethernet) got over 70 MB/s with the ZIL at 70-90% load.
> >> Other clients set to "-o sync,noatime,..." or "-o noatime,..." put
> >> the ZIL at only 0-5% load at random, but go faster than 100 MB/s.
> >> I didn't test "async", and without "sync" they seem to go the same
> >> speed. Setting sync=disabled always goes around 100 MB/s and drops
> >> the load on the ZIL to 0%.
> >>
> >> The thing I can't reproduce might have been only possible on a pool
> >> that I created with FreeBSD 8.2-RELEASE and then upgraded, which I
> >> no longer have. Or maybe it was with "sync" without "noatime".
> >>
> >> I am going to test with 9000 MTU, and if it is not much faster, I am
> >> giving up on NFS. My original plan was to use ESXi with a ZFS
> >> datastore with a replicated backup. That works terribly using the
> >> ESXi NFS client. Netbooting the OSes to bypass the ESXi client works
> >> much better, but still not good enough for many servers. NFS is
> >> poorly implemented, with terrible write caching on the client side.
> >> Now my plan is to use FreeBSD with VirtualBox and ZFS all in one
> >> system, and send replication snapshots from there. I wanted to use
> >> ESXi, but I guess I can't.
> >>
> >> And the worst thing about ESXi is that if you have one client going
> >> 7 MB/s, a second client has to share that 7 MB/s, and non-ESXi
> >> clients will still go horribly slow. If you have 10 non-ESXi clients
> >> going at 100 MB/s, each one is limited to around 100 MB/s (again, I
> >> only tested this with 1500 MTU so far), but together they can write
> >> much more.
> >>
> >> Just now I tested 2 clients writing 100+100 MB/s (reported by GNU
> >> dd), and 3 writing 50+60+60 MB/s (reported by GNU dd).
> >> Output from "zpool iostat 5":
> >> two clients:
> >>   tank  38.7T  4.76T      0  1.78K  25.5K   206M   (matches 100+100)
> >> three clients:
> >>   tank  38.7T  4.76T      1  2.44K   205K   245M   (does not match 60+60+50)
> >>
> >> (one client is a Linux netboot, and the others are using the Linux
> >> NFS client)
> >>
> >> But I am not an 'official', so this cannot be considered 'officially
> >> clarified' ;)
> >>
> >>> _______________________________________________
> >>> freebsd-fs@freebsd.org mailing list
> >>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> >>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
> >>
> >
>
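A footnote on repeating the numbers above: the per-client figures come
from GNU dd on the Linux clients while watching the pool from the
server. Something along these lines; the mount point and file name are
made up, and conv=fsync is my guess at how the writes were flushed:

    # On each Linux NFS client (GNU dd):
    dd if=/dev/zero of=/mnt/nfs/ddtest.client1 bs=1M count=4096 conv=fsync

    # On the FreeBSD server, aggregate write bandwidth every 5 seconds:
    zpool iostat tank 5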