From: Daniel Kalchev <daniel@digsys.bg>
Date: Fri, 05 Jul 2013 11:02:01 +0300
To: freebsd-fs@freebsd.org
Subject: Re: Slow resilvering with mirrored ZIL
Message-ID: <51D67D79.3030403@digsys.bg>
In-Reply-To: <20130704202818.GB97119@icarus.home.lan>

On 04.07.13 23:28, Jeremy Chadwick wrote:
>
> I'm not sure of the impact in situations like "I had a vdev made long
> ago (ashift 9), then I added a new vdev to the pool (ashift 12) and now
> ZFS is threatening to murder my children..." :-)

Such a situation led me to spend a few months recreating/reshuffling some
40TB of snapshots -- mostly because I was too lazy to build a new system to
copy to, and the old one didn't have enough spare slots... To make things
more interesting, I had made the ashift=9 vdev on drives with 4K sectors and
the ashift=12 vdev on drives with 512-byte sectors...

Which brings up the question of whether it is possible to roll back this new
vdev addition easily -- errors happen...

> But it would also be doing a TRIM of the LBA ranges associated with each
> partition, rather than the entire SSD.
>
> Meaning, in the example I gave (re: leaving untouched/unpartitioned
> space at the end of the drive for wear levelling), this would result in
> the untouched/unpartitioned space never being TRIM'd (by anything), thus
> the FTL map would still have references to those LBA ranges. That'd be
> potentially 30% of LBA ranges in the FTL (depending on past I/O of
> course -- no way to know), and the only thing that would work that out
> is the SSD's GC (which is known to kill performance if/when it kicks
> in).

This assumes some knowledge of how SSDs operate, which might be true for one
model/maker and not for another. No doubt, starting with a clean drive is
best. That can be achieved by adding the entire drive to ZFS and then
removing it -- a cheap way to get a "Secure Erase" effect on FreeBSD. Then
go on with partitioning...
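For illustration, the trick might look like this -- assuming the SSD shows
up as ada1 (a placeholder name) and that the ZFS TRIM code trims a whole
device when it is added as a vdev, which is what makes the trick work:

    sysctl vfs.zfs.trim.enabled     # 1 = ZFS issues TRIM to its vdevs
    zpool create -f scratch ada1    # whole-device vdev; ZFS trims it end to end
    zpool destroy scratch           # throw the scratch pool away again
    gpart create -s gpt ada1        # ...then partition the now-clean drive

(Sysctl name and trim-on-add behaviour are as of the current FreeBSD ZFS
TRIM code; verify on your own version before trusting a drive's "clean"
state to it.)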
> Hmm, that gives me an idea actually -- if gpart(8) itself had a flag to
> induce TRIM for the LBA range of whatever was just created (gpart
> create) or added (gpart add). That way you could actually induce TRIM
> on those LBA ranges rather than rely on the FS to do it, or have to put
> faith into the SSD's GC (I rarely do :P). In the OP's case he could
> then make a freebsd-zfs partition filling up the remaining 30% with the
> flag to TRIM it, then when that was done immediately delete the
> partition. Hmm, not sure if what I'm saying makes sense or not, or if
> that's even a task/role gpart(8) should have...

Not a bad idea. Really. :)

>> ...
>>> Next topic...
>>>
>>> I would strongly recommend you not use 1 SSD for both log and cache.
>>> I understand your thought process here: "if the SSD dies, the log
>>> devices are mirrored so I'm okay, and the cache is throw-away anyway".
>>
>> While not ideal, it still gives a significant boost against no SLOG, so
>> if that's the HW you have to work with, don't discount the benefit it
>> will bring.
>
> Sure, the advantage of no seek times due to NAND plays a big role, but
> some of these drives don't particularly perform well when used with a
> larger I/O queue depth.

If we are talking about the SLOG, there are no seeks: the SLOG is written
sequentially. You *can* use a spinning drive for the SLOG and you *will*
see a noticeable performance boost in doing so. The L2ARC, on the other
hand, is specifically designed for no-seek SSDs, as it does many small and
scattered reads. Writes to it are still sequential, I believe...

> Now consider this: the Samsung 840 256GB (not the Pro) costs US$173
> and will give you 2x the performance of that Intel drive -- and more
> importantly, 12x the capacity (that means 30% for wear levelling is
> hardly a concern). The 840 also performs significantly better at higher
> queue depths. I'm just saying that for about US$40 more you get
> something that is by far better and will last you longer. Low-capacity
> SSDs, even if SLC, are incredibly niche and I'm still not sure what
> demographic they're catering to.

The non-Pro 840 is hardly a match for any SLC SSD. Remember, SLC is all
about endurance: it is order(s) of magnitude more enduring than the TLC
flash used in that cheap consumer drive. IOPS and interface speed are a
different matter -- they may not be the concern here. Nevertheless, I have
recently begun to view SLOG/L2ARC SSDs as consumables... yet no matter how
I run the numbers, the enterprise drives always win by a big margin...

> I'm making a lot of assumptions about his I/O workload too, of course.
> I myself tend to stay away from cache/log devices for the time being
> given that my workloads don't necessitate them. Persistent cache (yeah
> I know it's on the todo list) would interest me since the MCH on my
> board is maxed out at 8GB.

In short... be careful. :) Don't be tempted to add too large an L2ARC with
only 8GB of RAM. :)

Daniel
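P.S. A rough way to see what the L2ARC bookkeeping costs in RAM on FreeBSD
(the sysctl names come from the ZFS arcstats tree; the per-record header
size is only an approximation, on the order of a couple hundred bytes
depending on the ZFS version):

    sysctl vfs.zfs.arc_max                      # upper bound on ARC size
    sysctl kstat.zfs.misc.arcstats.size         # current ARC size
    sysctl kstat.zfs.misc.arcstats.l2_size      # data held on the cache device(s)
    sysctl kstat.zfs.misc.arcstats.l2_hdr_size  # RAM consumed by L2ARC headers

As a worked example: with an 8KB average record size, a 200GB L2ARC holds
roughly 25 million records; at ~200 bytes of header each, that is about 5GB
of ARC gone to bookkeeping -- most of an 8GB machine.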