From: Adam Nowacki <nowakpl@platinum.linux.pl>
Date: Tue, 29 Jan 2013 11:51:35 +0100
To: Matthew Ahrens
Cc: fs@freebsd.org
Subject: Re: RAID-Z wasted space - asize roundups to nparity +1
Message-ID: <5107A9B7.5030803@platinum.linux.pl>

On 2013-01-28 22:55, Matthew Ahrens wrote:
> This is so that we won't end up with small, unallocatable segments.
> E.g. if you are using RAIDZ2, the smallest usable segment would be 3
> sectors (1 sector data + 2 sectors parity). If we left a 1 or 2 sector
> free segment, it would be unusable and you'd be able to get into strange
> accounting situations where you have free space but can't write because
> you're "out of space".

Sounds reasonable.

> The amount of waste due to this can be minimized by using larger
> blocksizes (e.g. the default recordsize of 128k and files larger than
> 128k), and by using smaller sector sizes (e.g. 512b sector disks rather
> than 4k sector disks). In your case these techniques would limit the
> waste to 0.6%.

This brings up another issue - the recordsize cap of 128KiB. We are using
the pool for off-line storage of large files (from 50MB to 20GB). Files
are stored and read sequentially as a whole. With 12 disks in RAID-Z2,
4KiB sectors, 128KiB recordsize and the padding described above, 9.4% of
disk space goes completely unused - one whole disk's worth (see the sketch
at the end of this mail for the arithmetic).

Increasing the recordsize cap seems trivial enough. The on-disk structures
and kernel code already support it - a single line of code had to be
changed (#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB
recordsizes. This of course breaks compatibility with any other system
without this modification. With Sun's cooperation this could be handled in
a safe and compatible manner via a pool version upgrade. A recordsize of
128KiB would remain the default, but anyone could increase it with zfs set.

The pool appears to work just fine, with 15TB copied so far from another
pool. Wasted disk space drops to 0.7%. Sequential read speed increased
from ~400MB/s to ~600MB/s. Writes stay about the same at ~300MB/s. So far,
however, I have not been able to boot from that pool: gptzfsboot required
a heap size increase and now appears to work, but zfsloader crashes and
I've become lost in the code.

I've also identified another way ZFS wastes disk space. When compression
is off, allocations are always a multiple of the recordsize, so with the
default recordsize of 128KiB a 129KiB file uses 256KiB of disk space (plus
parity and the other inefficiencies mentioned above). This may be there to
help with fragmentation, but then it would be good to have a setting to
turn it off - even if by means of a no-op compression that counts zeroes
backwards and returns a short psize. A sketch of the idea follows below.
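To make that concrete, here is a minimal standalone sketch - not ZFS code;
alloc_size() and trimmed_psize() are hypothetical helpers - of the
whole-record allocation and the proposed zero-trimming no-op compression:

#include <stddef.h>
#include <stdint.h>

/* With compression off, a file consumes whole recordsize multiples. */
static uint64_t
alloc_size(uint64_t filesize, uint64_t recordsize)
{
	return (((filesize + recordsize - 1) / recordsize) * recordsize);
}

/*
 * Hypothetical no-op "compressor": count zeroes backwards from the end
 * of the record, then round the surviving prefix up to whole sectors.
 * An all-zero record would return 0, i.e. could become a hole.
 */
static size_t
trimmed_psize(const uint8_t *buf, size_t len, size_t sectorsize)
{
	while (len > 0 && buf[len - 1] == 0)
		len--;
	return (((len + sectorsize - 1) / sectorsize) * sectorsize);
}

alloc_size() on a 129KiB file with 128KiB records returns 256KiB, while
trimming the mostly-zero tail record would shrink its psize to a single
4KiB sector.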
> --matt
>
> On Sun, Jan 27, 2013 at 5:01 AM, Adam Nowacki wrote:
>
>     I've just found something very weird in the ZFS code.
>
>     sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:504
>     in HEAD
>
>     Can someone explain the reason behind this line of code? What it
>     does is align the on-disk record size to a multiple of the number
>     of parity disks + 1 ... this really doesn't make any sense. So far
>     as I can tell those extra sectors are just padding - completely
>     unused.
>
>     For the array I'm using this results in 4.8% of wasted disk space -
>     1.7TB. It's a 12x 3TB disk RAID-Z2.
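PS: to put numbers behind the question quoted above, here is a standalone
model of the allocation arithmetic as I read it from vdev_raidz_asize() -
a sketch for illustration, not the actual kernel code:

#include <stdio.h>
#include <stdint.h>

#define ROUNDUP(x, m)	((((x) + (m) - 1) / (m)) * (m))

int
main(void)
{
	uint64_t cols = 12, nparity = 2;	/* 12-disk RAID-Z2 */
	uint64_t datacols = cols - nparity;
	uint64_t ashift = 12;			/* 4KiB sectors */
	uint64_t recsizes[] = { 128ULL << 10, 1ULL << 20 };

	for (int i = 0; i < 2; i++) {
		uint64_t dsec = recsizes[i] >> ashift;	/* data sectors */
		/* one parity sector per data row, per parity disk */
		uint64_t asize = dsec +
		    nparity * ((dsec + datacols - 1) / datacols);
		/* the roundup at vdev_raidz.c:504 */
		asize = ROUNDUP(asize, nparity + 1);
		double ideal = (double)dsec * cols / datacols;
		printf("%4lluKiB records: %llu sectors allocated, "
		    "%.1f ideal => %.1f%% waste\n",
		    (unsigned long long)(recsizes[i] >> 10),
		    (unsigned long long)asize, ideal,
		    100.0 * ((double)asize - ideal) / ideal);
	}
	return (0);
}

Compiled and run, this prints 9.4% waste for 128KiB records and 0.6% for
1MiB records, matching the figures above.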