From: Freddie Cash <fjwcash@gmail.com>
To: Adam Nowacki
Cc: Matthew Ahrens, fs@freebsd.org
Date: Tue, 29 Jan 2013 09:05:33 -0800
Subject: Re: RAID-Z wasted space - asize roundups to nparity+1
List-Id: Filesystems (freebsd-fs)
On Jan 29, 2013 2:52 AM, "Adam Nowacki" wrote:
>
> On 2013-01-28 22:55, Matthew Ahrens wrote:
>>
>> This is so that we won't end up with small, unallocatable segments.
>> E.g. if you are using RAIDZ2, the smallest usable segment would be 3
>> sectors (1 sector data + 2 sectors parity). If we left a 1 or 2 sector
>> free segment, it would be unusable and you'd be able to get into strange
>> accounting situations where you have free space but can't write because
>> you're "out of space".
>
> Sounds reasonable.
>
>> The amount of waste due to this can be minimized by using larger
>> blocksizes (e.g. the default recordsize of 128k and files larger than
>> 128k), and by using smaller sector sizes (e.g. 512b sector disks rather
>> than 4k sector disks). In your case these techniques would limit the
>> waste to 0.6%.
>
> This brings up another issue - the recordsize is capped at 128KiB. We are
> using the pool for off-line storage of large files (from 50MB to 20GB).
> Files are stored and read sequentially as a whole. With 12 disks in
> RAID-Z2, 4KiB sectors, 128KiB record size, and the padding above, 9.4% of
> disk space goes completely unused - one whole disk.
>
> Increasing the recordsize cap seems trivial enough. On-disk structures
> and kernel code support it already - a single line of code had to be
> changed (#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB
> recordsizes. This of course breaks compatibility with any other system
> without this modification. With Sun's cooperation this could be handled
> in a safe and compatible manner via a pool version upgrade. A recordsize
> of 128KiB would remain the default, but anyone could increase it with
> zfs set.

There's work upstream (Illumos, I believe, maybe Delphix?) to add support
for recordsizes above 128 KB. It'll be added as a feature flag, so it will
only be compatible with open-source ZFS.
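The allocation arithmetic behind the figures in this thread can be sketched numerically. The following is a minimal Python sketch mirroring the RAID-Z asize logic as described here (data sectors, plus nparity parity sectors per possibly-partial stripe, rounded up to a multiple of nparity + 1); the helper names are illustrative, not from the ZFS source:

```python
def raidz_asize_sectors(psize, ndisks, nparity, ashift):
    """Sectors allocated for a block of psize bytes on a RAID-Z vdev.

    Follows the allocation rule discussed above: round the block up to
    whole sectors, add nparity parity sectors for each (possibly
    partial) data stripe, then round the total up to a multiple of
    nparity + 1 so no unallocatable 1- or 2-sector gap is left behind.
    """
    sector = 1 << ashift
    ndata = ndisks - nparity
    asize = (psize + sector - 1) // sector               # data sectors
    asize += nparity * ((asize + ndata - 1) // ndata)    # parity sectors
    asize = -(-asize // (nparity + 1)) * (nparity + 1)   # roundup
    return asize

def overhead(psize, ndisks, nparity, ashift):
    """Fraction allocated beyond the ideal data:parity ratio."""
    ideal = (psize / (1 << ashift)) * ndisks / (ndisks - nparity)
    return raidz_asize_sectors(psize, ndisks, nparity, ashift) / ideal - 1

# 12-disk RAID-Z2, 4 KiB sectors (ashift=12), 128 KiB records:
# 42 sectors allocated vs. an ideal 38.4 -> ~9.4% extra, as Adam reports.
print(round(overhead(128 * 1024, 12, 2, 12) * 100, 1))  # 9.4

# Same pool with 512 B sectors (ashift=9): ~0.6%, as Matthew notes.
print(round(overhead(128 * 1024, 12, 2, 9) * 100, 1))   # 0.6
```

A 1 MiB recordsize on 4 KiB sectors gives the same 256-data-sector geometry as the second case, which is why raising SPA_MAXBLOCKSHIFT shrinks the waste to roughly 0.6% as well.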