From: Florent Rivoire <florent@rivoire.fr>
Date: Tue, 18 Jan 2022 17:07:48 +0100
Subject: Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
To: Alan Somers
Cc: Rich, freebsd-fs
On Tue, Jan 18, 2022 at 3:23 PM Alan Somers wrote:
> However, I would suggest that you don't bother. With a 128kB recsize,
> ZFS has something like a 1000:1 ratio of data:metadata. In other
> words, increasing your recsize can save you at most 0.1% of disk
> space. Basically, it doesn't matter. What it _does_ matter for is
> the tradeoff between write amplification and RAM usage. 1000:1 is
> comparable to the disk:ram of many computers. And performance is more
> sensitive to metadata access times than data access times. So
> increasing your recsize can help you keep a greater fraction of your
> metadata in ARC. OTOH, as you remarked increasing your recsize will
> also increase write amplification.

In the attached zdb files (for 128K recordsize), we can see that the
"L0 ZFS plain file" objects use 99.89% of the space in my test zpool.
So the ratio in my case is indeed about 1000:1, as you said. I had that
rule-of-thumb in mind, but thanks for reminding me!

As quickly mentioned in my first email, the context is that I'm
considering using a mirror of SSDs as "special devices" for a new
zpool which will still be mainly made of magnetic HDDs (raidz2 of
5x3TB). And to really take advantage of those SSDs, I'm probably going
to set special_small_blocks to a value > 0. Of course, I don't want to
put all the data on the SSDs, so I need to keep "special_small_blocks"
strictly below "recordsize", so that ZFS splits the blocks into two
groups: the "small" ones => special vdev, vs the rest => main vdevs.

So basically, I see two kinds of solutions:
1) keep the default recordsize (128K), and define a fairly small
special_small_blocks (64K or 32K)
2) increase recordsize (probably to 1M) and be able to define a higher
special_small_blocks (maybe 128K, 256K or 512K; I'm testing)

I'm leaning towards the 2nd option, because:
- it puts a higher percentage of the data on the SSDs (solution 1 only
stores ~0.1% of the data on the SSDs, vs ~0.3% for solution 2 with 512K
small blocks, and my SSDs will have 3% of the HDD vdevs' capacity)
- the "high" recordsize of 1M looks good for my use-cases (files
usually between a few MB and a few hundred MB, written sequentially and
never overwritten, so no risk of write amplification on this dataset)

So, my goal is not to optimize disk usage (indeed: 0.1% is nothing),
but to optimize read/write performance at a small cost (using small
SSDs as a special vdev).
And also, it's more about doing some tech experiments with ZFS (because
it's fun, I'm learning, and it's my home NAS) than about a real need :)

--
Florent
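
For reference, a minimal sketch of the setup being discussed above; the
pool, dataset and device names (tank, tank/data, ada0-ada4, nvd0, nvd1)
are illustrative assumptions, not taken from the thread:

    # Create the pool: a raidz2 of five HDDs plus a mirrored SSD special vdev.
    # (An existing pool can get the special vdev later with "zpool add".)
    zpool create tank \
        raidz2 /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4 \
        special mirror /dev/nvd0 /dev/nvd1

    # Option 2 from the email: large records, blocks up to 512K go to the SSDs.
    zfs create tank/data
    zfs set recordsize=1M tank/data
    zfs set special_small_blocks=512K tank/data

    # Check the settings and where the data ends up.
    zfs get recordsize,special_small_blocks tank/data
    zpool list -v tank    # per-vdev usage, including the special mirror
    zdb -bb tank          # per-object-type breakdown, as in the attached zdb output

Note that recordsize and special_small_blocks only affect blocks written
after the properties are set, so they are worth setting before copying
the data in.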