Date:      Fri, 18 Jun 2021 17:39:55 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        joe mcguckin <joe@via.net>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: ZFS config question
Message-ID:  <CAOtMX2grm9UFST0uN6nbVDCEEFPCYn+7d3XBH__w5xKr=2i=-Q@mail.gmail.com>
In-Reply-To: <43127C8C-8CEA-4796-A906-E2149B4262DE@via.net>
References:  <43127C8C-8CEA-4796-A906-E2149B4262DE@via.net>

You definitely don't want 60 drives in the same RAIDZ vdev, and this is why:

RAIDZ1 is not the same layout as traditional RAID5 (ditto with RAIDZ2 and
RAID6).  With RAID5, each set of data+parity chunks is distributed over all
of the disks.  For example, an 8+1 array is composed of identical rows that
each have 8 data chunks and 1 parity chunk of perhaps a few dozen KB per
chunk.  But with RAIDZ, each set of data+parity chunks is distributed over
as many disks as are needed for _a_single_record_.  For example, in that
same 8+1 array with ashift=12, a 32KB record would be divided into 8 data
chunks and 1 parity chunk of 4KB apiece.  But a 16KB record would be divided
into only _4_ data chunks and 1 parity chunk of 4KB apiece.  So small
records are less space-efficient to store on RAIDZ, and the problem gets
worse the larger the RAIDZ vdev.  In fact, the problem is a little bit
worse than this example shows, due to padding blocks.  I won't go into
those right now.
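
To make that arithmetic concrete, here is a rough back-of-the-envelope
sketch in Python.  It is not the real allocator (it ignores the padding
blocks mentioned above), and the 8+1 / ashift=12 numbers are just the
example values from this paragraph:

    # Approximate RAIDZ1 allocation for a single record (padding ignored).
    def raidz1_alloc(record_bytes, data_disks, ashift):
        sector = 1 << ashift                        # ashift=12 -> 4KB sectors
        data_sectors = -(-record_bytes // sector)   # ceiling division
        cols = min(data_sectors, data_disks)        # data columns actually used
        rows = -(-data_sectors // data_disks)       # stripe rows
        parity_sectors = rows                       # one parity sector per row
        total_bytes = (data_sectors + parity_sectors) * sector
        return cols, total_bytes, record_bytes / total_bytes

    print(raidz1_alloc(32 * 1024, 8, 12))  # 8 data columns, ~89% space efficiency
    print(raidz1_alloc(16 * 1024, 8, 12))  # only 4 data columns, 80% space efficiency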

But it's not just space efficiency, it's IOPs too.  In our 8+1 RAID5 array,
if the chunksize is 64KB or larger, then randomly reading a 64KB record
requires just a single operation from a single disk.  But reading a 64KB
record from an 8+1 RAIDZ array requires a single operation from _8_ disks.
So RAIDZ has worse IOPs than RAID5.  Basically, if a single disk has X read
IOPs, then n+m RAID5 provides n * X read IOPs, but n+m RAIDZ only provides
X.
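
As a rough model of that (the 200 IOPs per disk below is an assumed figure,
purely for illustration):

    # Random-read IOPs of an 8+1 group, given an assumed 200 IOPs per disk.
    per_disk_iops = 200
    n, m = 8, 1

    raid5_read_iops = n * per_disk_iops   # reads spread over the data disks -> 1600
    raidz_read_iops = per_disk_iops       # every read touches all data disks -> 200
    print(raid5_read_iops, raidz_read_iops)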

But it's not just space efficiency and IOPs, it's rebuild time, too.  When
rebuilding a failed disk, whether it's RAID5 or RAIDZ, you basically have
to read the full contents of every other disk in the RAID group (slightly
less for RAIDZ, for the reasons discussed in paragraph 2).  For large RAID
arrays, this can take a lot of IOPs and CPU cycles away from servicing
user-facing requests.  ZFS's dRAID is a partial improvement, but only a
partial one.
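
For a sense of scale (the 16TB drive size is made up for illustration, and a
resilver only has to read allocated space, so this is an upper bound):

    # Data read from the survivors to rebuild one disk in a full 8+1 group.
    drive_tb = 16                          # assumed drive size
    group = 9                              # an 8+1 group
    print((group - 1) * drive_tb)          # 128 TB read to rebuild one 16 TB disk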

The best size of RAIDZ for you depends on the typical record size you're
going to have, your random read IOPs requirement, the ashift of your
drives, and how much performance hit you're willing to accept during
rebuild.  But 60 is way too many.
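
For example, with the same rough model as above (the 6 x (8+2) split, the
57+3 alternative, and the 200 IOPs per disk are illustrative numbers only,
not a recommendation):

    # Two ways to lay out 60 disks, compared on random-read IOPs.
    per_disk_iops = 200                    # assumed per-disk random-read IOPs

    six_raidz2 = 6 * per_disk_iops         # 6 vdevs of 8+2: reads spread across
                                           # the vdevs -> ~1200 IOPs, rebuild
                                           # touches 9 disks
    one_raidz3 = 1 * per_disk_iops         # 1 vdev of 57+3: the pool performs
                                           # like one disk -> ~200 IOPs, rebuild
                                           # touches every disk in the pool
    print(six_raidz2, one_raidz3)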

-Alan

On Fri, Jun 18, 2021 at 5:21 PM joe mcguckin <joe@via.net> wrote:

> If I have a box with 60 SAS drives - Why not make it one big RAID volume?
>
> Is there a benefit to a filesystem composed of multiple, smaller VDEVS vs
> one giant 40-50 drive zpool?
>
> Are there guidelines or rules-of-thumb for sizing vdevs and zpools?
>
> Thanks,
>
> Joe
>
> Joe McGuckin
> ViaNet Communications
>
> joe@via.net
> 650-207-0372 cell
> 650-213-1302 office
> 650-969-2124 fax
>
>
>
>
