Date: Fri, 18 Jun 2021 17:39:55 -0600
From: Alan Somers <asomers@freebsd.org>
To: joe mcguckin <joe@via.net>
Cc: freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: ZFS config question
Message-ID: <CAOtMX2grm9UFST0uN6nbVDCEEFPCYn+7d3XBH__w5xKr=2i=-Q@mail.gmail.com>
In-Reply-To: <43127C8C-8CEA-4796-A906-E2149B4262DE@via.net>
References: <43127C8C-8CEA-4796-A906-E2149B4262DE@via.net>
You definitely don't want 60 drives in the same RAIDZ vdev, and this is why: RAIDZ1 is not the same layout as traditional RAID5 (ditto for RAIDZ2 and RAID6).

With RAID5, each set of data+parity chunks is distributed over all of the disks. For example, an 8+1 array is composed of identical rows that each have 8 data chunks and 1 parity chunk of perhaps a few dozen KB per chunk. But with RAIDZ, each set of data+parity chunks is distributed over only as many disks as are needed for _a_single_record_. For example, in that same 8+1 array, a 32KB record would be divided into 8 data chunks and 1 parity chunk of 4KB apiece. But assuming ashift=12, a 16KB record would be divided into only _4_ data chunks and 1 parity chunk of 4KB apiece. So small records are less space efficient to store on RAIDZ, and the gap gets worse the wider the RAIDZ vdev, because a wide vdev never comes close to its nominal parity overhead unless records are large. In fact, the problem is a little bit worse than this example shows, due to padding blocks; I won't go into those right now. (I've put a rough sketch of this arithmetic after the quoted message below.)

But it's not just space efficiency, it's IOPs too. In our 8+1 RAID5 array, if the chunk size is 64KB or larger, then randomly reading a 64KB record requires just a single operation on a single disk. But reading a 64KB record from an 8+1 RAIDZ array requires an operation on each of _8_ disks. So RAIDZ has worse random-read IOPs than RAID5. Basically, if a single disk delivers X read IOPs, then an n+m RAID5 provides about n * X read IOPs, but an n+m RAIDZ only provides about X.

And it's not just space efficiency and IOPs, it's rebuild time, too. When rebuilding a failed disk, whether it's RAID5 or RAIDZ, you basically have to read the full contents of every other disk in the RAID group (slightly less for RAIDZ, for the reasons discussed above). For a large RAID group, this takes a lot of IOPs and CPU cycles away from servicing user-facing requests. ZFS's dRAID is an improvement here, but only a partial one.

The best RAIDZ width for you depends on the typical record size you're going to store, your random-read IOPs requirement, the ashift of your drives, and how much of a performance hit you're willing to accept during rebuild. But 60 is way too many. (An example of a more typical layout also follows below the quote.)

-Alan

On Fri, Jun 18, 2021 at 5:21 PM joe mcguckin <joe@via.net> wrote:
> If I have a box with 60 SAS drives - Why not make it one big RAID volume?
>
> Is there a benefit to a filesystem composed of multiple, smaller VDEVS vs
> one giant 40-50 drive zpool?
>
> Are there guidelines or rules-of-thumb for sizing vdevs and zpools?
>
> Thanks,
>
> Joe
>
> Joe McGuckin
> ViaNet Communications
>
> joe@via.net
> 650-207-0372 cell
> 650-213-1302 office
> 650-969-2124 fax
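P.S. To put rough numbers on the space-efficiency point above, here's a quick Python sketch. It's my own simplified model of the RAIDZ allocation rule (data sectors, one parity sector per stripe, and the total rounded up to a multiple of nparity+1 sectors to stand in for the padding blocks), not the exact ZFS allocator, and the widths and record sizes are just illustrations:

import math

def raidz_alloc_sectors(record_bytes, width, nparity=1, ashift=12):
    """Sectors consumed by one record on a width-disk RAIDZ-nparity vdev."""
    sector = 1 << ashift
    data = math.ceil(record_bytes / sector)          # data sectors the record needs
    stripes = math.ceil(data / (width - nparity))    # stripes the record spans
    parity = stripes * nparity                       # one parity sector per stripe
    total = data + parity
    # RAIDZ rounds each allocation up to a multiple of (nparity + 1) sectors
    # so freed gaps stay usable; this stands in for the padding blocks above.
    total += (-total) % (nparity + 1)
    return total

for width in (9, 61):                      # 8+1 vs. the proposed 60+1
    nominal = (width - 1) / width          # efficiency a RAID5 layout would suggest
    for rec_kb in (16, 32, 128):
        data = rec_kb * 1024 // 4096
        used = raidz_alloc_sectors(rec_kb * 1024, width)
        print(f"{width:2d}-wide RAIDZ1, {rec_kb:3d}K record: "
              f"{data}/{used} sectors hold data "
              f"({data / used:.0%} efficient vs. {nominal:.0%} nominal)")

The takeaway: the 60+1 vdev only approaches its nominal 98% space efficiency for very large records, while a small record pays the same absolute parity and padding cost no matter how wide the vdev is.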
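P.P.S. On the layout question: a more typical way to use 60 disks is several narrower RAIDZ2 vdevs striped together in one pool, for example six 10-disk groups. Purely as a sketch (the pool name and the daX device names are placeholders for whatever your 60 SAS drives show up as):

zpool create tank \
    raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  \
    raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19 \
    raidz2 da20 da21 da22 da23 da24 da25 da26 da27 da28 da29 \
    raidz2 da30 da31 da32 da33 da34 da35 da36 da37 da38 da39 \
    raidz2 da40 da41 da42 da43 da44 da45 da46 da47 da48 da49 \
    raidz2 da50 da51 da52 da53 da54 da55 da56 da57 da58 da59

Because ZFS stripes records across top-level vdevs, that arrangement gets you roughly the random-read IOPs of six vdevs instead of one, and a resilver only has to read the nine surviving disks of the affected group.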