Date:      Fri, 8 Apr 2022 13:14:24 +0200
From:      Stefan Esser <se@FreeBSD.org>
To:        egoitz@ramattack.net
Cc:        freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner <rainer@ultra-secure.de>
Subject:   Re: Desperate with 870 QVO and ZFS
Message-ID:  <b9dba1b4-1db1-9d73-da8a-080906c8e146@FreeBSD.org>
In-Reply-To: <e3ccbea91aca7c8870fd56ad393401a4@ramattack.net>
References:  <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <dd9a55ac-053d-7802-169d-04c95c045ed2@FreeBSD.org> <ce51660b5f83f92aa9772d764ae12dff@ramattack.net> <e4b7252d-525e-1c0f-c22b-e34b96c1ce83@FreeBSD.org> <e3ccbea91aca7c8870fd56ad393401a4@ramattack.net>

On 07.04.22 at 14:30, egoitz@ramattack.net wrote:
> On 2022-04-06 23:49, Stefan Esser wrote:
>>>
>>> On 2022-04-06 17:43, Stefan Esser wrote:
>>>
>>>
>>>     On 06.04.22 at 16:36, egoitz@ramattack.net wrote:
>>>
>>>         Hi Rainer!
>>>
>>>         Thank you so much for your help :) :)
>>>
>>>         Well I assume they are in a datacenter and there should not be a
>>>         power outage....
>>>
>>>         About dataset size... yes... ours are big... they can easily be
>>>         3-4 TB each dataset.....
>>>
>>>         We bought them because, as they are for mailboxes and mailboxes
>>>         grow and grow.... to have space for hosting them...
>>>
>>>
>>>     Which mailbox format (e.g. mbox, maildir, ...) do you use?
>>>
>>>     *I'm running Cyrus imap so sort of Maildir... too many little files
>>>     normally..... Sometimes directories with tons of little files....*
>>>
>> Assuming that many mails are much smaller than the erase block size of the
>> SSD, this may cause issues. (You may know the following ...)
>>
>> For example, if you have message sizes of 8 KB and an erase block size of
>> 64 KB (just guessing), then 8 mails will be in an erase block. If half the
>> mails are deleted, then the erase block will still occupy 64 KB, but only
>> hold 32 KB of useful data (and the SSD will only be aware of this fact if
>> TRIM has signaled which data is no longer relevant). The SSD will copy
>> several partially filled erase blocks together in a smaller number of free
>> blocks, which then are fully utilized. Later deletions will repeat this
>> game, and your data will be copied multiple times until it has aged (and
>> the user is less likely to delete further messages). This leads to "write
>> amplification" - data is internally moved around and thus written multiple
>> times.
>>
>>
>> *Stefan!! you are nice!! I think this could explain all our problems. So
>> that's why we are having the most randomness in our performance degradation
>> and it does not necessarily have to match the highest io peak hours... So I
>> could cause that performance degradation just by deleting a couple of huge
>> (perhaps 200,000 mails) mail folders at a mid-traffic hour!!*
>>
Yes, if deleting large amounts of data triggers performance issues (and the
disk does not have a deficient TRIM implementation), then the issue is likely
to be due to internal garbage collections colliding with other operations.
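
For ZFS to signal deletions to the SSDs at all, TRIM has to be active on the
pool. A minimal check/enable sketch (assuming OpenZFS 2.x as shipped with
FreeBSD 13; the pool name is taken from your zpool list output further down):

  zpool get autotrim mail_dataset      # should be "on" for continuous TRIM
  zpool set autotrim=on mail_dataset
  zpool trim mail_dataset              # one-off TRIM of all free space
  zpool status -t mail_dataset         # per-vdev TRIM state/progress

(On older releases with the legacy ZFS code the equivalent knobs are the
vfs.zfs.trim.* sysctls; the commands above only apply to OpenZFS.)
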
>>
>> *The problem is that, from what I know, the erase block size of an SSD disk
>> is something fixed in the disk firmware. I don't really know if perhaps it
>> could be modified with Samsung Magician or that kind of Samsung tool....
>> otherwise I don't really see a way of improving it... because apart from
>> that, you are deleting a file in a raidz-2 array... not just on one disk...
>> I assume aligning chunk size with record size and with the "secret" erase
>> size of the ssd could perhaps compensate slightly?.*
>>
The erase block size is a fixed hardware feature of each flash chip. There is
a block size for writes (e.g. 8 KB) and many such blocks are combined in one
erase block (of e.g. 64 KB, probably larger in today's SSDs); they can only be
returned to the free block pool all together. And if some of these writable
blocks hold live data, they must be preserved by collecting them in newly
allocated free blocks.

An example of what might happen, showing a simplified layout of files 1, 2, 3
(with writable blocks 1a, 1b, ..., 2a, 2b, ... and "--" for stale data of
deleted files, ".." for erased/writable flash blocks) in an SSD, might be:

erase block 1: |1a|1b|--|--|2a|--|--|3a|

erase block 2: |--|--|--|2b|--|--|--|1c|

erase block 3: |2c|1d|3b|3c|--|--|--|--|

erase block 4: |..|..|..|..|..|..|..|..|

This is just a random example of how data could be laid out on the physical
storage array. It is assumed that the 3 erase blocks were once completely
occupied.

In this example, 10 of 32 writable blocks are occupied, and only one free
erase block exists.

This situation cannot be allowed to persist, since the SSD needs more empty
erase blocks. Only 10/32 of the capacity is used for data, but 3/4 of the
erase blocks are occupied and not immediately available for new data.

The garbage collection might combine erase blocks 1 and 3 into a currently
free one, e.g. erase block 4:

erase block 1: |..|..|..|..|..|..|..|..|

erase block 2: |--|--|--|2b|--|--|--|1c|

erase block 3: |..|..|..|..|..|..|..|..|

erase block 4: |1a|1b|2a|3a|2c|1d|3b|3c|

Now only 2/4 of the capacity is not available for new data (which is still a
lot more than the 10/32 actually in use, but better than before).

Now assume file 2 is deleted:

erase block 1: |..|..|..|..|..|..|..|..|

erase block 2: |--|--|--|--|--|--|--|1c|

erase block 3: |..|..|..|..|..|..|..|..|

erase block 4: |1a|1b|--|3a|--|1d|3b|3c|

There is now a new sparsely used erase block 4, and it will soon need to be
garbage collected, too - in fact it could be combined with the live data from
erase block 2, but this may be delayed until there is demand for more erased
blocks (since e.g. file 1 or 3 might also have been deleted by then).

The garbage collection does not know which data blocks belong to which file,
and therefore it cannot collect the data belonging to a file into a single
erase block. Blocks are allocated as data comes in (as long as enough SLC
cells are available in this area, else directly in QLC cells). Your many
parallel updates will cause fractions of each larger file to be spread out
over many erase blocks.

As you can see, a single file that is deleted may affect many erase blocks,
and you have to take redundancy into consideration, which will multiply the
effect by a factor of up to 3 for small files (one ZFS allocation block). And
remember: deleting a message in mdir format will free the data blocks, but
will also remove the directory entry, causing additional meta-data writes
(again multiplied by the raid redundancy).
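
As a rough worked example of that factor of 3 (assuming ashift=12, i.e. 4 KB
allocation blocks, which is only my guess for your pool):

  one 4 KB message on raidz2 = 1 data block + 2 parity blocks = 12 KB written

so each small file already costs about three times its size on the pool,
before any flash-internal copying from garbage collection is added on top.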

A consumer SSD would normally see only very few parallel writes, and
sequential writes of full files will have a high chance of putting the data
of each file contiguously in the minimum number of erase blocks, allowing
multiple complete erase blocks to be freed when such a file is deleted and
thus obviating the need for many garbage collection copies (which occur if
data from several independent files is in one erase block).

Actual SSDs have many more cells than advertised. Some 10% to 20% may be kept
as a reserve for aging blocks that e.g. may have failed kind of a
"read-after-write test" (implemented in the write function, which adds charges
to the cells until they return the correct read-outs).

BTW: Having an ashift value that is lower than the internal write block size
may also lead to higher write amplification values, but a large ashift may
lead to more wasted capacity, which may become an issue if typical file
lengths are much smaller than the allocation granularity that results from
the ashift value.
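
A quick way to see what your pool was created with (a sketch, assuming OpenZFS
2.x on FreeBSD 13; pool and device names other than mail_dataset are
placeholders):

  zdb -C mail_dataset | grep ashift     # actual ashift of each vdev
                                        # (may need -U <path-to-zpool.cache>)
  sysctl vfs.zfs.min_auto_ashift        # lower bound applied to new vdevs

The ashift can only be chosen when a vdev is created, e.g.:

  zpool create -o ashift=12 <newpool> raidz2 <disks...>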

>> Larger mails are less of an issue since they span multiple erase blocks,
>> which will be completely freed when such a message is deleted.
>>
>> *I see I see Stefan...*
>>
>> Samsung has a lot of experience and generally good strategies to deal with
>> such a situation, but SSDs specified for use in storage systems might be
>> much better suited for that kind of usage profile.
>>
>> *Yes... and the disks for our purpose... perhaps weren't QVOs....*
>>
You should have got (much more expensive) server grade SSDs, IMHO.

But even 4 * 2 TB QVO (or better EVO) drives per each 8 TB QVO drive would
result in better performance (but would need a lot of extra SATA ports).

In fact, I'm not sure whether rotating media and a reasonable L2ARC consisting
of a fast M.2 SSD plus a mirror of small SSDs for a LOG device would not be a
better match for your use case. Reading the L2ARC would be very fast, writes
would be purely sequential and relatively slow, you could choose a suitable
L2ARC strategy (caching of file data vs. meta data), and the LOG device would
support fast fsync() operations required for reliable mail systems (which
confirm data is on stable storage before acknowledging the reception to the
sender).

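Such a layout could look roughly like this (only a sketch; pool and device
names are placeholders, not a recommendation for your exact hardware):

  zpool create newpool raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9
  zpool add newpool log mirror nvd0p1 nvd1p1     # small, fast SLOG mirror
  zpool add newpool cache nvd2                   # large M.2 SSD as L2ARC
  zfs set secondarycache=all newpool             # or =metadata, per strategy
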
>>>         We knew they had some speed issues, but we thought those speed
>>>         issues (as Samsung explains on the QVO site) started after
>>>         exceeding the speed buffer these disks have. We thought that as
>>>         long as you didn't exceed its capacity (the capacity of the speed
>>>         buffer) no speed problem would arise. Perhaps we were wrong?.
>>>
>>>
>>>     These drives are meant for small loads in a typical PC use case,
>>>     i.e. some installations of software in the few GB range, else only
>>>     files of a few MB being written, perhaps an import of media files
>>>     that range from tens to a few hundred MB at a time, but less often
>>>     than once a day.
>>>
>>>     *We move, you know... lots of little files... and lots of different
>>>     concurrent modifications by the 1500-2000 concurrent imap connections
>>>     we have...*
>>>
>> I do not expect the read load to be a problem (except possibly when the SSD
>> is moving data from SLC to QLC blocks, but even then reads will get
>> priority). But writes and trims might very well overwhelm the SSD,
>> especially when it's getting full. Keeping a part of the SSD unused
>> (excluded from the partitions created) will lead to a large pool of unused
>> blocks. This will reduce the write amplification - there are many free
>> blocks in the "unpartitioned part" of the SSD, and thus there is less
>> urgency to compact partially filled blocks. (E.g. if you include only 3/4
>> of the SSD capacity in a partition used for the ZPOOL, then 1/4 of each
>> erase block could be free due to deletions/TRIM without any compactions
>> required to hold all this data.)
>>
>> Keeping a significant percentage of the SSD unallocated is a good strategy
>> to improve its performance and resilience.
>>
>> *Well, we have allocated all the disk space... but not used... just
>> allocated.... you know... we do a zpool create with the whole disks.....*
>>
I think the only chance for a solution that does not require new hardware is
to make sure only some 80% of the SSDs are used (i.e. allocate only 80% for
ZFS, leave 20% unallocated). This will significantly reduce the rate of
garbage collections and thus reduce the load they cause.
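
For illustration only (a sketch with placeholder device names and a size you
would have to adjust; it assumes the pool is rebuilt, e.g. disk by disk):

  gpart create -s gpt ada0
  gpart add -t freebsd-zfs -a 1m -l qvo0 -s 5960G ada0   # ~80% of an 8 TB QVO
  # repeat for ada1..ada9, then build the pool from the labels:
  zpool create mail_dataset raidz2 gpt/qvo0 gpt/qvo1 gpt/qvo2 gpt/qvo3 \
      gpt/qvo4 gpt/qvo5 gpt/qvo6 gpt/qvo7 gpt/qvo8 gpt/qvo9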

I'd use a fast compression algorithm (zstd - choose a level that does not
overwhelm the CPU; there are benchmark results for ZFS with zstd, and I found
zstd-2 to be best for my use case). This will more than make up for the space
you left unallocated on the SSDs.
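
Enabling it is a one-liner (assuming OpenZFS 2.x for zstd support; only data
written after the change is compressed):

  zfs set compression=zstd-2 mail_dataset      # or lz4 on older releases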

A different mailbox format might help, too - I'm happy with dovecot's mdbox
format, which is as fast as mdir but much more efficient.

>>>     As the SSD fills, the space available for the single level write
>>>     cache gets smaller
>>>
>>>     *The single level write cache is the cache these ssd drives have, for
>>>     compensating the speed issues they have due to using qlc memory?. Do
>>>     you refer to that?. Sorry, I don't understand this paragraph well.*
>>>
>> Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC
>> cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as
>> 24 GB of data in QLC mode.
>>
>> *Ok, true.... yes....*
>>
>> A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB
>> (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).
>>
>> *Ahh! you mean that SLC capacity for speeding up the QLC disks, is obtained
>> from each single layer of the QLC?.*
>>
There are no specific SLC cells. A fraction of the QLC capable cells is
written with only 1 instead of 4 bits. This is a much simpler process, since
only 2 charge levels per cell are used, while QLC uses 16 charge levels, and
you can only add charge (must not overshoot, therefore only small increments
are added until the correct value can be read out).

But since SLC cells take away specified capacity (which is calculated assuming
all cells hold 4 bits each, not only 1 bit), their number is limited and
shrinks as demand for QLC cells grows.

The advantage of the SLC cache is fast writes, but also that data in it may
have become stale (trimmed) and thus will never be copied over into a QLC
block. But as the SSD fills and the size of the SLC cache shrinks, this
capability will be mostly lost, and lots of very short lived data is stored in
QLC cells, which will quickly become partially stale and thus need compaction
as explained above.

>> Therefore, the fraction of the cells used as an SLC cache is reduced when
>> it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC
>> cells).
>>
>> *Sorry I don't get this last sentence... I don't understand it because I
>> don't really know the meaning of tn... *
>>
>> *but I think I'm getting the idea if you say that each QLC layer has its
>> own SLC cache obtained from the disk space available for each QLC
>> layer....*
>>
>> And with fewer SLC cells available for short term storage of data, the
>> probability of data being copied to QLC cells before the irrelevant
>> messages have been deleted is significantly increased. And that will again
>> lead to many more blocks with "holes" (deleted messages) in them, which
>> then need to be copied possibly multiple times to compact them.
>>
>> *If I am correct above, I think I got the idea, yes....*
>>
>>>     (on many SSDs, I have no numbers for this
>>>     particular device), and thus the amount of data that can be
>>>     written at single cell speed shrinks as the SSD gets full.
>>>
>>>
>>>     I have just looked up the size of the SLC cache, it is specified
>>>     to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB
>>>     version, smaller models will have a smaller SLC cache).
>>>
>>>     *Assuming you were talking about the cache for compensating speed we
>>>     previously commented on, I should say these are the 870 QVO but the
>>>     8TB version. So they should have the biggest cache for compensating
>>>     the speed issues...*
>>>
>> I have looked up the data: the larger versions of the 870 QVO have the same
>> SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB
>> more if there are enough free blocks.
>>
>> *Ours is the 8TB model so I assume it could have bigger limits. The disks
>> are mostly empty, really.... so... for instance....*
>>
>> *zpool list*
>> *NAME          SIZE   ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG   CAP  DEDUP  HEALTH  ALTROOT*
>> *root_dataset  448G   2.29G   446G        -         -     1%    0%  1.00x  ONLINE  -*
>> *mail_dataset  58.2T  11.8T  46.4T        -         -    26%   20%  1.00x  ONLINE  -*
>>
Ok, it seems you have got 10 * 8 TB in a raidz2 configuration.

Only 20% of the mail dataset is in use; the situation will become much worse
when the pool fills up!

>> *I suppose fragmentation has an effect too....*
>>
On magnetic media fragmentation means that a file is spread out over the disk
in a non-optimal way, causing access latencies due to seeks and rotational
delay. That kind of fragmentation is not really relevant for SSDs, which allow
for fast random access to the cells.

And the FRAG value shown by the "zpool list" command is not about
fragmentation of files at all, it is about the structure of free space.
Either way, it is less relevant for SSDs than for classic hard disk drives.

>>>     But after writing those few GB at a speed of some 500 MB/s (i.e.
>>>     after 12 to 150 seconds), the drive will need several minutes to
>>>     transfer those writes to the quad-level cells, and will operate
>>>     at a fraction of the nominal performance during that time.
>>>     (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the
>>>     2 TB model.)
>>>
>>>     *Well, we have the 8TB model. I think I have understood what you wrote
>>>     in the previous paragraph. You said they can be fast, but not
>>>     constantly, because later they have to write all that to their
>>>     permanent storage from the cache. And that's slow. Am I wrong?. Even
>>>     in the 8TB model, you think, Stefan?.*
>>>
>> The controller in the SSD supports a given number of channels (e.g. 4),
>> each of which can access a Flash chip independently of the others. Small
>> SSDs often have fewer Flash chips than there are channels (and thus a lower
>> throughput, especially for writes), but the larger models often have more
>> chips than channels and thus the performance is capped.
>>
>> *This is totally logical. If a QVO disk performed as well as or better than
>> an Intel without consequences.... who would buy an expensive Intel
>> enterprise drive?.*
>>
The QVO is bandwidth limited due to the SATA data rate of 6 Gbit/s anyway, and
it is optimized for reads (which are not significantly slower than offered by
the TLC models). This is a viable concept for a consumer PC, but not for a
server.
>>
>> In the case of the 870 QVO, the controller supports 8 channels, which
>> allows it to write 160 MB/s into the QLC cells. The 1 TB model apparently
>> has only 4 Flash chips and is thus limited to 80 MB/s in that situation,
>> while the larger versions have 8, 16, or 32 chips. But due to the limited
>> number of channels, the write rate is limited to 160 MB/s even for the
>> 8 TB model.
>>
>> *Totally logical Stefan...*
>>
>> If you had 4 * 2 TB drives instead, the throughput would be 4 * 160 MB/s at
>> this limit.
>>
>>>     *The main problem we are facing is that in some peak moments, when the
>>>     machine serves connections for all the instances it has, and only as
>>>     said in some peak moments... like 09am or 11am.... it seems the
>>>     machine becomes slower... as if the disks weren't able to serve all
>>>     they have to serve.... In these moments, no big files are moved... but
>>>     as we have 1800-2000 concurrent imap connections... normally they are
>>>     each doing... little changes in their mailbox. Do you think perhaps
>>>     these disks are not appropriate for this kind of usage?*
>>>
>> I'd guess that the drives get into a state in which they have to recycle
>> lots of partially free blocks (i.e. perform kind of a garbage collection)
>> and then three kinds of operations are competing with each other:
>>
>>  1. reads (generally prioritized)
>>  2. writes (filling the SLC cache up to its maximum size)
>>  3. compactions of partially filled blocks (required to make free blocks
>>     available for re-use)
>>
>> Writes can only proceed if there are sufficient free blocks, which on a
>> filled SSD with partially filled erase blocks means that operations of type
>> 3 need to be performed with priority to not stall all writes.
>>
>> My assumption is that this is what you are observing under peak load.
>>
>> *It could be, although the disks are not filled.... the pools are at 20 or
>> 30% of capacity and fragmentation is 20%-30% (as zpool list states).*
>>
Yes, and that means that your issues will become much more critical over time,
when the free space shrinks and garbage collections will be required at an
even faster rate, with the SLC cache becoming less and less effective at
weeding out short lived files as an additional factor that will increase
write amplification.
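
To see whether this is what happens at those peak times, it may help to watch
per-disk latencies while the slowdown occurs, e.g. (a sketch; the -l option
assumes OpenZFS 2.x):

  zpool iostat -v mail_dataset 10     # per-vdev bandwidth/IOPS every 10 s
  zpool iostat -l mail_dataset 10     # average wait/latency per vdev
  gstat -p                            # per-disk busy % at the GEOM level
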
>>>
>>>     And cheap SSDs often have no RAM cache (not checked, but I'd be
>>>     surprised if the QVO had one) and thus cannot keep bookkeeping data
>>>     in such a cache, further limiting the performance under load.
>>>
>>>     *This brochure
>>>     (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf
>>>     and the datasheet
>>>     https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf)
>>>     says, if I have read it properly, that the 8TB drive has 8GB of ram?.
>>>     I assume that is what they call the turbo write cache?.*
>>>
>> No, the turbo write cache consists of the cells used in SLC mode (which can
>> be any cells, not only cells in a specific area of the flash chip).
>>
>> *I see I see....*
>>
>> The RAM is needed for fast lookup of the position of data for reads and of
>> free blocks for writes.
>>
>> *Our ones... seem to have 8GB of LPDDR4 ram.... as the datasheet states....*
>>
Yes, and it makes sense that the RAM size is proportional to the capacity,
since a few bytes are required per addressable data block.

If the block size were 8 KB, the RAM could hold 8 bytes (e.g. a pointer and
some status flags) for each logically addressable block. But there is no
information about the actual internal structure of the QVO that I know of.
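
As a rough plausibility check of that guess (the block and pointer sizes are
assumptions, not specified by Samsung):

  8 TB / 8 KB per block = 1e9 logical blocks
  1e9 blocks * 8 bytes  = 8 GB of mapping RAM

which happens to match the 8 GB of LPDDR4 on the 8 TB model.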

[...]
>>
>> *I see.... It's extremely misleading you know... because... you can copy
>> five mailboxes of 50GB concurrently for instance.... and you flood a
>> gigabit interface copying (obviously because the disks can keep up with
>> that throughput).... but later.... you see... you are in an hour in which
>> yesterday, and even 4 days before, you did not have any issues... and that
>> day... you see the commented issue... even without being exactly at a peak
>> hour (perhaps it is even two hours after the peak hour)... or... but I
>> wasn't aware of all the things you say in this email....*
>>
>> I have seen advice not to use compression in a high load scenario in some
>> other reply.
>>
>> I tend to disagree: Since you seem to be limited when the SLC cache is
>> exhausted, you should get better performance if you compress your data. I
>> have found that zstd-2 works well for me (giving a significant overall
>> reduction of size at reasonable additional CPU load). Since ZFS allows you
>> to switch compression algorithms at any time, you can experiment with
>> different algorithms and levels.
>>
>> *I see... you say compression should be enabled.... The main reason we have
>> not enabled it yet is to keep the system as near as possible to the config
>> defaults... you know... so as to later be able to ask on these mailing
>> lists if we have an issue... because you know... it would be far easier to
>> ask about something strange you are seeing when that strange thing is near
>> a well tested config, like the default config....*
>>
>> *But now you say, Stefan... if you switch between compression algorithms
>> you will end up with a mix of different files compressed in different
>> manners... isn't that a bit of a disaster later?. Doesn't it affect
>> performance in some manner?.*
>>
The compression used is stored in the per-file information; each file in a
dataset could have been written with a different compression method and
level. Blocks are independently compressed - a file level compression may be
more effective. Large mail files will contain incompressible attachments
(already compressed), but in base64 encoding. This should allow a compression
ratio of ~1.3. Small files will be plain text or HTML, offering much better
compression factors.
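
Once some data has been (re)written with compression enabled, the achieved
ratio can be checked per dataset, e.g.:

  zfs get -r compressratio mail_dataset     # achieved ratio per dataset
  zfs get used,logicalused mail_dataset     # physical vs. logical space used
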
>>
>> One advantage of ZFS compression is that it applies to the ARC, too. And a
>> compression factor of 2 should easily be achieved when storing mail (not
>> for .docx, .pdf, .jpg files though). Having more data in the ARC will
>> reduce the read pressure on the SSDs and will give them more cycles for
>> garbage collections (which are performed in the background and required to
>> always have a sufficient reserve of free flash blocks for writes).
>>
>> *We would use, I assume, lz4... which is the least "expensive" compression
>> algorithm for the CPU... and I assume also the best for avoiding delays
>> when accessing data... do you recommend another one?. Do you always
>> recommend compression then?.*
>>
I'd prefer zstd over lz4 since it offers a much higher compression ratio.

Zstd offers higher compression ratios than lz4 at similar or better
decompression speed, but may be somewhat slower compressing the data. But in
my opinion this is outweighed by the higher effective amount of data in the
ARC/L2ARC possible with zstd.

For some benchmarks of different compression algorithms available for ZFS,
compared to uncompressed mode, see the extensive results published by Allan
Jude:

https://docs.google.com/spreadsheets/d/1TvCAIDzFsjuLuea7124q-1UtMd0C9amTgnXm2yPtiUQ/edit?usp=sharing

The SQL benchmarks might best resemble your use case - but remember that a
significant reduction of the amount of data being written to the SSDs might be
more important than the highest transaction rate, since your SSDs put a low
upper limit on that when highly loaded.

>> I'd give it a try - and if it reduces your storage requirements by only
>> 10%, then keep 10% of each SSD unused (not assigned to any partition). That
>> will greatly improve the resilience of your SSDs, reduce the write
>> amplification, will allow the SLC cache to stay at its large value, and may
>> make a large difference to the effective performance under high load.
>>
>> *But when you enable compression... only the newly modified or added data
>> gets compressed. Am I wrong?.*
>>
Compression is per file system data block (at most 1 MB if you set the
blocksize to that value). Each such block is compressed independently of all
others, to not require more than 1 block to be read and decompressed when
randomly reading a file. If a block does not shrink when compressed (it may
contain already compressed file data) the block is written to disk as-is
(uncompressed).
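
The block size mentioned here is the dataset's recordsize property (a per
dataset setting that only affects newly written files), e.g.:

  zfs get recordsize mail_dataset
  zfs set recordsize=1M mail_dataset    # only if large sequential files dominate
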
>>
>> *By the way, we have more or less 1/4 of each disk used (12 TB allocated in
>> the pool as stated by zpool list, divided between 8 disks of 8TB...)... do
>> you think we could be suffering from write amplification and so on...
>> having so little disk space used on each disk?.*
>>
Your use case will cause a lot of garbage collections and thus particularly
high write amplification values.
>>
>> Regards, STefan
>>
>> *Hey mate, your mail is incredible. It has helped us a lot. Can we buy you
>> a cup of coffee or a beer through Paypal or similar?. Can I help you in
>> some manner?.*
>>
Thanks, I'm glad to help, and I'd appreciate hearing whether you get your
setup optimized for the purpose (and how well it holds up when you approach
the capacity limits of your drives).

I'm always interested in the experiences of users with use cases different
from mine (just being a developer with too much archived mail and media
collected over a few decades).

Regards, STefan

--------------5zqkLuBRyXvluXGRQjtNvCUK
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<html>
  <head>
    <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DUTF=
-8">
  </head>
  <body>
    <div class=3D"moz-cite-prefix">Am 07.04.22 um 14:30 schrieb <a
        class=3D"moz-txt-link-abbreviated moz-txt-link-freetext"
        href=3D"mailto:egoitz@ramattack.net">egoitz@ramattack.net</a>:<br=
>
    </div>
    <blockquote type=3D"cite"
      cite=3D"mid:e3ccbea91aca7c8870fd56ad393401a4@ramattack.net">
      <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DU=
TF-8">
      El 2022-04-06 23:49, Stefan Esser escribi=C3=B3:
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0"><!-- html ignored -->
        <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:=

          #1010ff 2px solid; margin: 0">
          <p>El 2022-04-06 17:43, Stefan Esser escribi=C3=B3:</p>
          <blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px=

            solid; margin: 0;">
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;"><br>
              Am 06.04.22 um 16:36 schrieb <a
                class=3D"moz-txt-link-abbreviated moz-txt-link-freetext"
                href=3D"mailto:egoitz@ramattack.net"
                moz-do-not-send=3D"true">egoitz@ramattack.net</a>:
              <blockquote style=3D"padding: 0 0.4em; border-left: #1010ff=

                2px solid; margin: 0;">Hi Rainer!<br>
                <br>
                Thank you so much for your help :) :)<br>
                <br>
                Well I assume they are in a datacenter and should not be
                a power outage....<br>
                <br>
                About dataset size... yes... our ones are big... they
                can be 3-4 TB easily each<br>
                dataset.....<br>
                <br>
                We bought them, because as they are for mailboxes and
                mailboxes grow and<br>
                grow.... for having space for hosting them...</blockquote=
>
              <br>
              Which mailbox format (e.g. mbox, maildir, ...) do you use?<=
/div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;">=C2=A0</div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;"><strong><span style=3D"color: #008000;">I'm
                  running Cyrus imap so sort of Maildir... too many
                  little files normally..... Sometimes directories with
                  tons of little files....</span></strong></div>
          </blockquote>
        </blockquote>
        <p>Assuming that many mails are much smaller than the erase
          block size of the SSD, this may cause issues. (You may know
          the following ...) </p>
        <p>For example, if you have message sizes of 8 KB and an erase
          block size of 64 KB (just guessing), then 8 mails will be in
          an erase block. If half the mails are deleted, then the erase
          block will still occupy 64 KB, but only hold 32 KB of useful
          data (and the SSD will only be aware of this fact if TRIM has
          signaled which data is no longer relevant). The SSD will copy
          several partially filled erase blocks together in a smaller
          number of free blocks, which then are fully utilized. Later
          deletions will repeat this game, and your data will be copied
          multiple times until it has aged (and the user is less likely
          to delete further messages). This leads to "write
          amplification" - data is internally moved around and thus
          written multiple times.</p>
        <p><br>
        </p>
        <p><strong><span style=3D"color: #0000ff;">Stefan!! you are nice!=
!
              I think this could explain all our problem. So, why we are
              having the most randomness in our performance degradation
              and that does not necessarily has to match with the most
              io peak hours... That I could cause that performance
              degradation just by deleting a couple of huge (perhaps
              200.000 mails) mail folders in a middle traffic hour
              time!!</span></strong></p>
      </blockquote>
    </blockquote>
    Yes, if deleting large amounts of data triggers performance issues
    (and the disk does not have a deficient TRIM implementation), then
    the issue is likely to be due to internal garbage collections
    colliding with other operations.<br>
    <blockquote type=3D"cite"
      cite=3D"mid:e3ccbea91aca7c8870fd56ad393401a4@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <p><strong><span style=3D"color: #0000ff;">The problem is that by=

              what I know, erase block size of an SSD disk is something
              fixed in the disk firmware. I don't really know if perhaps
              it could be modified with Samsung magician or those kind
              of tool of Samsung.... else I don't really see the manner
              of improving it... because apart from that, you are
              deleting a file in raidz-2 array... no just in a disk... I
              assume aligning chunk size, with record size and with the
              "secret" erase size of the ssd, perhaps could be slightly
              compensated?.</span></strong></p>
      </blockquote>
    </blockquote>
    <p>The erase block size is a fixed hardware feature of each flash
      chip. There is a block size for writes (e.g. 8 KB) and many such
      blocks are combined in one erase block (of e.g. 64 KB, probably
      larger in todays SSDs), they can only be returned to the free
      block pool all together. And if some of these writable blocks hold
      live data, they must be preserved by collecting them in newly
      allocated free blocks.</p>
    <p>An example of what might happen, showing a simplified layout of
      files 1, 2, 3 (with writable blocks 1a, 1b, ..., 2a, 2b, ... and
      "--" for stale data of deleted files, ".." for erased/writable
      flash blocks) in an SSD might be:</p>
    <p><font face=3D"monospace">erase block 1: |1a|1b|--|--|2a|--|--|3a|<=
/font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 2; |--|--|--|2b|--|--|--|1c|<=
/font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 3; |2c|1d|3b|3c|--|--|--|--|<=
/font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 4; |..|..|..|..|..|..|..|..|<=
/font></p>
    <p>This is just a random example how data could be laid out on the
      physical storage array. It is assumed that the 3 erase blocks once
      were completely occupied <br>
    </p>
    <p>In this example, 10 of 32 writable blocks are occupied, and only
      one free erase block exists.</p>
    <p>This situation must not persist, since the SSD needs more empty
      erase blocks. 10/32 of the capacity is used for data, but 3/4 of
      the blocks are occupied and not immediately available for new
      data.</p>
    <p>The garbage collection might combine erase blocks 1 and 3 into a
      currently free one, e.g. erase block 4:</p>
    <font face=3D"monospace">erase block 1; |..|..|..|..|..|..|..|..| </f=
ont>
    <p><font face=3D"monospace">erase block 2; |--|--|--|2b|--|--|--|1c|<=
/font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 3; |..|..|..|..|..|..|..|..|<=
/font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 4: |1a|1b|2a|3a|2c|1d|3b|3c|<=
/font></p>
    <p>Now only 2/4 of the capacity is not available for new data (which
      is still a lot more than 10/32, but better than before).</p>
    <p>Now assume file 2 is deleted:<font face=3D"monospace"><br>
      </font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 1; |..|..|..|..|..|..|..|..| =
</font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 2; |--|--|--|--|--|--|--|1c|<=
/font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 3; |..|..|..|..|..|..|..|..|<=
/font></p>
    <font face=3D"monospace"> </font>
    <p><font face=3D"monospace">erase block 4: |1a|1b|--|3a|--|1d|3b|3c|<=
/font></p>
    <p>There is now a new sparsely used erase block 4, and it will soon
      need to be garbage collected, too - in fact it could be combined
      with the live data from erase block 2, but this may be delayed
      until there is demand for more erased blocks (since e.g. file 1 or
      3 might also have been deleted by then).<br>
    </p>
    <p>The garbage collection does not know which data blocks belong to
      which file, and therefore it cannot collect the data belonging to
      a file into a single erase block. Blocks are allocated as data
      comes in (as long as enough SLC cells are available in this area,
      else directly in QLC cells). Your many parallel updates will cause
      fractions of each larger file to be spread out over many erase
      blocks.</p>
    <p>As you can see, a single file that is deleted may affect many
      erase blocks, and you have to take redundancy into consideration,
      which will multiply the effect by a factor of up to 3 for small
      files (one ZFS allocation block). And remember: deleting a message
      in mdir format will free the data blocks, but will also remove the
      directory entry, causing additional meta-data writes (again
      multiplied by the raid redundancy).<br>
    </p>
    <p>
    </p>
    <p>A consumer SSD would normally see only very few parallel writes,
      and sequential writes of full files will have a high chance to put
      the data of each file contiguously in the minimum number of erase
      blocks, allowing to free multiple complete erase blocks when such
      a file is deleted and thus obviating the need for many garbage
      collection copies (that occur if data from several independent
      files is in one erase block).<br>
    </p>
    <p>Actual SSDs have many more cells than advertised. Some 10% to 20%
      may be kept as a reserve for aging blocks that e.g. may have
      failed kind of a "read-after-write test" (implemented in the write
      function, which adds charges to the cells until they return the
      correct read-outs).</p>
    <p>BTW: Having an ashift value that is lower than the internal write
      block size may also lead to higher write amplification values, but
      a large ashift may lead to more wasted capacity, which may become
      an issue if typical file length are much smaller than the
      allocation granularity that results from the ashift value.<br>
    </p>
    <p> </p>
    <blockquote type=3D"cite"
      cite=3D"mid:e3ccbea91aca7c8870fd56ad393401a4@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <p>Larger mails are less of an issue since they span multiple
          erase blocks, which will be completely freed when such a
          message is deleted.</p>
        <p><strong><span style=3D"color: #0000ff;">I see I see Stefan...<=
/span></strong></p>
        <p>Samsung has a lot of experience and generally good strategies
          to deal with such a situation, but SSDs specified for use in
          storage systems might be much better suited for that kind of
          usage profile.</p>
        <p><strong><span style=3D"color: #0000ff;">Yes... and the disks
              for our purpose... perhaps weren't QVOs....</span></strong>=
</p>
      </blockquote>
    </blockquote>
    <p>You should have got (much more expensive) server grade SSDs,
      IMHO.</p>
    <p>But even 4 * 2 TB QVO (or better EVO) drives per each 8 TB QVO
      drive would result in better performance (but would need a lot of
      extra SATA ports).<br>
    </p>
    <p>In fact, I'm not sure whether rotating media and a reasonable
      L2ARC consisting of a fast M.2 SSD plus a mirror of small SSDs for
      a LOG device would not be a better match for your use case.
      Reading the L2ARC would be very fast, writes would be purely
      sequential and relatively slow, you could choose a suitable L2ARC
      strategy (caching of file data vs. meta data), and the LOG device
      would support fast fsync() operations required for reliable mail
      systems (which confirm data is on stable storage before
      acknowledging the reception to the sender).<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:e3ccbea91aca7c8870fd56ad393401a4@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:=

          #1010ff 2px solid; margin: 0">
          <blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px=

            solid; margin: 0;">
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;">
              <blockquote style=3D"padding: 0 0.4em; border-left: #1010ff=

                2px solid; margin: 0;">We knew they had some speed
                issues, but those speed issues, we thought (as<br>
                Samsung explains in the QVO site) they started after
                exceeding the speeding<br>
                buffer this disks have. We though that meanwhile you
                didn't exceed it's<br>
                capacity (the capacity of the speeding buffer) no speed
                problem arises. Perhaps<br>
                we were wrong?.</blockquote>
              <br>
              These drives are meant for small loads in a typical PC use
              case,<br>
              i.e. some installations of software in the few GB range,
              else only<br>
              files of a few MB being written, perhaps an import of
              media files<br>
              that range from tens to a few hundred MB at a time, but
              less often<br>
              than once a day.</div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;">=C2=A0</div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;"><strong><span style=3D"color: #008000;">We move=
,
                  you know... lots of little files... and lot's of
                  different concurrent modifications by 1500-2000
                  concurrent imap connections we have...</span></strong><=
/div>
          </blockquote>
        </blockquote>
        <p>I do not expect the read load to be a problem (except
          possibly when the SSD is moving data from SLC to QLC blocks,
          but even then reads will get priority). But writes and trims
          might very well overwhelm the SSD, especially when its getting
          full. Keeping a part of the SSD unused (excluded from the
          partitions created) will lead to a large pool of unused
          blocks. This will reduce the write amplification - there are
          many free blocks in the "unpartitioned part" of the SSD, and
          thus there is less urgency to compact partially filled blocks.
          (E.g. if you include only 3/4 of the SSD capacity in a
          partition used for the ZPOOL, then 1/4 of each erase block
          could be free due to deletions/TRIM without any compactions
          required to hold all this data.)</p>
        <p>Keeping a significant percentage of the SSD unallocated is a
          good strategy to improve its performance and resilience.</p>
        <p><strong><span style=3D"color: #0000ff;">Well, we have allocate=
d
              all the disk space... but not used... just allocated....
              you know... we do a zpool create with the whole disks.....<=
/span></strong><span
            style=3D"color: #0000ff;"></span></p>
      </blockquote>
    </blockquote>
    <p>I think the only chance for a solution that does not require new
      hardware is to make sure, only some 80% of the SSDs are used (i.e.
      allocate only 80% for ZFS, leave 20% unallocated). This will
      significantly reduce the rate of garbage collections and thus
      reduce the load they cause.</p>
    <p>I'd use a fast encryption algorithm (zstd - choose a level that
      does not overwhelm the CPU, there are benchmark results for ZFS
      with zstd, and I found zstd-2 to be best for my use case). This
      will more than make up for the space you left unallocated on the
      SSDs.</p>
    <p>A different mail box format might help, too - I'm happy with
      dovecot's mdbox format, which is as fast but much more efficient
      than mdir.<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:e3ccbea91aca7c8870fd56ad393401a4@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:=

          #1010ff 2px solid; margin: 0">
          <blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px=

            solid; margin: 0;">
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;">As the SSD fills, the space available for the
              single level write<br>
              cache gets smaller</div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;">=C2=A0</div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;"><strong><span style=3D"color: #008000;">The
                  single level write cache is the cache these ssd
                  drivers have, for compensating the speed issues they
                  have due to using qlc memory?. Do you refer to that?.
                  Sorry I don't understand well this paragraph.</span></s=
trong></div>
          </blockquote>
        </blockquote>
        <p>Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per
          cell. The SLC cache has only 1 bit per cell, thus a 6 GB SLC
          cache needs as many cells as 24 GB of data in QLC mode.</p>
        <p><strong><span style=3D"color: #0000ff;">Ok, true.... yes....</=
span></strong></p>
        <p>A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to
          700 GB (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC
          cells). </p>
        <p><strong><span style=3D"color: #0000ff;">Ahh! you mean that SLC=

              capacity for speeding up the QLC disks, is obtained from
              each single layer of the QLC?.</span></strong></p>
      </blockquote>
    </blockquote>
    <p>There are no specific SLC cells. A fraction of the QLC capable
      cells is only written with only 1 instead of 4 bits. This is a
      much simpler process, since there are only 2 charge levels per
      cell that are used, while QLC uses 16 charge levels, and you can
      only add charge (must not overshoot), therefore only small
      increments are added until the correct value can be read out).</p>
    <p>But since SLC cells take away specified capacity (which is
      calculated assuming all cells hold 4 bits each, not only 1 bit),
      their number is limited and shrinks as demand for QLC cells grows.<=
/p>
    <p>The advantage of the SLC cache is fast writes, but also that data
      in it may have become stale (trimmed) and thus will never be
      copied over into a QLC block. But as the SSD fills and the size of
      the SLC cache shrinks, this capability will be mostly lost, and
      lots of very short lived data is stored in QLC cells, which will
      quickly become partially stale and thus needing compaction as
      explained above.<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:e3ccbea91aca7c8870fd56ad393401a4@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <p>Therefore, the fraction of the cells used as an SLC cache is
          reduced when it gets full (e.g. ~1 TB in ~250 tn QLC cells,
          plus 6 GB in 6 tn SLC cells).</p>
        <p><span style=3D"color: #0000ff;"><strong>Sorry I don't get this=

              last sentence... don't understand it because I don't
              really know the meaning of tn... </strong></span></p>
        <p><span style=3D"color: #0000ff;"><strong>but I think I'm gettin=
g
              the idea if you say that each QLC layer, has it's own SLC
              cache obtained from the disk space avaiable for each QLC
              layer....</strong></span></p>
        <p>And with less SLC cells available for short term storage of
          data the probability of data being copied to QLC cells before
          the irrelevant messages have been deleted is significantly
          increased. And that will again lead to many more blocks with
          "holes" (deleted messages) in them, which then need to be
          copied possibly multiple times to compact them.</p>
        <p><strong><span style=3D"color: #0000ff;">If I correct above, I
              think I got the idea yes....</span></strong></p>
        <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:=

          #1010ff 2px solid; margin: 0">
          <blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px=

            solid; margin: 0;"><font face=3D"monospace">(on many SSDs, I
              have no numbers for this</font><br>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;">particular device), and thus the amount of
              data that can be<br>
              written at single cell speed shrinks as the SSD gets full.<=
/div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;">=C2=A0</div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;"><br>
              I have just looked up the size of the SLC cache, it is
              specified<br>
              to be 78 GB for the empty SSD, 6 GB when it is full (for
              the 2 TB<br>
              version, smaller models will have a smaller SLC cache).</di=
v>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;">=C2=A0</div>
            <div class=3D"pre" style=3D"margin: 0; padding: 0; font-famil=
y:
              monospace;"><strong><span style=3D"color: #008000;">Assumin=
g
                  you were talking about the cache for compensating
                  speed we previously commented, I should say these are
                  the 870 QVO but the 8TB version. So they should have
                  the biggest cache for compensating the speed issues...<=
/span></strong></div>
          </blockquote>
        </blockquote>
        <p>I have looked up the data: the larger versions of the 870 QVO
          have the same SLC cache configuration as the 2 TB model, 6 GB
          minimum and up to 72 GB more if there are enough free blocks.</=
p>
        <p><strong><span style=3D"color: #0000ff;">Ours one is the 8TB
              model so I assume it could have bigger limits. The disks
              are mostly empty, really.... so... for instance....</span><=
/strong></p>
        <p><strong><span style=3D"color: #0000ff;">zpool list</span></str=
ong><br>
          <strong><span style=3D"color: #0000ff;">NAME=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 SIZE=C2=A0
              ALLOC=C2=A0=C2=A0 FREE=C2=A0 CKPOINT=C2=A0 EXPANDSZ=C2=A0=C2=
=A0 FRAG=C2=A0=C2=A0=C2=A0 CAP=C2=A0 DEDUP=C2=A0
              HEALTH=C2=A0 ALTROOT</span></strong><br>
          <strong><span style=3D"color: #0000ff;">root_dataset=C2=A0 448G=
=C2=A0
              2.29G=C2=A0=C2=A0 446G=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0 -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 -=C2=A0=C2=A0=C2=A0=
=C2=A0 1%=C2=A0=C2=A0=C2=A0=C2=A0 0%=C2=A0 1.00x=C2=A0
              ONLINE=C2=A0 -</span></strong><br>
          <strong><span style=3D"color: #0000ff;">mail_dataset=C2=A0 58.2=
T=C2=A0
              11.8T=C2=A0 46.4T=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
 -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 -=C2=A0=C2=A0=C2=A0 26=
%=C2=A0=C2=A0=C2=A0 20%=C2=A0 1.00x=C2=A0
              ONLINE=C2=A0 -</span></strong></p>
      </blockquote>
    </blockquote>
    <p>Ok, seems you have got 10 * 8 TB in a raidz2 configuration.</p>
    <p>Only 20% of the mail dataset is in use, the situation will become
      much worse when the pool will fill up!</p>
    <blockquote type=3D"cite"
      cite=3D"mid:e3ccbea91aca7c8870fd56ad393401a4@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <p><strong><span style=3D"color: #0000ff;">I suppose fragmentatio=
n
              affects too....</span></strong></p>
      </blockquote>
    </blockquote>
    <p>On magnetic media fragmentation means that a file is spread out
      over the disk in a non-optimal way, causing access latencies due
      to seeks and rotational delay. That kind of fragmentation is not
      really relevant for SSDs, which allow for fast random access to
      the cells.</p>
    <p>And the FRAG value shown by the "zpool list" command is not about
      fragmentation of files at all, it is about the structure of free
      space. Anyway less relevant for SSDs than for classic hard disk
      drives.<br>
    </p>
>>>> But after writing those few GB at a speed of some 500 MB/s (i.e.
>>>> after 12 to 150 seconds), the drive will need several minutes to
>>>> transfer those writes to the quad-level cells, and will operate at
>>>> a fraction of the nominal performance during that time. (QLC writes
>>>> max out at 80 MB/s for the 1 TB model, 160 MB/s for the 2 TB model.)
>>>>
>>>> *Well, we are on the 8 TB model. I think I have understood what you
>>>> wrote in the previous paragraph. You said they can be fast, but not
>>>> constantly, because later they have to write all of that from the
>>>> cache to their permanent storage, and that is slow. Am I wrong?
>>>> Even on the 8 TB model, you think, Stefan?*
>> The controller in the SSD supports a given number of channels (e.g.
>> 4), each of which can access a flash chip independently of the
>> others. Small SSDs often have fewer flash chips than there are
>> channels (and thus lower throughput, especially for writes), but the
>> larger models often have more chips than channels, and their
>> performance is then capped by the number of channels.
>>
>> *This is totally logical. If a QVO disk performed as well as or
>> better than an Intel drive without any drawbacks... who would buy an
>> expensive Intel enterprise SSD?*
The QVO is bandwidth limited by the SATA data rate of 6 Gbit/s anyway,
and it is optimized for reads (which are not significantly slower than
those offered by the TLC models). This is a viable concept for a
consumer PC, but not for a server.
>> In the case of the 870 QVO, the controller supports 8 channels, which
>> allows it to write 160 MB/s into the QLC cells. The 1 TB model
>> apparently has only 4 flash chips and is thus limited to 80 MB/s in
>> that situation, while the larger versions have 8, 16, or 32 chips.
>> But due to the limited number of channels, the write rate is limited
>> to 160 MB/s even for the 8 TB model.
>>
>> *Totally logical, Stefan...*
>>
>> If you had 4 * 2 TB drives instead, the aggregate throughput at this
>> limit would be 4 * 160 MB/s.
>>>> *The main problem we are facing is that in some peak moments, when
>>>> the machine serves connections for all the instances it hosts, and
>>>> only, as said, in some peak moments... like 09:00 or 11:00... the
>>>> machine becomes slower, as if the disks were not able to serve all
>>>> they have to serve. In those moments no big files are moved, but as
>>>> we have 1800-2000 concurrent IMAP connections, each of them is
>>>> normally making small changes in its mailbox. Do you think these
>>>> disks are perhaps not appropriate for this kind of usage?*
>> I'd guess that the drives get into a state in which they have to
>> recycle lots of partially free blocks (i.e. perform a kind of garbage
>> collection), and then three kinds of operations are competing with
>> each other:
>>
>> 1. reads (generally prioritized)
>> 2. writes (filling the SLC cache up to its maximum size)
>> 3. compaction of partially filled blocks (required to make free
>>    blocks available for re-use)
>>
>> Writes can only proceed if there are sufficient free blocks, which on
>> a filled SSD with partially filled erase blocks means that operations
>> of type 3 need to be performed with priority in order not to stall
>> all writes.
>>
>> My assumption is that this is what you are observing under peak load.
>>
>> *It could be, although the disks are not full... the pool is at 20 or
>> 30% of capacity and fragmentation is at 20-30% (as zpool list
>> states).*
Yes, and that means that your issues will become much more critical
over time, when the free space shrinks and garbage collections have to
run at an even higher rate, with the SLC cache becoming less and less
effective at weeding out short-lived files as an additional factor that
increases write amplification.
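One way to check whether this is really what happens (not a definitive
diagnosis, just what I would look at) is to watch per-disk service
times while the slowdown is occurring; on FreeBSD, something like:

  # per-provider load and latency, refreshed every second
  gstat -p

  # per-vdev throughput plus average wait and disk latencies (OpenZFS)
  zpool iostat -v -l 5

If the disks show long write latencies at modest throughput during the
peaks, that points to the SLC-cache / garbage-collection effect
described above.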
>>>> And cheap SSDs often have no RAM cache (not checked, but I'd be
>>>> surprised if the QVO had one) and thus cannot keep bookkeeping data
>>>> in such a cache, further limiting the performance under load.
>>>>
>>>> *This brochure
>>>> (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf)
>>>> and the datasheet
>>>> (https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf)
>>>> say, if I have read them properly, that the 8 TB drive has 8 GB of
>>>> RAM? I assume that is what they call the turbo write cache?*
>> No, the turbo write cache consists of the cells used in SLC mode
>> (which can be any cells, not only cells in a specific area of the
>> flash chip).
>>
>> *I see, I see....*
>>
>> The RAM is needed for fast lookup of the position of data for reads
>> and of free blocks for writes.
>>
>> *Ours seem to have 8 GB of LPDDR4 RAM... as the datasheet states....*
Yes, and it makes sense that the RAM size is proportional to the
capacity, since a few bytes are required per addressable data block.

If the block size were 8 KB, the RAM could hold 8 bytes (e.g. a pointer
and some status flags) for each logically addressable block. But there
is no information about the actual internal structure of the QVO that I
know of.
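As a back-of-the-envelope check of that proportionality (the 8 KB
mapping granularity is purely an assumption, nothing Samsung documents):

  8 TB / 8 KB  =  8e12 B / 8192 B  ~  1e9 addressable blocks
  1e9 blocks * 8 B per entry       ~  8 GB of mapping RAM

which matches the 8 GB of LPDDR4 on the 8 TB model.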
[...]
>> *I see.... It is extremely misleading, you know... because you can
>> copy five mailboxes of 50 GB concurrently, for instance, and you
>> flood a gigabit interface while copying (obviously, because the disks
>> can sustain that throughput)... but later, at an hour where
>> yesterday, and even 4 days before, you had no issues at all, you see
>> the problem I described, even without being exactly at a peak hour
>> (perhaps two hours after the peak)... I just wasn't aware of all the
>> things you explain in this email....*
>>
>> I have seen advice in some other reply not to use compression in a
>> high-load scenario.
>>
>> I tend to disagree: since you seem to be limited when the SLC cache
>> is exhausted, you should get better performance if you compress your
>> data. I have found that zstd-2 works well for me (giving a
>> significant overall reduction in size at reasonable additional CPU
>> load). Since ZFS allows switching compression algorithms at any time,
>> you can experiment with different algorithms and levels.
>>
>> *I see... you say compression should be enabled.... The main reason
>> we have not enabled it yet is to keep the system as close as possible
>> to the config defaults... you know... so that we can later ask on
>> these mailing lists if we have an issue... because it is far easier
>> to ask about something strange you are seeing when that strange thing
>> is close to a well-tested config, like the default config....*
>>
>> *But now you say, Stefan... if you switch between compression
>> algorithms, you end up with a mix of files compressed in different
>> ways... isn't that a bit of a disaster later? Doesn't it affect
>> performance in some manner?*
The compression method used is recorded per block, so each file in a
dataset could have been written with a different compression method and
level. Blocks are compressed independently of each other, so file-level
compression might be more effective. Large mail files will contain
attachments that are already compressed (and thus incompressible
themselves), but they are stored in base64 encoding, which should still
allow a compression ratio of about 1.3. Small files will be plain text
or HTML, offering much better compression factors.
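If you want to try it, the property can be changed at any time and only
affects blocks written afterwards; existing data stays as it was
written. A minimal sketch (the dataset name is a placeholder):

  # enable zstd level 2 for newly written blocks
  zfs set compression=zstd-2 mail_dataset/mailboxes

  # check the setting and the achieved ratio (the ratio only reflects
  # data written since compression was enabled)
  zfs get compression,compressratio mail_dataset/mailboxes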
>> One advantage of ZFS compression is that it applies to the ARC, too.
>> And a compression factor of 2 should easily be achieved when storing
>> mail (not for .docx, .pdf, or .jpg files, though). Having more data
>> in the ARC will reduce the read pressure on the SSDs and will give
>> them more cycles for garbage collection (which is performed in the
>> background and is required to always keep a sufficient reserve of
>> free flash blocks for writes).
>>
>> *We would use, I assume, lz4... which is the least "expensive"
>> compression algorithm for the CPU... and, I assume, also the one that
>> avoids delays when accessing data... do you recommend another one?
>> Do you always recommend compression, then?*
I'd prefer zstd over lz4, since it offers a much higher compression
ratio.

Zstd offers higher compression ratios than lz4 at similar or better
decompression speed, but may be somewhat slower when compressing the
data. In my opinion this is outweighed by the higher effective amount
of data that fits in the ARC/L2ARC with zstd.
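Since the compressed ARC was mentioned above: on FreeBSD you can see
how much the ARC gains from compression via the arcstats kstats (the
exact sysctl names below are what I would expect from OpenZFS 2.x;
treat them as an assumption for your version):

  sysctl kstat.zfs.misc.arcstats.compressed_size \
         kstat.zfs.misc.arcstats.uncompressed_size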
For some benchmarks of the different compression algorithms available
for ZFS, compared to uncompressed mode, see the extensive results
published by Allan Jude:

https://docs.google.com/spreadsheets/d/1TvCAIDzFsjuLuea7124q-1UtMd0C9amTgnXm2yPtiUQ/edit?usp=sharing

The SQL benchmarks might best resemble your use case - but remember
that a significant reduction in the amount of data written to the SSDs
might be more important than the highest transaction rate, since your
SSDs put a low upper limit on that when highly loaded.
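Rather than relying only on published numbers, you could measure the
ratio on a copy of your own mail data; a rough sketch (dataset and
source path names are placeholders):

  # one test dataset per candidate algorithm
  zfs create -o compression=lz4    mail_dataset/test-lz4
  zfs create -o compression=zstd-2 mail_dataset/test-zstd2

  # copy a representative sample of mailboxes into each and compare
  cp -a /mail/sample/. /mail_dataset/test-lz4/
  cp -a /mail/sample/. /mail_dataset/test-zstd2/
  zfs get compressratio mail_dataset/test-lz4 mail_dataset/test-zstd2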
>> I'd give it a try - and if it reduces your storage requirements by
>> only 10%, then keep 10% of each SSD unused (not assigned to any
>> partition). That will greatly improve the resilience of your SSDs,
>> reduce the write amplification, allow the SLC cache to stay at its
>> full size, and may make a large difference to the effective
>> performance under high load.
>>
>> *But when you enable compression, only new or modified data gets
>> compressed. Am I wrong?*
Compression is applied per file system data block (at most 1 MB, if you
set the record size to that value). Each such block is compressed
independently of all others, so that no more than one block has to be
read and decompressed when randomly reading a file. If a block does not
shrink when compressed (it may contain already compressed file data),
it is written to disk as-is (uncompressed).
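Regarding the advice above to leave about 10% of each SSD unused: on a
freshly set-up drive this just means sizing the ZFS partition below the
full capacity, so the remainder is never written and stays available to
the controller. A sketch with gpart (device name and size are
placeholders; for disks already in the pool this would require
replacing and resilvering them one by one):

  gpart create -s gpt ada1
  # ~90% of an 8 TB drive; the rest is left unallocated on purpose
  gpart add -t freebsd-zfs -a 1m -s 6700G -l mail1 ada1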
>> *By the way, we have more or less 1/4 of each disk used (12 TB
>> allocated in the pool, as stated by zpool list, spread across 8 disks
>> of 8 TB each)... do you think we could be suffering from write
>> amplification, even with so little disk space used on each disk?*
Your use case will cause a lot of garbage collections and thus
particularly high write-amplification values.
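If you want to put numbers on that, smartmontools can read the drive's
own counters; on Samsung SATA SSDs the interesting attributes are
usually Wear_Leveling_Count and Total_LBAs_Written (attribute names
vary by vendor, so treat these as an assumption):

  pkg install smartmontools
  smartctl -A /dev/ada1        # device name is a placeholder

Comparing Total_LBAs_Written over time with the amount of data your
applications actually write shows how much extra the filesystem itself
writes; the flash-internal amplification is unfortunately not exposed
directly, but a fast-growing Wear_Leveling_Count hints at it.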
>> Regards, Stefan
>>
>> *Hey mate, your mail is incredible. It has helped us a lot. Can we
>> invite you to a cup of coffee or a beer through PayPal or similar?
>> Can I help you in some manner?*
Thanks, I'm glad to help, and I'd appreciate hearing whether you get
your setup optimized for this purpose (and how well it holds up when
you approach the capacity limits of your drives).

I'm always interested in the experiences of users with different use
cases than mine (I'm just a developer with too much archived mail and
media collected over a few decades).

Regards, Stefan