Date:      Tue, 17 May 2016 11:27:05 +0200
From:      Borja Marcos <borjam@sarenet.es>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: ZFS and NVMe, trim caused stalling
Message-ID:  <87668F1E-D165-4195-9DB0-4764038FC075@sarenet.es>
In-Reply-To: <20a155fd-8695-ca42-6a72-32cb78864a22@multiplay.co.uk>
References:  <5E710EA5-C9B0-4521-85F1-3FE87555B0AF@bsdimp.com> <BD7424F9-2968-410D-8146-27496054BCFA@sarenet.es> <20a155fd-8695-ca42-6a72-32cb78864a22@multiplay.co.uk>


> On 17 May 2016, at 11:09, Steven Hartland <killing@multiplay.co.uk> wrote:
>
>> I understand that, but I don't think it's a good idea that ZFS depends
>> blindly on a driver feature such as that. Of course, it's great to
>> exploit it.
>>
>> I have also noticed that ZFS has a good throttling mechanism for write
>> operations. A similar mechanism should throttle trim requests so that
>> trim requests don't clog the whole system.
> It already does.

I see that there's a limit to the number of active TRIM requests, but not
an explicit delay such as the one applied to write requests. So, even with
a maximum of a single active TRIM request, it seems that TRIM wins.
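
For reference, this is how I have been checking the TRIM knobs: a minimal
sketch using sysctlbyname(3). The sysctl names are my assumption of what
the current ZFS TRIM code exposes (vfs.zfs.trim.enabled,
vfs.zfs.trim.txg_delay and vfs.zfs.vdev.trim_max_active), so verify them
on your kernel with sysctl(8) before relying on them:

/*
 * Minimal sketch: dump the ZFS TRIM tunables via sysctlbyname(3).
 * Build on FreeBSD with: cc -o trimtun trimtun.c
 * The sysctl names are assumptions; if one is missing on your
 * kernel the program just says so.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

static void
show(const char *name)
{
    int val;
    size_t len = sizeof(val);

    if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
        printf("%-32s <not present>\n", name);
    else
        printf("%-32s %d\n", name, val);
}

int
main(void)
{
    show("vfs.zfs.trim.enabled");         /* global TRIM on/off */
    show("vfs.zfs.trim.txg_delay");       /* TXGs a free is deferred */
    show("vfs.zfs.vdev.trim_max_active"); /* concurrent TRIMs per vdev */
    return (0);
}

As noted above, even clamping trim_max_active down to a single request
does not seem to prevent the stall here.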


>>
>>> I'd be extremely hesitant to toss away TRIMs. They are actually quite
>>> important for the FTL in the drive's firmware to properly manage NAND
>>> wear. More free space always reduces write amplification. It tends to
>>> go as 1 / freespace, so simply dropping them on the floor should be
>>> done with great reluctance.
>> I understand. I was wondering about choosing the lesser of two evils: a
>> 15-minute I/O stall (I deleted 2 TB of data, which is a lot, but not so
>> unrealistic) or setting trims aside during the peak activity.
>>
>> I see that I was wrong on that, as a throttling mechanism would probably
>> be more than enough, unless the system is close to running out of space.
>>
>> I've filed a bug report anyway. And copying to -stable.
>>
>>
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209571
>>
>>=20
> TBH it sounds like you may have badly behaved HW; we've used ZFS + TRIM
> for years on large production boxes, and while we've seen slowdowns we
> haven't experienced the total lockups you're describing.

I have been using ZFS+TRIM on SATA SSDs for a very long time. Actually, a
single SSD I tried at home can TRIM at around 2 GB/s.
Warner Losh told me that the nvd driver does not currently coalesce TRIMs,
which is a disadvantage compared to the ada driver, which does.
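
To illustrate what coalescing buys (purely a sketch of the idea, in the
style of what I understand ada does; none of this is the actual driver
code): adjacent or overlapping LBA ranges get merged so that one large
TRIM replaces many small ones.

/*
 * Sketch of TRIM range coalescing. Types and names are hypothetical,
 * not taken from the ada or nvd drivers.
 */
#include <stdint.h>
#include <stdlib.h>

struct trim_range {
    uint64_t lba;    /* starting LBA */
    uint64_t nblks;  /* length in blocks */
};

static int
cmp_lba(const void *a, const void *b)
{
    const struct trim_range *ra = a, *rb = b;

    return ((ra->lba > rb->lba) - (ra->lba < rb->lba));
}

/*
 * Sort by LBA, then merge ranges that touch or overlap; returns the
 * new count. E.g. {0,8},{8,8},{32,4} becomes {0,16},{32,4}.
 */
static size_t
coalesce_trims(struct trim_range *r, size_t n)
{
    size_t i, out;

    if (n == 0)
        return (0);
    qsort(r, n, sizeof(*r), cmp_lba);
    for (out = 0, i = 1; i < n; i++) {
        if (r[i].lba <= r[out].lba + r[out].nblks) {
            uint64_t end = r[i].lba + r[i].nblks;

            if (end > r[out].lba + r[out].nblks)
                r[out].nblks = end - r[out].lba;
        } else {
            r[++out] = r[i];
        }
    }
    return (out + 1);
}

Run over a queue of pending BIO_DELETEs, something like this would turn
the thousands of small, contiguous trims from a big file delete into a
handful of large ones.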


> The graphs on your ticket seem to indicate a peak throughput of 250 MB/s,
> which is extremely slow for standard SSDs, let alone NVMe ones, and when
> you add in the fact that you have 10 of them, it seems like something is
> VERY wrong.

The pool is a raidz2 vdev with 10 P3500 NVMe disks. That graph is the
throughput of just one of the disks (the other 9 graphs are identical).
Bonnie++ reports around 1.7 GB/s writing "intelligently", 1 GB/s
"rewriting" and almost 2 GB/s "reading intelligently", which, as far as I
know, is more or less reasonable.

The really slow part is the TRIM requests issued when the files are
destroyed (it's four concurrent bonnie++ tasks writing a total of 2 TB).

> I just did a quick test on our DB box here, creating and then deleting a
> 2G file as you describe, and I couldn't even spot the delete in the
> general noise, it was so quick to process; and that's a 6 disk machine
> with P3700's.

Totalling 2 TB? In my case it was FOUR files, 512 GB each.

I'm really puzzled.




Borja.





