From: Borja Marcos <borjam@sarenet.es>
Subject: ZFS and SSD, trim caused stalling
Message-Id: <132CDFA3-0390-4208-B6D5-3F2AE49E4B47@sarenet.es>
Date: Thu, 5 May 2016 10:14:29 +0200
To: freebsd-fs <freebsd-fs@freebsd.org>

Hello,

Doing some tests with Intel P3500 NVMe drives I have found a serious
performance problem caused by the TRIM operation.

Maybe it's better not to use TRIM on these SSDs, I am not sure, but in any
case this reveals a serious performance problem which can happen with other
SSDs. I have seen comparable behavior with at least one other SSD, although
less serious. For example, trying with a 128 GB OCZ Vertex4 there was some
stalling, although that particular SSD trims at around 2 GB/s while it can
sustain a write throughput of 200 MB/s until it reaches 50% capacity,
falling to around 100 MB/s after that.

I know this is a worst-case benchmark, but operations like the deletion of
a large snapshot or a dataset could trigger similar problems.

In order to do a gross check of the I/O performance of this system, I
created a raidz2 pool with 10 NVMe drives. After creating it, I ran
Bonnie++. As a single Bonnie instance is unable to generate enough I/O
activity, I actually ran four in parallel.

Doing a couple of tests, I noticed that the second time I launched four
Bonnies the writing activity was completely stalled. Repeating a single
test I noticed this (file OneBonnie.png): the Bonnies were writing for 30
minutes, the read/write test took around 50 minutes, and the reading test
took 10 minutes more or less. But after the Bonnie processes finished, the
deletion of the files took roughly 30 minutes of heavy TRIM activity.

Running two tests, one after another, showed something far more serious.
The second group of four Bonnies was stalled for around 15 minutes while
there was heavy TRIM I/O activity. And according to the service times
reported by devstat, the stall didn't happen in the disk I/O subsystem.
Looking at the activity between 8:30 and 8:45 it can be seen that the
service time reported for the write operations is 0, which means that the
write operations aren't actually reaching the disk (files
TwoBonniesTput.png and TwoBonniesTimes.png). ZFS itself is starving the
whole vdev.
Doing some silly operations such as an "ls" was a problem as well; the
system performance was awful.

Apart from disabling TRIM, there would be two solutions to this problem:

1) Somewhat deferring the TRIM operations. Of course this implies that the
block freeing work must be throttled, which can cause its own issues.

2) Skipping the TRIMs sometimes. Depending on the particular workload and
SSD model, TRIM can be almost mandatory or just a "nice to have" feature.
In a case like this, deleting large files (four 512 GB files) has caused a
very serious impact; TRIM has done more harm than good.

The selective TRIM skipping could be based just on the number of TRIM
requests pending on the vdev queues (past some threshold the TRIM requests
would be discarded), or maybe the ZFS block freeing routines could make a
similar decision. I'm not sure where it's better to implement this.

A couple of sysctl variables could keep a counter of discarded TRIM
operations and of total "not trimmed" bytes, making it possible to know the
impact of this measure. The mechanism could be based on a static threshold
configured via a sysctl variable or, even better, ZFS could make the
decision based on the queue depth: in case write or read requests got an
unacceptable service time, the system would invalidate the TRIM requests.
A rough sketch of the idea is included below as a postscript.

What do you think? In some cases it's clear that TRIM can do more harm than
good. I think that this measure could buy the best of both worlds: TRIMming
when possible, during "normal" I/O activity, and avoiding the trouble it
causes during exceptional activity (deletion of very large files, a large
number of files, or large snapshots/datasets).

Borja.
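
P.S. To make the idea a bit more concrete, here is a minimal sketch in
plain C of what the threshold-based skipping and the two counters could
look like. This is only an illustration of the policy, not actual ZFS code;
every name in it (trim_should_issue, TRIM_PENDING_LIMIT, the trim_policy
struct, the suggested sysctl in the comments) is made up for the example,
and the real logic would have to live in the vdev queue or in the block
freeing path.

    /*
     * Illustrative sketch only -- not ZFS code.  All identifiers are
     * hypothetical.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Would be a tunable (e.g. a vfs.zfs.* sysctl) rather than a constant. */
    #define TRIM_PENDING_LIMIT 1024

    struct trim_policy {
            uint64_t pending_trims;  /* TRIMs currently queued on the vdev */
            uint64_t skipped_trims;  /* counter to export via sysctl */
            uint64_t skipped_bytes;  /* counter to export via sysctl */
    };

    /*
     * Decide whether a TRIM covering 'size' bytes should be issued or
     * silently dropped.  Dropping a TRIM is always safe for the data:
     * the blocks are already free in ZFS, the SSD simply keeps them
     * mapped a little longer.
     */
    static bool
    trim_should_issue(struct trim_policy *tp, uint64_t size)
    {
            if (tp->pending_trims >= TRIM_PENDING_LIMIT) {
                    tp->skipped_trims++;
                    tp->skipped_bytes += size;
                    return (false);  /* queue saturated: discard the TRIM */
            }
            tp->pending_trims++;     /* the caller decrements on completion */
            return (true);
    }

    int
    main(void)
    {
            struct trim_policy tp = { 0, 0, 0 };
            int i, issued = 0;

            /* Simulate a burst of frees, e.g. deleting a few huge files. */
            for (i = 0; i < 4096; i++)
                    if (trim_should_issue(&tp, 128 * 1024))
                            issued++;

            printf("issued %d TRIMs, skipped %ju (%ju bytes)\n", issued,
                (uintmax_t)tp.skipped_trims, (uintmax_t)tp.skipped_bytes);
            return (0);
    }

The same check could of course be driven by the measured read/write service
times instead of a fixed pending count, which would adapt automatically to
how well a particular SSD copes with TRIM.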