From: Borja Marcos <borjam@sarenet.es>
Subject: ZFS and SSD, trim caused stalling
Message-Id: <132CDFA3-0390-4208-B6D5-3F2AE49E4B47@sarenet.es>
Date: Thu, 5 May 2016 10:14:29 +0200
To: freebsd-fs <freebsd-fs@freebsd.org>

Hello,

Doing some tests with Intel P3500 NVMe drives I have found a serious
performance problem caused by the TRIM operation.

Maybe it's better not to use TRIM on these SSDs, I am not sure, but in any
case this reveals a serious performance problem which can happen with other
SSDs. I have seen comparable behavior with at least one other SSD, although
less serious. For example, trying with a 128 GB OCZ Vertex4 there was some
stalling, although that particular SSD trims at around 2 GB/s while it can
sustain a write throughput of 200 MB/s until it reaches 50% capacity,
falling to around 100 MB/s after that.

I know this is a worst-case benchmark, but operations like the deletion of
a large snapshot or a dataset could trigger similar problems.

In order to do a gross check of the I/O performance of this system, I
created a raidz2 pool with 10 NVMe drives. After creating it, I ran
Bonnie++. As a single Bonnie instance is unable to generate enough I/O
activity, I actually ran four in parallel.

Doing a couple of tests, I noticed that the second time I launched four
Bonnies the writing activity was completely stalled. Repeating a single
test I noticed this (file OneBonnie.png): the Bonnies were writing for 30
minutes, the read/write test took around 50 minutes, and the reading test
took 10 minutes more or less. But after the Bonnie processes finished, the
deletion of the files took roughly 30 minutes of heavy TRIM activity.

Running two tests, one after another, showed something far more serious.
The second group of four Bonnies was stalled for around 15 minutes while
there was heavy TRIM I/O activity. And according to the service times
reported by devstat, the stall didn't happen in the disk I/O subsystem.
Looking at the activity between 8:30 and 8:45 it can be seen that the
service time reported for the write operations is 0, which means that the
write operations aren't actually reaching the disk (files
TwoBonniesTput.png and TwoBonniesTimes.png). ZFS itself is starving the
whole vdev.
Doing some silly operations such as an "ls" was a problem as well; the
system performance was awful.

Apart from disabling TRIM, there would be two solutions to this problem:

1) Somewhat deferring the TRIM operations. Of course this implies that the
block freeing work must be throttled, which can cause its own issues.

2) Skipping the TRIMs sometimes. Depending on the particular workload and
SSD model, TRIM can be almost mandatory or just a "nice to have" feature.
In a case like this, deleting large files (four 512 GB files) has caused a
very serious impact; TRIM has done more harm than good.

The selective TRIM skipping could be based just on the number of TRIM
requests pending on the vdev queues (past some threshold the TRIM requests
would be discarded), or maybe the ZFS block freeing routines could make a
similar decision. I'm not sure where it's better to implement this.

A couple of sysctl variables could keep a counter of discarded TRIM
operations and of total "not trimmed" bytes, making it possible to know the
impact of this measure. The mechanism could be based on a static threshold
configured via a sysctl variable or, even better, ZFS could make the
decision based on the queue depth: in case write or read requests got an
unacceptable service time, the system would invalidate the TRIM requests.
A rough sketch of the idea is included below as a postscript.

What do you think? In some cases it's clear that TRIM can do more harm than
good. I think that this measure could buy the best of both worlds: TRIMming
when possible, during "normal" I/O activity, and avoiding the trouble it
causes during exceptional activity (deletion of very large files, a large
number of files, or large snapshots/datasets).

Borja.
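
P.S. To make the idea a bit more concrete, here is a minimal sketch in
plain C of what the threshold-based skipping and the two counters could
look like. This is only an illustration of the policy, not actual ZFS code;
every name in it (trim_should_issue, TRIM_PENDING_LIMIT, the trim_policy
struct, the suggested sysctl in the comments) is made up for the example,
and the real logic would have to live in the vdev queue or in the block
freeing path.

    /*
     * Illustrative sketch only -- not ZFS code.  All identifiers are
     * hypothetical.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Would be a tunable (e.g. a vfs.zfs.* sysctl) rather than a constant. */
    #define TRIM_PENDING_LIMIT 1024

    struct trim_policy {
            uint64_t pending_trims;  /* TRIMs currently queued on the vdev */
            uint64_t skipped_trims;  /* counter to export via sysctl */
            uint64_t skipped_bytes;  /* counter to export via sysctl */
    };

    /*
     * Decide whether a TRIM covering 'size' bytes should be issued or
     * silently dropped.  Dropping a TRIM is always safe for the data:
     * the blocks are already free in ZFS, the SSD simply keeps them
     * mapped a little longer.
     */
    static bool
    trim_should_issue(struct trim_policy *tp, uint64_t size)
    {
            if (tp->pending_trims >= TRIM_PENDING_LIMIT) {
                    tp->skipped_trims++;
                    tp->skipped_bytes += size;
                    return (false);  /* queue saturated: discard the TRIM */
            }
            tp->pending_trims++;     /* the caller decrements on completion */
            return (true);
    }

    int
    main(void)
    {
            struct trim_policy tp = { 0, 0, 0 };
            int i, issued = 0;

            /* Simulate a burst of frees, e.g. deleting a few huge files. */
            for (i = 0; i < 4096; i++)
                    if (trim_should_issue(&tp, 128 * 1024))
                            issued++;

            printf("issued %d TRIMs, skipped %ju (%ju bytes)\n", issued,
                (uintmax_t)tp.skipped_trims, (uintmax_t)tp.skipped_bytes);
            return (0);
    }

The same check could of course be driven by the measured read/write service
times instead of a fixed pending count, which would adapt automatically to
how well a particular SSD copes with TRIM.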