Date: Thu, 5 May 2016 10:14:29 +0200
From: Borja Marcos <borjam@sarenet.es>
To: freebsd-fs <freebsd-fs@freebsd.org>
Subject: ZFS and SSD, trim caused stalling
Message-ID: <132CDFA3-0390-4208-B6D5-3F2AE49E4B47@sarenet.es>
Hello,

Doing some tests with Intel P3500 NVMe drives I have found a serious performance problem caused by the TRIM operation. Maybe it’s better not to use TRIM on these SSDs at all, I am not sure, but in any case this reveals a serious performance problem that can happen with other SSDs too. I have seen comparable behavior with at least one other SSD, although less serious: trying with a 128 GB OCZ Vertex4 there was some stalling as well, even though that particular SSD trims at around 2 GB/s while sustaining a write throughput of 200 MB/s until it reaches 50% capacity, falling to around 100 MB/s after that.

I know this is very much a worst-case benchmark, but operations such as the deletion of a large snapshot or a dataset could trigger similar problems.

In order to do a gross check of the I/O performance of this system, I created a raidz2 pool with 10 NVMe drives and ran Bonnie++ on it. As a single Bonnie instance is unable to generate enough I/O activity, I actually ran four in parallel.

After a couple of tests I noticed that the second time I launched the four Bonnies, write activity was completely stalled. Repeating a single test I observed this (file OneBonnie.png): the Bonnies were writing for 30 minutes, the read/write test took around 50 minutes, and the read test took 10 minutes, more or less. But after the Bonnie processes finished, the deletion of the files caused more or less 30 minutes of heavy TRIM activity.

Running two tests, one after the other, showed something far more serious: the second group of four Bonnies was stalled for around 15 minutes while there was heavy TRIM I/O activity. And according to the service times reported by devstat, the stall didn’t happen in the disk I/O subsystem. Looking at the activity between 8:30 and 8:45 it can be seen that the service time reported for write operations is 0, which means the writes weren’t actually reaching the disks (files TwoBonniesTput.png and TwoBonniesTimes.png). ZFS itself was starving the whole vdev. Even silly operations such as “ls” were a problem; overall system performance was awful.

Apart from disabling TRIM, there would be two solutions to this problem:

1) Somewhat deferring the TRIM operations. Of course this implies that the block-freeing work must be throttled, which can cause issues of its own.

2) Skipping TRIMs sometimes. Depending on the particular workload and SSD model, TRIM can be almost mandatory or just a “nice to have” feature. In a case like this, deleting large files (four 512 GB files) has had a very serious impact; here TRIM has done more harm than good.

The selective TRIM skipping could be based simply on the number of TRIM requests pending on the vdev queues (past some threshold new TRIM requests would be discarded), or the ZFS block-freeing routines could make a similar decision; I’m not sure where it’s better to implement this. A couple of sysctl variables could keep a count of discarded TRIM operations and total “not trimmed” bytes, making it possible to know the impact of this measure. The mechanism could rely on a static threshold configured via a sysctl variable or, even better, ZFS could make the decision based on the queue depth: if write or read requests suffered an unacceptable service time, the system would invalidate the pending TRIM requests.

What do you think? In some cases it’s clear that TRIM can do more harm than good.
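To make the idea concrete, here is a minimal user-space sketch of the skipping policy. This is only a simulation of the accounting, not ZFS or FreeBSD kernel code: the names (trim_discard_threshold, trims_pending, trims_discarded, bytes_not_trimmed) are invented for illustration, and in a real implementation the threshold and the two counters would be the sysctl variables mentioned above.

/*
 * Hypothetical sketch of the proposed TRIM-skipping policy as a
 * standalone simulation.  All names are invented for illustration;
 * none of this is actual ZFS code.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Tunable: maximum TRIM requests allowed to queue on a vdev. */
static unsigned trim_discard_threshold = 64;

/* Counters that would let an admin gauge the impact of skipping. */
static uint64_t trims_discarded;
static uint64_t bytes_not_trimmed;

/* Current number of TRIMs pending on the vdev queue (simulated). */
static unsigned trims_pending;

/*
 * Decide whether a new TRIM of 'size' bytes should be issued or
 * silently dropped.  Returns true when the TRIM is issued.
 */
static bool
issue_or_discard_trim(uint64_t size)
{
	if (trims_pending >= trim_discard_threshold) {
		trims_discarded++;
		bytes_not_trimmed += size;
		return (false);	/* skip: the queue is saturated */
	}
	trims_pending++;	/* issue: decremented on completion */
	return (true);
}

int
main(void)
{
	/* Simulate a burst of 100 1 MB TRIMs, none completing. */
	for (int i = 0; i < 100; i++)
		(void)issue_or_discard_trim(1024 * 1024);

	printf("issued: %u, discarded: %ju, bytes not trimmed: %ju\n",
	    trims_pending, (uintmax_t)trims_discarded,
	    (uintmax_t)bytes_not_trimmed);
	return (0);
}

The same decision could of course live in the vdev queue code rather than in the block-freeing routines; the sketch is only meant to show that the bookkeeping involved is trivial.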
I think this measure could buy us the best of both worlds: TRIMming when possible, during “normal” I/O activity, while avoiding the trouble TRIM causes during exceptional activity (deletion of very large files, large numbers of files, or large snapshots/datasets).

Borja.
