Date: Thu, 5 May 2016 10:14:29 +0200
From: Borja Marcos <borjam@sarenet.es>
To: freebsd-fs <freebsd-fs@freebsd.org>
Subject: ZFS and SSD, trim caused stalling
Message-ID: <132CDFA3-0390-4208-B6D5-3F2AE49E4B47@sarenet.es>
Hello,

Doing some tests with Intel P3500 NVMEs I have found a serious performance problem caused by the TRIM operation. Maybe it's better not to use TRIM on these SSDs, I am not sure, but in any case this reveals a serious performance problem which can happen with other SSDs. I have actually seen comparable behavior with at least one other SSD, although less serious. For example, trying with a 128 GB OCZ Vertex4, there was some stalling, although this particular SSD trims at around 2 GB/s while it can sustain a write throughput of 200 MB/s until it reaches 50% capacity, falling to around 100 MB/s after that.

I know this is very much a worst-case benchmark, but operations like the deletion of a large snapshot or a dataset could trigger similar problems.

In order to do a rough check of the I/O performance of this system, I created a raidz2 pool with 10 NVMEs. After creating it, I ran Bonnie++. As a single Bonnie instance is unable to generate enough I/O activity, I actually ran four in parallel.

Doing a couple of tests, I noticed that the second time I launched four Bonnies the writing activity was completely stalled. Repeating a single test I noticed this (file OneBonnie.png): the Bonnies were writing for 30 minutes, the read/write test took around 50 minutes, and the reading test took 10 minutes more or less. But after the Bonnie processes finished, the deletion of the files took more or less 30 minutes of heavy TRIM activity.

Running two tests, one after another, showed something far more serious. The second group of four Bonnies was stalled for around 15 minutes while there was heavy TRIM I/O activity. And according to the service times reported by devstat, the stall didn't happen in the disk I/O subsystem. Looking at the activity between 8:30 and 8:45 it can be seen that the service time reported for the write operations is 0, which means that the write operations aren't actually reaching the disks (files TwoBonniesTput.png and TwoBonniesTimes.png). ZFS itself is starving the whole vdev. Even silly operations such as an "ls" were a problem; the system performance was awful.

Apart from disabling TRIM, there would be two solutions to this problem:

1) Somewhat deferring the TRIM operations. Of course this implies that the block-freeing work must be throttled, which can cause its own issues.

2) Skipping the TRIMs sometimes. Depending on the particular workload and SSD model, TRIM can be almost mandatory or just a "nice to have" feature. In a case like this, deleting large files (four 512 GB files) has caused a very serious impact. In this case TRIM has done more harm than good.

The selective TRIM skipping could be based just on the number of TRIM requests pending on the vdev queues (past some threshold the TRIM requests would be discarded), or maybe the ZFS block-freeing routines could make a similar decision. I'm not sure where it's better to implement this.

A couple of sysctl variables could keep a counter of discarded TRIM operations and of total "not trimmed" bytes, making it possible to know the impact of this measure. And this mechanism could be based on some static threshold configured via a sysctl variable or, even better, ZFS could make a decision based on the queue depth. In case write or read requests got an unacceptable service time, the system would invalidate the TRIM requests.

What do you think?
In some cases it's clear that TRIM can do more harm than good. I think that this measure could buy the best of both worlds: TRIMming when possible, during "normal" I/O activity, and avoiding the trouble it causes during exceptional activity (deletion of very large files, large numbers of files, or large snapshots/datasets).

Borja.