Date: Mon, 2 Apr 2018 14:36:26 +0100
From: Stilez Stilezy <stilezy@gmail.com>
To: freebsd-fs@freebsd.org
Subject: ZFS dedup write pathway - possible inefficiency or..?
Message-ID: <CAFwhr76S-hqJya5P91HsJh94SnYR-scn+cmBEUFCDpQT_gD_OA@mail.gmail.com>
Hi list,

I'm writing because of the fairly technical nature of this question about 2 possibly related issues within ZFS. The first is specific to the dedup write pathway: I've tested locally to the point where it doesn't seem to be due to inadequate hardware, and it's very consistent and specific, even under idle conditions/minimal load, so I'm wondering whether there's a code bottleneck affecting just the dedup write pathway. The second issue is that in some scenarios ZFS doesn't read from the IO buffers when I'd expect it to, causing netio problems elsewhere.

I should say that I'm aware of how intensive dedup processing is, and hopefully I'm not the usual noob asking about dedup on crap hardware that can't do the job, with data that shouldn't be anywhere near a dedup engine. That's not the case here. The system I'm testing on is built and specced to handle dedup and has a ton of dedupable data with a high ratio, so it's not an edge case: it should be ideal.

I'm not really asking for help resolving the issue. My question is aimed at understanding the bottlenecks/issues technically, so I can make intelligent decisions about how to approach them. More to the point, my gut feeling is that there is some kind of inefficiency in the dedup write pathway, and a second issue whereby ZFS isn't always reading from the network buffers as it should: the netio receive buffers periodically fail to empty during ZFS processing, it appears to be related to txg handling, and it causes netrcv buffer backup and TCP zero-window issuance within milliseconds, lasting almost continually for lengthy periods. That doesn't look right.

My main reason for posting is that if these do turn out to be genuine inefficiencies/issues, I'd like to ask whether it's sensible to put in an enhancement request on bugzilla for either or both of them. Either way this probably needs technical/dev/committer insight, as I'd like to find out (1) whether it's possible to guess what the underlying ZFS issues are, and (2) whether it's worth filing enhancement/fix requests.

*TEST HARDWARE / OS:*

- *Baseboard/CPU/RAM* = Supermicro X10 series + Xeon E5-1680 v4 (3.4+ GHz, octo-core, 20 MB cache, Broadwell generation) + 128 GB ECC @ 2400
- *Main pool hardware* = 12 enterprise 7200 SAS HDDs hanging off 2 x LSI 9311 PCIe3 HBAs, configured in ZFS as 4 x (3-way mirrors). Cache drives = Intel P3700 NVMe SLOG/ZIL (reckoned to be very good for reliable low write latency) and 250 GB Samsung NVMe (L2ARC)
- *NIC* = 10G Chelsio (if it matters)
- *Power stability* = EVGA Supernova Platinum 1600W + APC 1500VA UPS
- *OS version* = clean install of the full FreeBSD 11.1 amd64 ISO onto a wiped boot SSD mirror (tested both as a bare install and prepackaged as "FreeNAS")
- *Installed sw:* Very little running beyond the bare OS - no jails, no bhyve, no mods/patches, no custom kernel. Samba and iperf for testing across the LAN (see below).
- *Main pool data:* The pool has >22 TB capacity and is about 55% full in physical terms. The data is nicely balanced across the disks, which are almost all of the same (or very similar) performance. The data is highly dedupable - the ratio is about 4x, and judging by zdb's output (total blocks x bytes needed per block) the DDT is about 50 GB.
- *Sysctls / loader / rc:* Various sysctls - I can list them all if required. In particular, ARC metadata is allowed about 75 GB of RAM so the DDT isn't likely to be forced out, with the remainder split between the OS and other file caching (about 10 GB for the OS and about 35 GB of ARC not reserved for metadata). Significant values if needed: *vfs.zfs.arc_meta_limit* 75G, *vfs.zfs.l2arc_write_max / write_boost* 300000000 (300 MB/s), *vfs.zfs.vdev.cache.size* 200 MB, *vfs.zfs.delay_min_dirty_percent* 70. Also various tunables for efficient 10G networking, including testing with large receive buffer sizes.
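In case it saves a round trip, here's roughly how those ZFS values look in /boot/loader.conf syntax (a sketch reconstructed from the figures above - the byte counts are just my conversions of 75G and 200MB, some of these can equally be set at runtime via sysctl, and the 10G network tuning is left out):

    vfs.zfs.arc_meta_limit="80530636800"     # ~75 GiB, keep the DDT/metadata resident in ARC
    vfs.zfs.l2arc_write_max="300000000"      # 300 MB/s
    vfs.zfs.l2arc_write_boost="300000000"    # 300 MB/s
    vfs.zfs.vdev.cache.size="209715200"      # 200 MiB
    vfs.zfs.delay_min_dirty_percent="70"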
In theory, this should be a fairly powerful setup for handling the heavy workload of a small-scale dedup pool, with no parity data/RaidZ, in quiet conditions. Certainly I wasn't expecting the dedup write outcomes I'm seeing.

*TEST SETUP:*

I attached 2 x fast wiped SSDs capable of 500 MB/s+ read/write, formatted as UFS, plus an additional 3-way mirror of further enterprise 7200 SAS HDDs on the same HBAs, for testing. On the temporary HDD mirror I created a second pool, configured identically to the main pool but empty and with dedup=off. I copied a few very large files and a directory of smaller files onto the SSDs (30, 50 and 110 GB single files, plus a mixed dir of 3 MB mp3s, datasets, ISOs etc), and also copied them to twin SSDs on my directly connected workstation. I hash-checked the copies to ensure they were identical, so that dedup would probably match the blocks they contain.

Then I tested copying the files onto both the dedup and non-dedup pools, locally (CLI) and across the LAN (Samba), as well as testing raw network IO (iperf). In each case I copied the files to/from newly created empty dirs, the intent being that the dedup pool would dedup them against the existing copies and the non-dedup pool would just write them as normal. The network and server were both checked as quiet/idle apart from these copies (previous write flushing finished, all netio/diskio/CPU idle for several seconds, etc). I copied the files from SSD (client/UFS) to the dedup pool and the non-dedup pool repeatedly and in turn, to offset/minimise differences between non-cached and cached data, and to ensure that with dedup on, performance was measured when the DDT already contained entries for the blocks and they were already known to be cached in ARC. In theory, the writes would be identical other than dedup being on or off. I repeated the copies to check the results were stable. While copying I watched the common system stats (gstat, iostat, netstat, top, via SSH in multiple windows, all updating every 1 sec).

*RESULTS:*

Checking with iperf and Samba showed the system was very fast for reading from the pool and for networking (both directions): up to 1 GByte/sec both ways (duplex) on Samba, fractionally more on iperf. But when writing data, whether locally via the CLI or across the LAN with Samba, writing to the dedup pool was consistently 10x ~ 20x slower than writing to the non-dedup pool (raw file write speeds of 30~50 MB/s dedup vs 400~1000 MB/s non-dedup, as seen by the client on a 100 GB single-file transfer, before allowing for caching effects, with nothing else going on). I knew dedup would have a RAM and performance impact, but I'd expected a good CPU and good hardware to mitigate it a lot, and it wasn't being mitigated much, if at all. The impact was so large that when writing across Samba, the networking subsystem could be seen in tcpdump being driven to smaller windows and floods of tiny- and zero-windows on TCP, in order to allow *something* within ZFS a lot of extra time for write request handling.
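For reference, this is the sort of tcpdump filter that shows those zero windows on the server side (a rough sketch: the TCP window field is bytes 14-15 of the TCP header, $IF stands for the Chelsio 10G interface, port 445 is the Samba session, and the flags test just keeps RSTs out of the output):

    tcpdump -nn -i $IF 'tcp src port 445 and tcp[14:2] = 0 and (tcp[tcpflags] & tcp-rst) == 0'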
Nothing like this happened during non-dedup pool writing, or during dedup reading, of the same files. But the server's performance consistently dropped by 10x ~ 20x when writing to the dedup pool. The system should almost surely have enough RAM and a high enough standard of hardware/setup; disk IO and txgs looked about right, networking looked sane, and the issue affected only dedup writing.

The main suspect seems likely to be either CPU/threading, or an unexpectedly huge avalanche of required metadata updates. With a large amount of RAM to play with, a very fast ZIL, and plenty of disk IO caching to even out the disk traffic, it doesn't seem *that* likely to be down to metadata updates, and "top" showed most of the CPU idle, but I can't tell whether that's cause or effect. I altered a number of tunables to increase txg size / max dirty data / write coalescing, to the point that it was writing in noticeable bursts, and even so it didn't help.

I also noticed as an aside, if relevant, that where I'd expected one txg to be building up while the previous txg was being written out, that wasn't what was happening when writing across the LAN. What I saw, consistently and regularly, was that ZFS would stop pulling data off the network buffers for a lengthy number of seconds. At 10G speeds the netrcv buffer backed up within milliseconds, causing zero windows. Then, abruptly, the buffer would almost instantly empty, and this seemed to coincide with the start of a high level of HDD writing-out. I'm not sure why networking is being stalled rather than continuing smoothly - perhaps someone will know? I posted some technical details elsewhere - graphs showing the netrcv buffer fill rate (which matches, to the millisecond, what you'd see if ZFS completely stopped reading incoming data for a lengthy period), and other screenshots. If useful I'll add links in a followup.
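For what it's worth, something like this should let me line those stalls up against txg syncs - a rough sketch, assuming the dtrace fbt probes for spa_sync are available on this kernel:

    dtrace -n '
        fbt::spa_sync:entry  { self->ts = timestamp; }
        fbt::spa_sync:return /self->ts/ {
            printf("%Y  txg sync took %d ms", walltimestamp, (timestamp - self->ts) / 1000000);
            self->ts = 0;
        }'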
*DISCUSSION / QUESTIONS:*

I suspect the reason dedup writes (only) were so slow is that somewhere in the dedup write pathway - where it hashes the data, matches it against a DDT entry and optionally verifies it - something inefficient is going on that slows the entire pathway down. Perhaps it's only using a single core? Perhaps the metadata updates are more serious than I realised, or handled inefficiently? I'm not sure what's up, but it's very consistent: I've repeated this on multiple platforms and installs since the first tests. I don't know where to look further, and I probably need input from someone familiar with the internals of the ZFS subsystem to do more. I'd like to nail this down and get ideas about what's (probably) going on, and I'd like to see whether it can be improved for others by feeding back into bugzilla if helpful. So my questions are:

1. There seem to be two ZFS issues, and they're somehow linked: *(A)* the dedup write pathway suffers from what feels like an unexpectedly horrible slowdown, excessive for this scenario; *and* *(B)* ZFS seems to stop pulling data from the network receive buffer during a significant part of its processing cycle, to the point that netio is forced to a zero window for lengthy periods and much of the time, whereas I'd expect incoming network data to be processed into a new txg regardless of other processing going on, and not to cause congestion unless much more was happening. Does anyone "in the know" on the technical side have an insight into what might be going on with either of these, or suggestions for diagnostics/further info that would help pin it down?

2. I'd expect dedup writes to take *some* kind of hit from the processing required. But is the slowdown I'm seeing on dedup writes (specifically) usual *to this extent, on this class of hardware*, on an idle system with just one large file being written between 2 local file systems?

3. If either of these matters *does* turn out to be a threading issue or other clear inefficiency on the write pathway or anywhere else, is it likely to be useful if I file an enhancement request in bugzilla? After all, dedup is incredibly useful in a small number of scenarios, and if a server with this hardware is struggling this much under a single-user load, it would be interesting to know the technical point where it's happening. (Equally, I guess many people advocate not using ZFS dedup in almost any scenario, because end users inevitably run it on completely inadequate hardware or on totally unsuitable data, so perhaps it's a pathologised area with little patience and a "don't expect much to be done now it's stable" attitude!)

Anyhow, hoping for an insightful reply!

Thank you
Stilez