Date: Fri, 21 Dec 2018 13:18:55 +0000
From: Stilez Stilezy <stilezy@gmail.com>
To: freebsd-fs@freebsd.org
Subject: Re: Dev help requested: short DTrace instrumentation script to narrow down cause of a longstanding ZFS write issue
Message-ID: <CAFwhr748pVWL7vVEhRyGAaNU84YZ-yqoHyCCeCLeVpbhrwapvw@mail.gmail.com>
In-Reply-To: <CAFwhr76CXkgUUUnbYHBby2KjDZ7DkLN-UTyGWbv=Bv12dWpcgQ@mail.gmail.com>
References: <CAFwhr76CXkgUUUnbYHBby2KjDZ7DkLN-UTyGWbv=Bv12dWpcgQ@mail.gmail.com>
To add some more data about this issue: I just tried the suggestions offered by Delphix for write throttle problems <https://www.delphix.com/blog/delphix-engineering/tuning-openzfs-write-throttle> (specifically the last part, about the case where ZFS genuinely can't keep up). I used the same setup as above (same hardware, clean FreeNAS 11.2, a single SMB client writing a single 100GB file to the pool, on an otherwise-idle 10G LAN). In line with the guide, I modified only a few sysctls (everything else left at FreeNAS 11.2 defaults): (vfs.zfs.)*delay_scale*, *delay_min_dirty_percent* and *async_write_active_min/max_dirty_percent*. The usual sysctl for Chelsio max MTU 4096 under 11.2-REL was in use, and RNG data harvesting from the 10G NIC was disabled. Apart from that, ZFS sysctls were at their defaults. Hopefully this gives some ideas about what's up.

I kept it very simple. I started by looking at dirty data writeout patterns before making any sysctl changes. This is the output of the well-known combined dirty.d + duration.d script as the file transfer began on the client. It seems to show matters getting out of hand with dirty data writeout almost immediately, by the 4th TXG sync:

dtrace -s dirty_and_duration.d tank
dtrace: script 'dirty_and_duration.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  6  63989   none:txg-synced     0MB of 4096MB synced in 0.47 seconds
  6  63989   none:txg-synced     0MB of 4096MB synced in 0.00 seconds
  6  63989   none:txg-synced     0MB of 4096MB synced in 0.00 seconds
  5  63989   none:txg-synced    64MB of 4096MB synced in 0.76 seconds   <-- copy begins
  1  63989   none:txg-synced   305MB of 4096MB synced in 0.32 seconds
  2  63989   none:txg-synced   438MB of 4096MB synced in 2.13 seconds
  1  63989   none:txg-synced   819MB of 4096MB synced in 7.88 seconds   <-- 4th sync (see below)

By the 4th sync, syncs are already taking too long for 5-second TXGs to keep up. And 7.9 seconds to write out ~820MB is a write-out rate of about 104MB/sec, on a pool that is 43% empty and can write to 4 x 7200rpm mirrors simultaneously, without any need to calculate parity data. Something has to be wrong.

Having no idea why writeout is so incredibly slow, I first tried to get it to write consistently, without backlogging/stalling the file transfer. Following the guide, I started by increasing *delay_scale* (default value 500K). The Delphix guide says to try doubling it or more. I took it up to 1M, then 2M, then 5M, then 50M - a factor of 100x - and the dtrace output showed that even at that scaling level, ZFS was *still* needing to impose delays in the 50M-100M range most of the time, and was *still* unable to prevent the issue:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'
dtrace: description 'delay-mintime' matched 1 probe
^C

           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |                                         1
             512 |                                         0
            1024 |                                         1
            2048 |                                         1
            4096 |                                         4
            8192 |                                         6
           16384 |                                         13
           32768 |                                         26
           65536 |                                         51
          131072 |                                         102
          262144 |                                         203
          524288 |@                                        399
         1048576 |@                                        775
         2097152 |@@                                       1453
         4194304 |@@@@                                     2603
         8388608 |@@@@@@                                   4243
        16777216 |@@@@@                                    4048
        33554432 |@@@@@@                                   4485
        67108864 |@@@@@@@@@@@@@@@                          11208
       134217728 |                                         0

Next, I tried the other approach from the Delphix guide: taking *delay_scale* back down to a saner value and reducing the dirty data write-out percentages instead. I finally got past the problem that way - but I had to take *delay_min_dirty_percent* and *vdev.async_write_active_max_dirty_percent* down to 15%, and *delay_scale* up to 2M, to do it. Those are insanely restrictive values for an enterprise server handling a single file transfer.
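For reference, the combined dirty.d + duration.d script used here (and again further down) is essentially the two scripts from the linked Delphix posts merged into one. This is a from-memory sketch rather than a verbatim copy of what I ran - it omits the "[At timestamp: ...]" prefix that appears in the later runs, and the probe and field names (txg-syncing, txg-synced, dp_dirty_total, zfs_dirty_data_max) are as given in those posts:

#!/usr/sbin/dtrace -s
/* dirty_and_duration.d (sketch): for the pool named in $$1, report how much
 * dirty data each TXG carried and how long the sync took. */

txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        /* record dirty bytes and start time as this TXG begins syncing */
        dirty = ((dsl_pool_t *)arg0)->dp_dirty_total;
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("%4dMB of %4dMB synced in %d.%02d seconds",
            dirty / 1024 / 1024, `zfs_dirty_data_max / 1024 / 1024,
            this->d / 1000000000, this->d / 10000000 % 100);
}

It is run as shown above, with the pool name as the script argument.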
I also checked disk I/O. As shown below, all disks seem to be performing roughly equally, which suggests the issue isn't the HBA or a "bad" HDD:

# zpool iostat -T d -v tank 1

                                           capacity     operations    bandwidth
pool                                     alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
tank                                     14.5T  10.9T      0  52.3K      0   210M
  mirror                                 4.12T  3.13T      0  8.60K      0  34.7M
    da6                                      -      -      0    778      0  35.1M
    da5                                      -      -      0    945      0  35.8M
    da7                                      -      -      0  1.01K      0  36.8M
  mirror                                 3.83T  1.61T      0  20.7K      0  82.9M
    da8                                      -      -      0  5.01K      0  83.0M
    da9                                      -      -      0  5.11K      0  83.0M
    da10                                     -      -      0  3.14K      0  82.9M
  mirror                                 3.81T  1.62T      0  12.2K      0  48.8M
    da4                                      -      -      0  2.17K      0  48.8M
    da11                                     -      -      0  2.13K      0  50.7M
    da2                                      -      -      0  2.73K      0  50.7M
  mirror                                 2.69T  4.57T      0  10.8K      0  43.6M
    da0                                      -      -      0  1.26K      0  45.5M
    da1                                      -      -      0  1.29K      0  43.7M
    da3                                      -      -      0  1.50K      0  45.8M
--------------------------------------  -----  -----  -----  -----  -----  -----

I gather these stats aren't entirely reliable, but if the ops and bandwidth columns are any indication of the average bytes per write, that also suggests something odd is going on. The vdevs comprise matched enterprise 7200rpm HDDs, all new, with the top two vdevs having slightly faster HDDs than the bottom two. Writing a single large sequential file to a 57%-full pool, data should be written to all four vdevs at once, so the ideal ceiling is roughly 4 x single-HDD speed, say 400-800MB/sec (the HDDs individually benchmark at around 180-200MB/sec sequential write).

I double-checked by testing local write speed, taking the LAN out of the picture, using a 5GB test file pre-created on a different device (so the source data was already in ARC/L2ARC and reading it shouldn't slow the write down). I copied the file to three destinations, to allow the effects to be separated: */dev/null* (no disk writes involved, to isolate the time taken just to read the source), *the boot pool* (ZFS on mirrored, slow-ish SATA SSDs, 54% full), and *the same pool as above*. I also used dtrace to watch writeouts, as above, to compare with the writeout patterns seen during the LAN file transfers.
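Incidentally, on the "average bytes per write" point above: I assume the I/O sizes actually hitting the disks could be checked directly rather than inferred from zpool iostat counters, either with iostat -x -w 1 (dividing write bandwidth by write ops) or with a throwaway one-liner along these lines. This is only a sketch: it lumps all devices and reads/writes together (which shouldn't matter much during a pure write test), and it assumes b_bcount in the io provider's bufinfo_t is the request size:

# dtrace -n 'io:::start { @["I/O size (bytes)"] = quantize(args[0]->b_bcount); }'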
I can't decide whether this shows the entire time taken is due to /dev/random, or whether it shows the problem is still there and hasn't actually gone away, despite the apparently smooth LAN dataflow achieved above:

# dd if=/dev/random of=/TEST0.DAT bs=1048576 count=5120      <-- saved to boot pool
# dd if=/dev/random of=TEST1.DAT bs=1048576 count=5120       <-- saved to tank

# /usr/bin/time -h cp /TEST0.DAT /dev/null                   <-- read speed for boot pool file
        1.25s real              0.00s user              1.25s sys

# /usr/bin/time -h cp /TEST0.DAT TEST0-COPY1.DAT             <-- copy boot pool file to tank
        2m30.34s real           0.00s user              2.29s sys

  1  63989   none:txg-synced  [At timestamp: 9507491.794 sec]    0MB of 476MB synced in  0.00 seconds
  2  63989   none:txg-synced  [At timestamp: 9507496.057 sec]   64MB of 476MB synced in  0.88 seconds
  7  63989   none:txg-synced  [At timestamp: 9507499.424 sec]  476MB of 476MB synced in  3.36 seconds
  2  63989   none:txg-synced  [At timestamp: 9507515.465 sec]  476MB of 476MB synced in 16.04 seconds
  4  63989   none:txg-synced  [At timestamp: 9507532.055 sec]  476MB of 476MB synced in 16.59 seconds
  4  63989   none:txg-synced  [At timestamp: 9507548.635 sec]  476MB of 476MB synced in 16.58 seconds
  5  63989   none:txg-synced  [At timestamp: 9507564.808 sec]  476MB of 476MB synced in 16.17 seconds
  4  63989   none:txg-synced  [At timestamp: 9507580.779 sec]  476MB of 476MB synced in 15.97 seconds
  7  63989   none:txg-synced  [At timestamp: 9507598.004 sec]  476MB of 476MB synced in 17.22 seconds
  2  63989   none:txg-synced  [At timestamp: 9507614.247 sec]  476MB of 476MB synced in 16.24 seconds
  4  63989   none:txg-synced  [At timestamp: 9507630.392 sec]  476MB of 476MB synced in 16.14 seconds
  4  63989   none:txg-synced  [At timestamp: 9507646.723 sec]  476MB of 476MB synced in 16.33 seconds
  3  63989   none:txg-synced  [At timestamp: 9507657.185 sec]  300MB of 476MB synced in 10.46 seconds
  3  63989   none:txg-synced  [At timestamp: 9507657.218 sec]    0MB of 476MB synced in  0.03 seconds   [completed]

# /usr/bin/time -h cp TEST1.DAT /dev/null                    <-- read speed for tank file
        1.21s real              0.00s user              1.20s sys

# /usr/bin/time -h cp TEST1.DAT /TEST1-COPY1.DAT             <-- copy tank file to boot pool
        2m0.98s real            0.00s user              2.57s sys

  2  63989   none:txg-synced  [At timestamp: 9507827.131 sec]    0MB of 476MB synced in  0.01 seconds
  0  63989   none:txg-synced  [At timestamp: 9507831.391 sec]   64MB of 476MB synced in  1.59 seconds
  2  63989   none:txg-synced  [At timestamp: 9507840.045 sec]  384MB of 476MB synced in  8.65 seconds
  2  63989   none:txg-synced  [At timestamp: 9507840.045 sec]   62MB of 476MB synced in  0.00 seconds
  0  63989   none:txg-synced  [At timestamp: 9507841.743 sec]   64MB of 476MB synced in  1.69 seconds
  2  63989   none:txg-synced  [At timestamp: 9507850.117 sec]  368MB of 476MB synced in  8.37 seconds
  2  63989   none:txg-synced  [At timestamp: 9507850.118 sec]   60MB of 476MB synced in  0.00 seconds
  1  63989   none:txg-synced  [At timestamp: 9507851.751 sec]   62MB of 476MB synced in  1.63 seconds
  7  63989   none:txg-synced  [At timestamp: 9507859.728 sec]  351MB of 476MB synced in  7.97 seconds
  7  63989   none:txg-synced  [At timestamp: 9507859.728 sec]   58MB of 476MB synced in  0.00 seconds
  2  63989   none:txg-synced  [At timestamp: 9507861.298 sec]   61MB of 476MB synced in  1.56 seconds
  5  63989   none:txg-synced  [At timestamp: 9507868.970 sec]  336MB of 476MB synced in  7.67 seconds
  5  63989   none:txg-synced  [At timestamp: 9507868.971 sec]   56MB of 476MB synced in  0.00 seconds
  2  63989   none:txg-synced  [At timestamp: 9507870.507 sec]   60MB of 476MB synced in  1.53 seconds
  6  63989   none:txg-synced  [At timestamp: 9507877.644 sec]  322MB of 476MB synced in  7.13 seconds
  6  63989   none:txg-synced  [At timestamp: 9507877.664 sec]   55MB of 476MB synced in  0.02 seconds
  4  63989   none:txg-synced  [At timestamp: 9507879.197 sec]   96MB of 476MB synced in  1.53 seconds
  [same pattern until completed]

Hmm. 16-17 seconds on average to write out 476MB that's already in ARC/L2ARC from a previous write? 150 seconds to write out a simple 5GB file? That's about 34MB/sec effective copying speed, copying a file locally from ARC/L2ARC to a pool of 4 fast striped mirrors that should be able to write at 600-800MB/sec. It does confirm, however, that the issue isn't LAN-related, since it happens even with local file writing. The boot pool isn't much faster; although it's SSD, it has only one mirror vdev, so it starts with a 4x speed disadvantage. Very, very wrong-looking.

Finally, I returned to dirty.d/duration.d to compare the later TXG writeout patterns (using the new, much tighter settings that no longer stall LAN writes) with those from the initial test run. This time we can see it is handling the transfer fine - but it also shows again how much performance had to be throttled back to achieve that:

# dtrace -s dirty_and_duration.d tank
dtrace: script 'dirty_and_duration.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3  63989   none:txg-synced  [At timestamp: 9503423.244 sec]    0MB of 476MB synced in 0.00 seconds
  4  63989   none:txg-synced  [At timestamp: 9503429.056 sec]    0MB of 476MB synced in 0.50 seconds
  3  63989   none:txg-synced  [At timestamp: 9503431.789 sec]   64MB of 476MB synced in 0.69 seconds
  7  63989   none:txg-synced  [At timestamp: 9503433.114 sec]  280MB of 476MB synced in 1.32 seconds
  0  63989   none:txg-synced  [At timestamp: 9503435.103 sec]  476MB of 476MB synced in 1.98 seconds
  2  63989   none:txg-synced  [At timestamp: 9503436.962 sec]  476MB of 476MB synced in 1.85 seconds
  6  63989   none:txg-synced  [At timestamp: 9503438.810 sec]  470MB of 476MB synced in 1.84 seconds
  3  63989   none:txg-synced  [At timestamp: 9503440.791 sec]  474MB of 476MB synced in 1.98 seconds
  0  63989   none:txg-synced  [At timestamp: 9503442.726 sec]  476MB of 476MB synced in 1.93 seconds
  2  63989   none:txg-synced  [At timestamp: 9503445.287 sec]  476MB of 476MB synced in 2.56 seconds
  2  63989   none:txg-synced  [At timestamp: 9503447.193 sec]  476MB of 476MB synced in 1.90 seconds
  2  63989   none:txg-synced  [At timestamp: 9503449.566 sec]  476MB of 476MB synced in 2.37 seconds
  3  63989   none:txg-synced  [At timestamp: 9503451.476 sec]  476MB of 476MB synced in 1.91 seconds
  2  63989   none:txg-synced  [At timestamp: 9503453.207 sec]  476MB of 476MB synced in 1.73 seconds
  4  63989   none:txg-synced  [At timestamp: 9503454.983 sec]  411MB of 476MB synced in 1.77 seconds
  3  63989   none:txg-synced  [At timestamp: 9503456.860 sec]  476MB of 476MB synced in 1.87 seconds
  3  63989   none:txg-synced  [At timestamp: 9503458.820 sec]  473MB of 476MB synced in 1.95 seconds
  3  63989   none:txg-synced  [At timestamp: 9503460.682 sec]  476MB of 476MB synced in 1.86 seconds
  4  63989   none:txg-synced  [At timestamp: 9503462.676 sec]  476MB of 476MB synced in 1.99 seconds
  0  63989   none:txg-synced  [At timestamp: 9503464.346 sec]  476MB of 476MB synced in 1.66 seconds

Crunching the numbers, the average write-out transfer rate is 242-245MB/sec (411-476MB in ~1.8 sec for individual TXGs, and 7064MB in 29.2 sec overall). It also looks like something has hit a limit: TXGs are being written out about every 1.8 seconds, and each writeout is also taking about 1.8 seconds.
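If it's useful, I assume that ~1.8 second cadence could be cross-checked directly with a throwaway one-liner along these lines, which quantizes the gap between successive txg-synced firings (note it doesn't filter by pool, so the boot pool's TXGs would land in the same aggregation):

# dtrace -n 'txg-synced /last/ { @["ms between txg-synced"] = quantize((timestamp - last) / 1000000); }
             txg-synced { last = timestamp; }'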
Put another way: to get to the point where ZFS was inserting enough delay that the file transfer was consistent and no longer backlogging and stalling the NIC buffer for tens of seconds or minutes at a time, I had to push the overall file transfer down to about 240MB/sec - which, for a single client writing a single file to an enterprise server able to write to 4 vdevs in parallel, is crazy low.

I don't know if this helps, but I feel it's useful to convey how badly my server's ZFS async write pathway is writing out, even under (AFAIK) reasonable/default configuration, with hardware well suited to the load and fast striped mirrors, and why I think it's indicative of something deeper that I'm not seeing - whether bad sysctls or a dirty data bug, as per previous replies when the issue was raised earlier on this list. This is why I want to dig "under the hood" with dtrace, but I need help writing the short instrumentation scripts required to narrow down where it's going wrong internally (I've appended a very rough sketch of the kind of thing I have in mind, after the quoted message below).

Help kindly appreciated!

Stilez

On Fri, 21 Dec 2018 at 03:18, Stilez Stilezy <stilezy@gmail.com> wrote:
> I'm experiencing a persistent timeout problem that only occurs during network writes across a 10G LAN to a ZFS pool. Work to date confirms it's a real server-side issue, arising within ZFS rather than the client, the network hardware, or FreeBSD's network I/O handling, and it involves either ZFS mishandling the freeing-up of dirty data (there are bugs logged for this) or ZFS not correctly delaying a too-fast write source. Of course it could also be mis-set sysctls, but I've replicated it with a default FreeNAS setup, so if that's the cause then the defaults would have to be misconfigured too.
>
> To save time, can I skip the detailed debugging done to date and just say that this is where I've got to so far. I asked about this previously, long ago, and it was suggested at the time that it could be a dirty data bug. With more experience under my belt, I'd like to dig deeper and try to narrow down the point that is bottlenecking, using dtrace.
>
> Specifically, I'd like help with a short instrumentation script to capture four key events within the ZFS write data pathway, in order to pin down which part of the ZFS data flow is implicated in the bottleneck. I need help because it requires detailed knowledge of the ZFS codebase. I get the impression it could be a very short script - if I can narrow it down, maybe I can find a fix, or at least produce better evidence of the exact point where the bug/issue is arising.
>
> *Symptoms:*
>
> I'm using ZFS on enterprise hardware (Supermicro, Xeon E5-1620 v4, 128GB ECC, Intel P3700 NVMe SLOG/ZIL, 7200rpm mirrored enterprise SAS3 HDD pool, 9311 SAS3 HBA in IT mode, Chelsio NICs, 10G fibre LAN, Netgear managed SFP+ switch).
>
> With a quiesced "known good" 10G network and just a single SMB, iSCSI or netcat client (the same happens on all 3 protocols), with that client doing a single 100GB file write, I can see the free space in the network RCV buffer gradually being driven down to zero without getting time to recover much, causing the TCP window to drop to << 80 bytes in tcpdump as it tries to cope, and eventually (when the TCP window is driven to zero often enough) the file write session fails due to the persistent zero-window condition. Jumps in the network RCV queue seem to coincide with the moments just before disk activity in the HDD pool, adding to the evidence that it's related to the ZFS write pathway.
>
> All the evidence suggests the network and client sides aren't the issue.
> What's happening is that ZFS is either misconfigured or hitting a bug: it's not pulling new data into the next TXG in a timely manner, and it's not delaying the writer gracefully. File data is coming into the NIC RCV buffer at 10G speed, faster than ZFS is allowing it to be removed, and ZFS isn't effectively signalling the client to slow down or delay.
>
> *dtrace:*
>
> The most direct route to narrowing this down would be to instrument and log the timestamps, sizes and time taken for (I think!) 4 events. I don't know the exact names used for these calls, so I have to describe their effect. Help with identifying appropriate functions/instrumentation points, even if it's not clear how to instrument them, would be great!
>
> - There should be a function that gets called/polled when ZFS has space in its write pipeline; it checks whether any ongoing write has a chunk of data to be written, and if so, pulls data from the source stream/pipe/descriptor (in this case the network RCV buffer, but it could be any source of an ongoing file write). This would let me instrument the events where ZFS requests/gets source data for an ongoing file write, in this case pulling it from the network buffer.
>   *To instrument: timestamp, max bytes to read, actual bytes read*
>
> - Within the ZFS code, the write pathway involves (simplistically) newly received write data and any related metadata/consequent updates being added to the current TXG and the SLOG/ZIL, and ultimately being written out to the pool.
>   *To instrument, for those 3 events: timestamp, bytes written, write-out time taken (ideally also the dirty data cache size before/after)*
>
> Although the output file will be huge, it should be easy to analyse the timestamped list of events + dataflow and see which of the last 3 processes is causing the bottleneck.
>
> I would enjoy doing the work of tracking this down, but I need a little help with the instrumentation aspect, which really needs developer-level knowledge of the code.
>
> Could someone kindly help with an appropriate short instrumentation script to capture the raw data for these 4 items :)
>
> (And any other data that, in your experience, would be useful.)
>
> Thank you!
>
> Stilez
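P.S. To make the request above more concrete, here is a very rough first cut of the kind of script I imagine. Emphasis on rough: the fbt probe points and argument indexes are my guesses from reading around (dmu_write_uio_dbuf() as a common path for file write data entering the DMU, with the requested size as arg2; zil_commit() for ZIL commits), and those guesses are exactly what I'd like someone who knows the code to correct. The delay-mintime and txg-syncing/txg-synced probes are the ones already used above.

#!/usr/sbin/dtrace -s
/* ROUGH SKETCH ONLY - probe names and argument indexes are guesses and need
 * checking by someone who knows the ZFS code. Intended to capture:
 *   1. write data entering the DMU (bytes per call),
 *   2. write-throttle delays imposed on the writer,
 *   3. zil_commit() latency,
 *   4. TXG sync duration for the pool named in $$1.
 */
#pragma D option quiet

/* 1. bytes handed to the DMU by file writes (assuming dmu_write_uio_dbuf()
 *    is the usual path and arg2 is the requested size) */
fbt::dmu_write_uio_dbuf:entry
{
        @dmu["bytes entering DMU per call"] = quantize(arg2);
}

/* 2. throttle delays ZFS decides to impose (same probe/arg as used above) */
delay-mintime
{
        @delay["throttle delay (ns)"] = quantize(arg2);
}

/* 3. how long each zil_commit() takes */
fbt::zil_commit:entry
{
        self->zc = timestamp;
}

fbt::zil_commit:return
/self->zc/
{
        @zil["zil_commit latency (ns)"] = quantize(timestamp - self->zc);
        self->zc = 0;
}

/* 4. TXG sync duration, as in dirty_and_duration.d */
txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        printf("txg synced in %d ms\n", (timestamp - start) / 1000000);
}

Saved as, say, writepath.d (name just for illustration), it would be run as "dtrace -s writepath.d tank", with the aggregations printed on Ctrl-C. The part I can't work out myself is item 1 on my list - where exactly the data is pulled from the writer (the socket/SMB side) into ZFS - which is why probe 1 above is the most doubtful piece.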