Date: Fri, 21 Dec 2018 03:18:15 +0000
From: Stilez Stilezy <stilezy@gmail.com>
To: freebsd-fs@freebsd.org
Subject: Dev help requested: short DTrace instrumentation script to narrow down cause of a longstanding ZFS write issue
Message-ID: <CAFwhr76CXkgUUUnbYHBby2KjDZ7DkLN-UTyGWbv=Bv12dWpcgQ@mail.gmail.com>
I'm experiencing a persistent timeout problem that only occurs during network writes across a 10G LAN to a ZFS pool. Work to date confirms it's a real server-side issue arising within ZFS, not the client, the network hardware, or FreeBSD's network I/O handling, and that it involves either ZFS mishandling dirty data free-up (there are bugs logged for this) or ZFS not correctly delaying a too-fast write source. Of course it could also be mis-set sysctls, but I've replicated it with a default FreeNAS setup, so that would have to be misconfigured as well. To save time, can I skip the detailed debugging done to date and just say that this is where I've got to so far? I asked about this previously, long ago, and it was suggested at the time that it could be a dirty data bug. With more experience under my belt, I'd like to dig deeper and try to narrow down the point that is bottlenecking, using DTrace.

Specifically, I'd like help with a short instrumentation script to capture four key events within the ZFS write data pathway, in order to pin down which part of the ZFS data flow is implicated in the bottleneck. I need help because it requires detailed knowledge of the ZFS codebase. I get the impression it could be a very short script - if I can narrow it down, maybe I can find a fix, or at least better evidence of the exact point where the bug/issue is arising.

*Symptoms:*

I'm using ZFS on enterprise hardware (Supermicro, Xeon E5-1620 v4, 128GB ECC, Intel P3700 NVMe SLOG/ZIL, mirrored 7200 rpm enterprise SAS3 HDD pool, 9311 SAS3 HBA in IT mode, Chelsio NICs, 10G fibre LAN, Netgear managed SFP+ switch). With a quiesced "known good" 10G network and just a single SMB, iSCSI or netcat client (the same happens on all three protocols) doing a single 100GB file write, I can see the free space in the network RCV buffer gradually being driven down to zero without getting time to recover much, causing the advertised TCP window to drop to << 80 bytes in tcpdump as it tries to cope; eventually (once the window has been driven to zero often enough) the file write session fails due to the persistent zero-window condition. Jumps in the network RCV queue seem to coincide with the moments just before disk activity in the HDD pool, adding to the evidence that it's related to the ZFS write pathway.

All the evidence suggests the network and client sides aren't the issue. What appears to be happening is that ZFS is misconfigured or hitting a bug: it's not pulling new data into the next TXG in a timely manner, or not delaying gracefully. File data is coming into the NIC RCV buffer at 10G speed, faster than ZFS is allowing it to be removed, and ZFS isn't effectively signalling the client to slow down or delay. (A quick way to watch the dirty-data side of this is sketched just below.)
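For the dirty-data side specifically, I think the well-known OpenZFS write-throttle watcher could be adapted into something like the fragment below. The probe name (txg-syncing), the dsl_pool_t member dp_dirty_total and the zfs_dirty_data_max kernel variable are only my guesses from reading about the write throttle - I haven't verified that they exist under those names on FreeBSD, so please correct me if they don't:

/*
 * dirty.d - print the amount of dirty data in each TXG as it starts
 * syncing, against the zfs_dirty_data_max ceiling.
 * Run as: dtrace -s dirty.d
 * Check first that the static probe is exposed: dtrace -l -n txg-syncing
 * If the plain backquote lookup fails, the module-qualified form
 * zfs`zfs_dirty_data_max may be needed.
 */
#pragma D option quiet

txg-syncing
{
	this->dp = (dsl_pool_t *)arg0;
	printf("%Y  %4d MB dirty of %4d MB max\n", walltimestamp,
	    this->dp->dp_dirty_total / 1024 / 1024,
	    `zfs_dirty_data_max / 1024 / 1024);
}

If that shows dirty data pinned at (or bouncing off) the maximum while the TCP window collapses, that would point squarely at the throttle/TXG side rather than the network.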
*dtrace:*

The most direct route to narrowing this down would be to instrument and log the timestamps, sizes, and time taken for (I think!) four events. I don't know the exact names used for these calls, so I have to describe their effect. Help with identifying the appropriate functions/instrumentation points, even if it's not obvious how to instrument them, would be great!

- There should be a function that gets called/polled when ZFS has space in its write pipeline; it checks whether any ongoing write has a chunk of data to be written and, if so, pulls data from the source stream/pipe/descriptor (in this case the network RCV buffer, but it could be any source of an ongoing file write). This will let me instrument the events where ZFS requests/gets source data for an ongoing file write - here, pulling it from the network buffer.
  *To instrument: timestamp, max bytes to read, actual bytes read.*

- Within the ZFS code, the write pathway involves (simplistically) newly received write data, and any related metadata/consequent updates, being added to the current TXG, written to the SLOG/ZIL, and ultimately written out to the pool.
  *To instrument for those three events: timestamp, bytes written, write-out time taken (ideally also the dirty data cache size before/after).*

Although the output file will be huge, it should be easy to analyse the timestamped list of events and data flow and see which of the last three stages is causing the bottleneck. A very rough first attempt at what I mean is in the PS below. I would enjoy doing the work and tracking this down, but I need a little help with the instrumentation side, which really needs developer competence. Can someone kind please help with an appropriate short instrumentation script to capture the raw data for these four items :) (and any other data that, in your experience, would be useful)?

Thank you!

Stilez
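PS: To make it concrete, here is my rough first sketch of the kind of script I mean. The function names (zfs_write, dmu_tx_delay, zil_commit, spa_sync) and the struct members (uio_resid, spa_dsl_pool, dp_dirty_total) are only my guesses at the relevant entry points in the write path - some of those functions are static and may be renamed, inlined or otherwise untraceable on a given kernel - so corrections are exactly the help I'm asking for:

#!/usr/sbin/dtrace -s
/*
 * zfswrite.d - rough sketch only; function names are guesses.
 * Verify the probes exist first, e.g.:
 *   dtrace -l | grep -E 'zfs_write|dmu_tx_delay|zil_commit|spa_sync'
 * zfs_write and dmu_tx_delay are static; if zfs_write is missing,
 * zfs_freebsd_write (which takes different arguments) may be the
 * traceable VOP entry point instead.
 */
#pragma D option quiet

/* 1. ZFS pulling write data from the source (the uio handed to the write call) */
fbt::zfs_write:entry
{
	self->wr_ts = timestamp;
	self->wr_resid = ((struct uio *)arg1)->uio_resid;	/* bytes offered */
}

fbt::zfs_write:return
/self->wr_ts/
{
	printf("%d zfs_write    bytes=%d took_us=%d\n", walltimestamp,
	    self->wr_resid, (timestamp - self->wr_ts) / 1000);
	self->wr_ts = 0;
	self->wr_resid = 0;
}

/* 2. The write throttle injecting a delay because dirty data is high */
fbt::dmu_tx_delay:entry
{
	printf("%d dmu_tx_delay dirty=%d\n", walltimestamp, arg1);
}

/* 3. ZIL/SLOG commit latency */
fbt::zil_commit:entry
{
	self->zil_ts = timestamp;
}

fbt::zil_commit:return
/self->zil_ts/
{
	printf("%d zil_commit   took_us=%d\n", walltimestamp,
	    (timestamp - self->zil_ts) / 1000);
	self->zil_ts = 0;
}

/* 4. TXG write-out to the pool, with dirty bytes at the start of the sync */
fbt::spa_sync:entry
{
	self->spa_ts = timestamp;
	self->spa_dirty = args[0]->spa_dsl_pool->dp_dirty_total;
}

fbt::spa_sync:return
/self->spa_ts/
{
	printf("%d spa_sync     dirty=%d took_us=%d\n", walltimestamp,
	    self->spa_dirty, (timestamp - self->spa_ts) / 1000);
	self->spa_ts = 0;
	self->spa_dirty = 0;
}

The idea is that post-processing the timestamped log should show whether the stalls line up with the throttle delays, the ZIL commits, or the TXG syncs - but if these are the wrong hooks entirely, please say so.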