Date: Fri, 1 Jul 2016 11:32:43 +1000
From: Paul Koch <paul.koch137@gmail.com>
To: Andrew Bates <andrewbates09@gmail.com>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: Re: ZFS ARC and mmap/page cache coherency question
Message-ID: <20160701113243.307739cc@splash.akips.com>
In-Reply-To: <CAPi5Lmm6RtXQ6UxzcfoRKtGC-LfBLJAW0qOy6=F5fh3mg-OB5w@mail.gmail.com>
References: <20160630140625.3b4aece3@splash.akips.com>
	<CAPi5Lmm6RtXQ6UxzcfoRKtGC-LfBLJAW0qOy6=F5fh3mg-OB5w@mail.gmail.com>
Hi Andrew, further info below...

> Heya Paul,
>
> How is your ZFS configured (zfs get all tank0)?
>
> These certainly aren't absolute, law, or perfect - but if you haven't
> yet, I suggest you take a peek at the following:
>
> * http://open-zfs.org/wiki/Performance_tuning
> * https://www.joyent.com/blog/bruning-questions-zfs-record-size
> * http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
>
> On Wed, Jun 29, 2016 at 9:06 PM, Paul Koch <paul.koch137@gmail.com> wrote:
> >
> > Posted this to -stable on the 15th of June, but got no feedback...
> >
> > We are trying to understand a performance issue when syncing large
> > mmap'ed files on ZFS.
> >
> > Example test box setup:
> >   FreeBSD 10.3-p5
> >   Intel i7-5820K 3.30GHz with 64G RAM
> >   6 * 2 Tbyte Seagate ST2000DM001-1ER164 in a ZFS stripe
> >
> > Read performance of a sequentially written large file on the pool is
> > typically around 950 Mbytes/sec using dd.
> >
> > Our software mmap's some large database files using MAP_NOSYNC, and
> > we call fsync() every 10 minutes when we know the file system is
> > mostly idle.  In our test setup, the database files are 1.1G, 2G,
> > 1.4G, 12G, 4.7G and ~20 small files (under 10M).  All of the memory
> > pages in the mmap'ed files are updated every minute with new values,
> > so the entire mmap'ed file needs to be synced to disk, not just
> > fragments.
> >
> > When the 10 minute fsync() occurs, gstat typically shows very little
> > disk reading and very high write speeds, which is what we expect.
> > But every 80 minutes we process the data in the large mmap'ed files
> > and store it in highly compressed blocks of a ~300G file using
> > pread/pwrite (i.e. not mmap'ed).  After that, the performance of the
> > next fsync() of the mmap'ed files falls off a cliff.  We assume that
> > is because the ARC has thrown away the cached data of the mmap'ed
> > files.  gstat shows lots of read/write contention, and lots of
> > things tend to stall waiting for disk.
> >
> > Is this just a lack of ZFS ARC and page cache coherency?
> >
> > Is there a way to prime the ARC with the mmap'ed files again before
> > we call fsync()?
> >
> > We've tried cat and read() on the mmap'ed files, but that doesn't
> > seem to touch the disk at all and the fsync() performance is still
> > poor, so it looks like the ARC is not being filled.  msync() doesn't
> > seem to be much different.  mincore() stats show the mmap'ed data is
> > entirely incore and referenced.
> >
> > Paul.

Here is our zfs get all:

NAME   PROPERTY              VALUE                  SOURCE
akips  type                  filesystem             -
akips  creation              Sat Apr  9  7:29 2016  -
akips  used                  835G                   -
akips  available             9.70T                  -
akips  referenced            96K                    -
akips  compressratio         1.00x                  -
akips  mounted               no                     -
akips  quota                 none                   default
akips  reservation           none                   default
akips  recordsize            128K                   default
akips  mountpoint            none                   local
akips  sharenfs              off                    default
akips  checksum              on                     default
akips  compression           off                    default
akips  atime                 off                    local
akips  devices               on                     default
akips  exec                  on                     default
akips  setuid                on                     default
akips  readonly              off                    default
akips  jailed                off                    default
akips  snapdir               hidden                 default
akips  aclmode               discard                default
akips  aclinherit            restricted             default
akips  canmount              on                     default
akips  xattr                 on                     default
akips  copies                1                      default
akips  version               5                      -
akips  utf8only              off                    -
akips  normalization         none                   -
akips  casesensitivity       sensitive              -
akips  vscan                 off                    default
akips  nbmand                off                    default
akips  sharesmb              off                    default
akips  refquota              none                   default
akips  refreservation        none                   default
akips  primarycache          all                    default
akips  secondarycache        all                    default
akips  usedbysnapshots       0                      -
akips  usedbydataset         96K                    -
akips  usedbychildren        835G                   -
akips  usedbyrefreservation  0                      -
akips  logbias               latency                default
akips  dedup                 off                    default
akips  mlslabel                                     -
akips  sync                  standard               default
akips  refcompressratio      1.00x                  -
akips  written               96K                    -
akips  logicalused           834G                   -
akips  logicalreferenced     9.50K                  -
akips  volmode               default                default
akips  filesystem_limit      none                   default
akips  snapshot_limit        none                   default
akips  filesystem_count      none                   default
akips  snapshot_count        none                   default
akips  redundant_metadata    all                    default

The problem appears to be similar to what is described here:

  http://zfs-discuss.opensolaris.narkive.com/tgP1NV9l/remedies-for-suboptimal-mmap-performance-on-zfs

So basically our problem is: we mmap a large file, which gets double
cached in both the ZFS ARC and the page cache.  We update every page in
the mmap'ed data, then flush it out every 10 minutes when we know the
disks are mostly idle.  Performance is great... unless the ZFS ARC no
longer holds the double-cached mmap'ed data, in which case the fsync()
has to go to disk and causes heaps of read/write contention, and
performance falls off a cliff.

What we want to know is: if there is no ARC/page cache coherency, how
do we prime the ARC with the same data again so we can get good write
performance on the fsync()?

Paul.