Date: Mon, 15 Aug 2016 15:15:01 +1000
From: Paul Koch <paul.koch137@gmail.com>
To: freebsd-hackers@freebsd.org
Subject: Re: ZFS ARC and mmap/page cache coherency question
Message-ID: <20160815151501.5f5b4a86@splash.akips.com>
In-Reply-To: <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
References: <20160630140625.3b4aece3@splash.akips.com> <CALXu0UfxRMnaamh%2Bpo5zp=iXdNUNuyj%2B7e_N1z8j46MtJmvyVA@mail.gmail.com> <20160703123004.74a7385a@splash.akips.com> <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org> <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
Just a followup to my original post about VM/ZFS ARC coherency. We've made a few simple changes to our application to get around the coherency issues, and observed some odd things.

Description of our app:

Very large scale ping/snmp poller/database (made up of 8 underlying databases - configuration, event, time-series, common strings, etc). Each database contains various sized mmap'ed files, ranging from 512 bytes to many gigabytes. All mmap'ed files are mapped with the MAP_NOSYNC flag. The poller updates every page of the mmap'ed data every minute. We fsync the mmap'ed data every 10 minutes when the system is mostly idle.

Everything works fine while the mmap'ed data is in both the VM and ZFS caches. Every 80 minutes we process a very large amount of cached poller data, which pushes the mmap'ed data out of the ARC. The performance of the next 10 minute fsync then falls off a cliff, causing lots of read/write contention. This is due to the lack of VM/ZFS ARC coherency.

We've changed our sync algorithm to something like:

 1. Exclusive lock on the entire database
 2. fsync() all the small 512 byte mmap'ed files
 3. Write out new copies of all the other mmap'ed files
     - mprotect
     - write
     - rename
     - munprotect
 4. Release exclusive lock
 5. Signal all database processes so they reopen the database.

Our sync now completes in a very predictable manner and is significantly faster.

But we observed some odd things:

 1. The rename in step 3 above can be painfully slow for large files. Not sure what is going on, but we also noticed that deleting the same files using unlink(2) or rm(1) was painfully slow too. It is much, much faster to truncate(2) the large files to zero bytes before calling rename(2) or unlink(2). Why is that?

 2. We are using both fsync(2) and write(2) in the above sync. We observed that the order was very important. If we write/rename the large mmap'ed files first and then fsync the small 512 byte files, the fsync sits in zio for some time. Doing the fsync calls first and then the large write/renames is much faster. Not sure what is going on there.

Paul.

--
Paul Koch | Founder | CEO
AKIPS Network Monitor | akips.com
Brisbane, Australia
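
For readers following the thread, step 3 of the sync algorithm above (mprotect, write, rename, munprotect) and the truncate-before-unlink workaround from observation 1 might look roughly like the C sketch below. This is an illustration under stated assumptions, not the AKIPS code: the function names and the ".new" temporary suffix are hypothetical, and "munprotect" is read here as a second mprotect(2) call that restores write access. The database lock, the fsync of the small files, and the signalling of other processes are left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes of buf to fd, looping over short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: replace the file at `path` with a fresh copy of the
 * page-aligned mapping `map` (mprotect, write, rename, unprotect).
 * Returns 0 on success, -1 on failure.
 */
int rewrite_mapped_file(const char *path, void *map, size_t len)
{
    char tmp[1024];
    int fd, rc = -1;

    assert(path != NULL && map != NULL);
    /* Assumed naming scheme for the temporary copy: "<path>.new". */
    if (snprintf(tmp, sizeof(tmp), "%s.new", path) >= (int)sizeof(tmp))
        return -1;

    /* Make the mapping read-only so it cannot change during the copy. */
    if (mprotect(map, len, PROT_READ) != 0)
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        int wrote = write_all(fd, map, len);
        /* rename(2) atomically swaps the new copy in under the old name. */
        if (close(fd) == 0 && wrote == 0 && rename(tmp, path) == 0)
            rc = 0;
        else
            unlink(tmp);
    }

    /* Restore write access whether or not the copy succeeded. */
    if (mprotect(map, len, PROT_READ | PROT_WRITE) != 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical helper for observation 1: shrink a large file to zero
 * bytes before unlinking it.
 */
int truncate_then_unlink(const char *path)
{
    assert(path != NULL);
    if (truncate(path, 0) != 0)
        return -1;
    return unlink(path);
}
```

The sketch applies the truncate only on the unlink path. Truncating a file that is still mapped leaves the mapping pointing past end-of-file, where any access raises SIGBUS, so a truncate before the rename would only be safe once every process has dropped or is about to replace its mapping.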