From: Paul Koch
To: freebsd-hackers@freebsd.org
Date: Mon, 15 Aug 2016 15:15:01 +1000
Subject: Re: ZFS ARC and mmap/page cache coherency question

Just a follow-up to my original post about VM/ZFS ARC coherency. We've made a few simple changes to our application to work around the coherency issues, and have observed some odd things.

Description of our app: a very large scale ping/SNMP poller and database, made up of 8 underlying databases (configuration, event, time-series, common strings, etc.). Each database contains mmap'ed files of various sizes, ranging from 512 bytes to many gigabytes. All files are mmap'ed with the MAP_NOSYNC flag.
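A minimal sketch, in C, of how one of these database files might be created and mapped. The helper name `db_map_file` is hypothetical and not taken from the application; the `#ifndef` fallback exists only so the sketch compiles on systems that lack the FreeBSD-specific MAP_NOSYNC flag.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* MAP_NOSYNC is FreeBSD-specific: dirty pages are flushed when the pager
 * needs them (or on fsync/msync) rather than every time the syncer runs.
 * Elsewhere, fall back to 0 so this sketch still compiles. */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/* Hypothetical helper: open (creating if needed) a database file, set its
 * length to `size` bytes and map it shared with MAP_NOSYNC.  On success the
 * descriptor is stored in *fdp so the caller can fsync() it later. */
void *db_map_file(const char *path, size_t size, int *fdp)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return MAP_FAILED;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }
    *fdp = fd;
    return p;
}
```

With MAP_NOSYNC the dirty pages stay in memory until the pager needs them or the application explicitly calls fsync(2) on `*fdp` (or msync(2) on the mapping), which is why the periodic fsync in the poller matters.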
The poller updates every page of the mmap'ed data every minute. We fsync the mmap'ed data every 10 minutes, when the system is mostly idle. Everything works fine while the mmap'ed data is in both the VM page cache and the ZFS ARC. Every 80 minutes, however, we process a very large amount of cached poller data, which pushes the mmap'ed data out of the ARC. The next 10-minute fsync then falls off a performance cliff and causes lots of read/write contention. This is due to the lack of coherency between the VM and the ZFS ARC.

We've changed our sync algorithm to something like:

 1. Take an exclusive lock on the entire database.
 2. fsync() all the small 512-byte mmap'ed files.
 3. Write out new copies of all the other mmap'ed files:
     - mprotect() the mapping read-only
     - write() the new copy
     - rename() it into place
     - unprotect the mapping (mprotect() it back to read/write)
 4. Release the exclusive lock.
 5. Signal all database processes so they reopen the database.

Our sync now completes in a very predictable time and is significantly faster.

But we observed some odd things:

 1. The rename in step 3 above can be painfully slow for large files. We're not sure what is going on, but we also noticed that deleting the same files with unlink(2) or rm(1) is just as painfully slow. It is much, much faster to truncate(2) the large files to zero bytes before calling rename(2) or unlink(2). Why is that?

 2. We use both fsync(2) and write(2) in the above sync, and observed that the order matters a lot. If we write/rename the large mmap'ed files first and then fsync the small 512-byte files, the fsync sits in zio for some time. Doing the fsync calls first and then the large write/renames is much faster. Again, not sure what is going on there.

Paul.
-- 
Paul Koch | Founder | CEO
AKIPS Network Monitor | akips.com
Brisbane, Australia
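The revised sync procedure (steps 1 to 4, in the fsync-first order found to be faster, plus the truncate-before-unlink workaround from the first observation) could be sketched roughly as follows. This is not the application's actual code: the `struct db_file` layout, the `.new` temporary-file suffix, the function names and the use of flock(2) for the database-wide exclusive lock are all assumptions. Step 5 is left out because the post does not say how the other processes are signalled, and since the post does not say whether the new copies are fsync'ed before the rename, the sketch does not do so either.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one mmap'ed database file. */
struct db_file {
    const char *path;   /* file name */
    int         fd;     /* descriptor backing the mapping */
    void       *map;    /* start of the MAP_SHARED mapping (NULL if unused) */
    size_t      size;   /* mapping length in bytes */
};

/* write() the whole buffer, retrying on short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Step 3 for one large file: mprotect read-only, write a copy to
 * "<path>.new", rename it over <path>, then make the mapping writable
 * again.  The old fd and mapping still refer to the replaced inode. */
int db_rewrite_copy(struct db_file *f)
{
    char tmp[1024];
    int len = snprintf(tmp, sizeof tmp, "%s.new", f->path);
    if (len < 0 || (size_t)len >= sizeof tmp)
        return -1;
    if (mprotect(f->map, f->size, PROT_READ) != 0)
        return -1;

    int rc = -1;
    int out = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out >= 0) {
        int wr = write_all(out, f->map, f->size);
        int cl = close(out);
        if (wr == 0 && cl == 0)
            rc = rename(tmp, f->path);
        else
            unlink(tmp);
    }
    /* Restore write access even if the copy failed. */
    if (mprotect(f->map, f->size, PROT_READ | PROT_WRITE) != 0)
        rc = -1;
    return rc;
}

/* Steps 1-4: take the lock, fsync() the small files first (the order the
 * post found faster), rewrite the large files, release the lock.
 * Step 5, signalling the other processes, is application-specific. */
int db_sync(int lock_fd, struct db_file *small, size_t nsmall,
            struct db_file *large, size_t nlarge)
{
    if (flock(lock_fd, LOCK_EX) != 0)
        return -1;
    int rc = 0;
    for (size_t i = 0; i < nsmall; i++)
        if (fsync(small[i].fd) != 0)
            rc = -1;
    for (size_t i = 0; i < nlarge; i++)
        if (db_rewrite_copy(&large[i]) != 0)
            rc = -1;
    if (flock(lock_fd, LOCK_UN) != 0)
        rc = -1;
    return rc;
}

/* Workaround from the post: truncate a large file to zero bytes before
 * unlinking it.  Only for files no process still has mapped. */
int db_remove_large(const char *path)
{
    if (truncate(path, 0) != 0)
        return -1;
    return unlink(path);
}
```

After `db_sync()` returns, each large file's old descriptor and mapping still refer to the replaced, now-unlinked inode, which is why every process has to reopen the database afterwards. flock(2) locks are advisory, so readers would also need to take LOCK_SH for the exclusive lock to have any effect. `db_remove_large()` should only be applied to files nobody still has mapped, since touching mapped pages beyond the new end of file faults.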