From: Cedric Blancher <cedric.blancher@gmail.com>
Date: Sat, 2 Jul 2016 20:26:55 +0200
Subject: Re: ZFS ARC and mmap/page cache coherency question
To: Paul Koch
Cc: freebsd-hackers@freebsd.org

Short story: ZFS was tacked onto the kernel and was never properly
integrated into the VM page management, which leads to dramatically
poor performance for anything that uses mmap() for write I/O. The ARC
and the VM page cache each keep their own copy of file data, so every
page dirtied through a mapping has to be copied from the page cache
into the ARC on writeback. This was solved in Oracle Solaris by the
big VM allocator rewrite, which landed after OpenSolaris was made
closed source again. Without a comparable rewrite of the VM system
this problem is unsolvable.
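
If you want to see the effect in isolation, something along these
lines (an untested sketch; the file name is made up, compile with
plain cc and point it at a file on a ZFS dataset) dirties every page
of a MAP_NOSYNC mapping and times the fsync(), which is essentially
what Paul's workload does every 10 minutes:

/*
 * Sketch: dirty an entire MAP_NOSYNC mapping, then time fsync().
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    if (argc != 2)
        errx(1, "usage: mmapsync <file-on-zfs>");

    int fd = open(argv[1], O_RDWR);
    if (fd == -1)
        err(1, "open");

    struct stat st;
    if (fstat(fd, &st) == -1)
        err(1, "fstat");

    /* MAP_NOSYNC: keep the syncer from flushing behind our back. */
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (p == MAP_FAILED)
        err(1, "mmap");

    /* Touch one byte per page so the whole mapping is dirty. */
    long pgsz = sysconf(_SC_PAGESIZE);
    for (off_t off = 0; off < st.st_size; off += pgsz)
        p[off]++;

    /* Time the writeback of all dirty pages into ZFS. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd) == -1)
        err(1, "fsync");
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("fsync of %jd MB took %.2f s\n",
        (intmax_t)(st.st_size >> 20),
        (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return (0);
}

Run it once on a freshly written file and once after pushing a few
hundred gigabytes of unrelated data through the ARC; if the eviction
theory below is right, the second fsync() is the one that falls off
the cliff.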

Ced

On 30 June 2016 at 06:06, Paul Koch wrote:
>
> Posted this to -stable on the 15th of June, but no feedback...
>
> We are trying to understand a performance issue when syncing large
> mmap'ed files on ZFS.
>
> Example test box setup:
>   FreeBSD 10.3-p5
>   Intel i7-5820K 3.30GHz with 64G RAM
>   6 * 2 Tbyte Seagate ST2000DM001-1ER164 in a ZFS stripe
>
> Read performance of a sequentially written large file on the pool is
> typically around 950 Mbytes/sec using dd.
>
> Our software mmap's some large database files using MAP_NOSYNC, and
> we call fsync() every 10 minutes when we know the file system is
> mostly idle. In our test setup, the database files are 1.1G, 2G,
> 1.4G, 12G and 4.7G, plus ~20 small files (under 10M). All of the
> memory pages in the mmap'ed files are updated every minute with new
> values, so the entire mmap'ed file needs to be synced to disk, not
> just fragments.
>
> When the 10 minute fsync() occurs, gstat typically shows very few
> disk reads and very high write speeds, which is what we expect. But
> every 80 minutes we process the data in the large mmap'ed files and
> store it in highly compressed blocks of a ~300G file using
> pread()/pwrite() (i.e. not mmap'ed). After that, the performance of
> the next fsync() of the mmap'ed files falls off a cliff. We assume
> this is because the ARC has thrown away the cached data of the
> mmap'ed files. gstat shows lots of read/write contention, and lots
> of things tend to stall waiting for disk.
>
> Is this just a lack of coherency between the ZFS ARC and the page
> cache?
>
> Is there a way to prime the ARC with the mmap'ed files again before
> we call fsync()?
>
> We've tried cat and read() on the mmap'ed files, but that doesn't
> seem to touch the disk at all and the fsync() performance is still
> poor, so it looks like the ARC is not being refilled. msync()
> doesn't behave much differently. mincore() stats show the mmap'ed
> data is entirely incore and referenced.
>
> Paul.
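
One note on the mincore() data point: mincore() only reports whether
pages are resident in the VM page cache and whether they have been
referenced or modified; it knows nothing about the ARC, and those are
exactly the two caches that are not coherent here. So "entirely incore
and referenced" is consistent with the ARC having dropped every block
of those files. A check along the lines of what you describe (again a
hypothetical sketch) can only ever show the page cache half of the
story:

/*
 * Sketch: report how much of a file's mapping is resident according
 * to mincore(2).  "Resident" means the VM page cache, not the ARC.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    if (argc != 2)
        errx(1, "usage: residency <file>");

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        err(1, "open");

    struct stat st;
    if (fstat(fd, &st) == -1)
        err(1, "fstat");

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        err(1, "mmap");

    long pgsz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pgsz - 1) / pgsz;
    char *vec = malloc(npages);
    if (vec == NULL)
        err(1, "malloc");

    /* One status byte per page; see mincore(2) for the flag bits. */
    if (mincore(p, st.st_size, vec) == -1)
        err(1, "mincore");

    size_t incore = 0, referenced = 0;
    for (size_t i = 0; i < npages; i++) {
        if (vec[i] & MINCORE_INCORE)
            incore++;
        if (vec[i] & MINCORE_REFERENCED)
            referenced++;
    }

    printf("%zu/%zu pages incore, %zu referenced\n",
        incore, npages, referenced);
    return (0);
}

When that reports 100% incore while gstat shows fsync() reading from
disk, you are looking at the coherency gap directly.

--
Cedric Blancher
Institute Pasteur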