From owner-freebsd-fs@FreeBSD.ORG Sat Jan 25 06:07:59 2014
From: Garrett Wollman
To: freebsd-fs@freebsd.org
Date: Sat, 25 Jan 2014 01:07:54 -0500 (EST)
Subject: ZFS read performance anomaly / performance regression
Message-ID: <21219.21690.165105.561362@hergotha.csail.mit.edu>

Consider three
1-TiB files, equal in dignity, all on the same ZFS dataset:

-rw-r--r--  1 root  wheel  1099511627776 Jan  7 19:03 test
-rw-r--r--  1 root  wheel  1099511627776 Jan 24 22:38 test-truncate
-rw-r--r--  1 root  wheel  1099511628800 Jan 24 21:32 test2

File "test" was created by writing zeros; "test2" was created by
seeking to 1 TiB and writing a block of zeros; and "test-truncate" was
created with "truncate -s 1T".  Compression is disabled on this
dataset, and "uname -a" reports:

FreeBSD nfs-prod-3.csail.mit.edu 9.2-RELEASE-p2 FreeBSD 9.2-RELEASE-p2
#2 r260414M: Tue Jan 7 17:32:43 EST 2014
wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL amd64

A naive observer would expect reading "test" to be significantly
slower than reading either of the other files, since it requires
reading data from the disk, and would expect no significant difference
between the other two, which consist entirely of cacheable metadata
and have little or no data stored on disk.  But no:

x noholes
+ oseek
* truncate
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     71.300931      82.89801     77.044237     76.452097     4.3961747
+   5     97.931676     111.36284     108.52787      106.5687     5.5595081
Difference at 95.0% confidence
        30.1166 +/- 7.30929
        39.3928% +/- 9.56062%
        (Student's t, pooled s = 5.01171)
*   5     89.187745     101.74893     98.825988     97.497923     4.9188155
Difference at 95.0% confidence
        21.0458 +/- 6.80337
        27.5281% +/- 8.89887%
        (Student's t, pooled s = 4.66482)

(Values are the time to read 128 GiB of each file, which is more than
the available memory for caching on this server.  Measurements were
interleaved abcabcabcabcabc, but the server was neither rebooted nor
completely idle.  Lower numbers are better.)
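For reference, the three creation methods can be sketched at a much
smaller scale (64-MiB stand-ins for the 1-TiB originals; note that GNU
dd spells FreeBSD dd's "oseek=" as "seek="):

```shell
# Hedged sketch of the three creation methods, at 64 MiB instead of
# 1 TiB.  Run in a scratch directory.
cd "$(mktemp -d)"

# "test": dense file -- every block actually written out
dd if=/dev/zero of=test bs=1048576 count=64 2>/dev/null

# "test2": seek past 64 MiB and write one 1-MiB block, leaving a hole
# (FreeBSD dd calls this operand oseek=; GNU dd calls it seek=)
dd if=/dev/zero of=test2 bs=1048576 seek=64 count=1 2>/dev/null

# "test-truncate": pure metadata; no data blocks at all
truncate -s 64M test-truncate

ls -l test test2 test-truncate
```

As in the listing above, "test" and "test-truncate" end up with
identical logical sizes, and "test2" is one block larger because the
written block lands past the seek offset.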
Now consider the same test on an older, slower (but also completely
idle) server, running an older FreeBSD release but with an identical
zpool configuration (88 spindles plus log and cache SSDs):

FreeBSD nfs-backup-1.csail.mit.edu 9.1-RELEASE-p7 FreeBSD 9.1-RELEASE-p7
#34 r255669M: Fri Oct 11 13:04:40 EDT 2013
wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL amd64

Files were created in the same manner in a fresh, unsnapshotted
dataset on this server, and the same shell script was used to time the
reads (even though the older machine has significantly less memory):

x noholes2 (9.1)
+ oseek2 (9.1)
* truncate2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     82.813107     83.970546     83.527358     83.427247    0.43568711
+   5     47.283031     51.450488     49.063725     49.493986     1.6446236
Difference at 95.0% confidence
        -33.9333 +/- 1.75456
        -40.6741% +/- 2.10311%
        (Student's t, pooled s = 1.20304)
*   5     44.460431     46.148431     44.636276     44.937538    0.70509763
Difference at 95.0% confidence
        -38.4897 +/- 0.854768
        -46.1357% +/- 1.02457%
        (Student's t, pooled s = 0.586083)

This is more like what I was expecting!  But there's still this
anomaly:

x oseek2 (9.1)
+ truncate2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     47.283031     51.450488     49.063725     49.493986     1.6446236
+   5     44.460431     46.148431     44.636276     44.937538    0.70509763
Difference at 95.0% confidence
        -4.55645 +/- 1.84536
        -9.20606% +/- 3.72846%
        (Student's t, pooled s = 1.2653)

Why is it 9% faster to read the beginning of a 1-TiB file that was
created with truncate(1) than to read the beginning of a 1-TiB file
that was created with "dd oseek="?
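(The "same shell script" used to time the reads is not included in the
post; the interleaved a-b-c measurement loop it describes might look
roughly like the following hedged reconstruction, with tiny stand-in
files instead of the 1-TiB originals and a GNU-date millisecond timer:)

```shell
# Assumed reconstruction of an interleaved abcabc... timing loop; NOT
# the author's actual script.  Sizes are tiny stand-ins.
cd "$(mktemp -d)"

# three stand-in files, created as in the original experiment
dd if=/dev/zero of=test bs=1048576 count=4 2>/dev/null
dd if=/dev/zero of=test2 bs=1048576 seek=4 count=1 2>/dev/null
truncate -s 4M test-truncate

# five passes, interleaved a-b-c as in the measurements above
for i in 1 2 3 4 5; do
    for f in test test2 test-truncate; do
        start=$(date +%s%N)                  # GNU date, nanoseconds
        dd if="$f" of=/dev/null bs=1048576 2>/dev/null
        end=$(date +%s%N)
        echo "$f $(( (end - start) / 1000000 )) ms" >> runs.log
    done
done

cat runs.log
```

Interleaving the three files within each pass, rather than running
five passes of one file back-to-back, spreads any drift in background
load evenly across the three samples.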
Even more disturbing: the newer, faster server is in fact faster when
reading from disk:

x noholes (9.2)
+ noholes2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     71.300931      82.89801     77.044237     76.452097     4.3961747
+   5     82.813107     83.970546     83.527358     83.427247    0.43568711
Difference at 95.0% confidence
        6.97515 +/- 4.55587
        9.12356% +/- 5.95912%
        (Student's t, pooled s = 3.12379)

But the faster (9.2) server is much, much slower when reading from
holes:

x truncate (9.2)
+ truncate2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     89.187745     101.74893     98.825988     97.497923     4.9188155
+   5     44.460431     46.148431     44.636276     44.937538    0.70509763
Difference at 95.0% confidence
        -52.5604 +/- 5.1245
        -53.9092% +/- 5.25601%
        (Student's t, pooled s = 3.51368)

Repeating the same experiments in a fresh dataset with compression
turned on gives the expected results: ZFS automatically punches a big
hole in the "no holes" test file, and the read performance is
indistinguishable among all three creation methods.
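(Whether a filesystem has actually allocated blocks for a file --
i.e., whether a hole got punched -- can be checked by comparing the
logical size from ls(1) against the allocated size from du(1).  A
minimal, generic illustration, not specific to ZFS or to the datasets
above:)

```shell
# Compare logical size vs. allocated blocks for a dense and a sparse
# file of the same logical size.  Assumes any filesystem that supports
# sparse files.
cd "$(mktemp -d)"

# dense: 16 MiB of zero blocks actually written
dd if=/dev/zero of=dense bs=1048576 count=16 2>/dev/null

# sparse: same logical size, but only metadata
truncate -s 16M sparse

ls -l dense sparse   # identical logical sizes
du -k dense sparse   # allocated KiB: sparse is (near) zero
```

On a compressed ZFS dataset the "dense" file of zeros would also show
near-zero allocation, which is exactly the hole-punching effect
described above.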
Unfortunately, this means that the newer, faster 9.2 server slows all
accesses down to the speed of reading holes, whereas the older, slower
9.1 server speeds all accesses up:

x noholes+lzjb (9.2)
+ noholes2+lzjb (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     100.23982     103.88802     102.84441     102.38784     1.4890753
+   5     43.747693     45.784714     44.914809     44.902244    0.82465096
Difference at 95.0% confidence
        -57.4856 +/- 1.75541
        -56.1449% +/- 1.71447%
        (Student's t, pooled s = 1.20362)

The 9.2 machine is unquestionably faster, at least when shoving data
around without touching the filesystem ("dd if=/dev/zero of=/dev/null"
with other parameters as in the filesystem-based tests):

x overhead (9.2)
+ overhead2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5      5.971316      6.357959      5.980432     6.0935198    0.17238797
+   5      8.015839      9.168025      8.948119     8.6249242    0.55062308
Difference at 95.0% confidence
        2.5314 +/- 0.595023
        41.5426% +/- 9.76484%
        (Student's t, pooled s = 0.407985)

What's going on here?

-GAWollman