Date:      Sat, 25 Jan 2014 11:43:28 -0800
From:      Matthew Ahrens <mahrens@delphix.com>
To:        Garrett Wollman <wollman@freebsd.org>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: ZFS read performance anomaly / performance regression
Message-ID:  <CAJjvXiHnBq3bg-ZGgyzFw6WBNau0SmoLUfxWLtFCnBwb13fqaQ@mail.gmail.com>
In-Reply-To: <21219.21690.165105.561362@hergotha.csail.mit.edu>
References:  <21219.21690.165105.561362@hergotha.csail.mit.edu>

How are you reading the files?  Could you be reading in less than
block-size chunks (default 128K)?

This sounds similar to the issue discussed in the thread with the subject
"ZFS slow reads for unallocated blocks", which compared sparse vs. cached
reads.  Since your storage is so fast, prefetch will probably read and
cache the blocks by the time you issue the read() syscall, so you may be
hitting a similar issue.

To quote my conclusion from the previous thread:

Yes, this is also a problem on illumos, though much less so on my system,
only about 2x.  It looks like the difference is due to the fact that the
zeroed dbufs are not cached, so we have to zero the entire dbuf (e.g. 128k)
for every read syscall (e.g. 8k).  Increasing the size of the reads to
match the recordsize results in performance parity between reading cached
data and sparse zeros.
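
A quick way to check whether read size is the culprit is to time the same
file with reads smaller than the recordsize and with reads that match it.
A hypothetical invocation (the path, block sizes, and counts below are
illustrative assumptions, sized for 128 GiB of data):

  # 8k reads: every read() of a sparse region may zero an entire 128k dbuf
  dd if=/tank/test-truncate of=/dev/null bs=8k count=16777216

  # 128k reads: each dbuf is zeroed only once
  dd if=/tank/test-truncate of=/dev/null bs=128k count=1048576

If the two runs converge, the per-read dbuf zeroing described above is the
likely explanation.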

--matt


On Fri, Jan 24, 2014 at 10:07 PM, Garrett Wollman <wollman@freebsd.org> wrote:

> Consider three 1-TiB files, equal in dignity, all on the same ZFS
> dataset:
>
> -rw-r--r--  1 root  wheel  1099511627776 Jan  7 19:03 test
> -rw-r--r--  1 root  wheel  1099511627776 Jan 24 22:38 test-truncate
> -rw-r--r--  1 root  wheel  1099511628800 Jan 24 21:32 test2
>
> File "test" was created by writing zeros; "test2" was created by
> seeking to 1 TiB and writing a block of zeros, and "test-truncate" was
> created with "truncate -s 1T".  Compression is disabled on this
> dataset, and "uname -a" reports:
>
> FreeBSD nfs-prod-3.csail.mit.edu 9.2-RELEASE-p2 FreeBSD 9.2-RELEASE-p2 #2
> r260414M: Tue Jan  7 17:32:43 EST 2014     wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL
>  amd64
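>
> (The exact creation commands weren't included; based on the sizes above
> they were presumably along these lines.  The block sizes here are
> guesses, though the extra 1024 bytes on "test2" suggest a 1 KiB write
> after the seek:)
>
>   # "test": 1 TiB of zeros actually written out
>   dd if=/dev/zero of=test bs=1m count=1048576
>   # "test2": seek to 1 TiB, then write a single 1 KiB block of zeros
>   dd if=/dev/zero of=test2 bs=1k oseek=1g count=1
>   # "test-truncate": a pure hole, no data written
>   truncate -s 1T test-truncate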
>
> A naive observer would expect that reading "test" should be
> significantly slower than reading either of the other files, since it
> requires reading data from the disk, and that there should be no
> significant difference between the other two, which consist entirely
> of cacheable metadata and have no, or very little, data stored on
> disk.
>
> But no:
>
> x noholes
> + oseek
> * truncate
>
> [ministat distribution plot omitted]
>     N           Min           Max        Median           Avg        Stddev
> x   5     71.300931      82.89801     77.044237     76.452097     4.3961747
> +   5     97.931676     111.36284     108.52787      106.5687     5.5595081
> Difference at 95.0% confidence
>         30.1166 +/- 7.30929
>         39.3928% +/- 9.56062%
>         (Student's t, pooled s = 5.01171)
> *   5     89.187745     101.74893     98.825988     97.497923     4.9188155
> Difference at 95.0% confidence
>         21.0458 +/- 6.80337
>         27.5281% +/- 8.89887%
>         (Student's t, pooled s = 4.66482)
>
> (Values are the time to read 128 GiB of each file, which is more than
> the available memory for caching on this server.  Measurements were
> interleaved abcabcabcabcabc, but the server was neither rebooted nor
> completely idle.  Lower numbers are better.)
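>
> (The timing script itself wasn't attached; a minimal sketch of an
> interleaved run of that shape might look like the following, with the
> file names, block size, and count being assumptions rather than the
> actual parameters:)
>
>   for i in 1 2 3 4 5; do
>     for f in test test2 test-truncate; do
>       # append each elapsed time to a per-file log, 128 GiB per run
>       /usr/bin/time -a -o $f.times dd if=$f of=/dev/null bs=1m count=131072
>     done
>   done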
>
> Now consider the same test on an older, slower (but also completely
> idle) server, running an older FreeBSD release, but with an identical
> zpool configuration (88 spindles plus log and cache SSDs):
>
> FreeBSD nfs-backup-1.csail.mit.edu 9.1-RELEASE-p7 FreeBSD 9.1-RELEASE-p7
> #34 r255669M: Fri Oct 11 13:04:40 EDT 2013     wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL
>  amd64
>
> Files were created in the same manner in a fresh, unsnapshotted
> dataset on this server, and the same shell script was used to time
> reads (even though the older machine has significantly less memory):
>
> x noholes2 (9.1)
> + oseek2 (9.1)
> * truncate2 (9.1)
>
> [ministat distribution plot omitted]
>     N           Min           Max        Median           Avg        Stddev
> x   5     82.813107     83.970546     83.527358     83.427247    0.43568711
> +   5     47.283031     51.450488     49.063725     49.493986     1.6446236
> Difference at 95.0% confidence
>         -33.9333 +/- 1.75456
>         -40.6741% +/- 2.10311%
>         (Student's t, pooled s = 1.20304)
> *   5     44.460431     46.148431     44.636276     44.937538    0.70509763
> Difference at 95.0% confidence
>         -38.4897 +/- 0.854768
>         -46.1357% +/- 1.02457%
>         (Student's t, pooled s = 0.586083)
>
> This is more like what I was expecting!  But there's still this
> anomaly:
>
> x oseek2 (9.1)
> + truncate2 (9.1)
>
> [ministat distribution plot omitted]
>     N           Min           Max        Median           Avg        Stddev
> x   5     47.283031     51.450488     49.063725     49.493986     1.6446236
> +   5     44.460431     46.148431     44.636276     44.937538    0.70509763
> Difference at 95.0% confidence
>         -4.55645 +/- 1.84536
>         -9.20606% +/- 3.72846%
>         (Student's t, pooled s = 1.2653)
>
> Why is it 9% faster to read the beginning of a 1 TiB file that was
> created with truncate(1) than to read the beginning of a 1 TiB file
> that was created with dd oseek=?
>
> Even more disturbing: the newer, faster server is in fact faster when
> reading from disk:
>
> x noholes (9.2)
> + noholes2 (9.1)
>
> [ministat distribution plot omitted]
>     N           Min           Max        Median           Avg        Stddev
> x   5     71.300931      82.89801     77.044237     76.452097     4.3961747
> +   5     82.813107     83.970546     83.527358     83.427247    0.43568711
> Difference at 95.0% confidence
>         6.97515 +/- 4.55587
>         9.12356% +/- 5.95912%
>         (Student's t, pooled s = 3.12379)
>
> But the faster (9.2) server is much, much slower when reading from
> holes:
>
> x truncate (9.2)
> + truncate2 (9.1)
>
> [ministat distribution plot omitted]
>     N           Min           Max        Median           Avg        Stddev
> x   5     89.187745     101.74893     98.825988     97.497923     4.9188155
> +   5     44.460431     46.148431     44.636276     44.937538    0.70509763
> Difference at 95.0% confidence
>         -52.5604 +/- 5.1245
>         -53.9092% +/- 5.25601%
>         (Student's t, pooled s = 3.51368)
>
> Repeating the same experiments in a fresh dataset with compression
> turned on gives the expected results: ZFS automatically punches a big
> hole in the "no holes" test file, and the read performance is
> indistinguishable among all three creation methods.  Unfortunately,
> this means that the newer, faster 9.2 server slows all accesses down
> to the speed of reading holes, whereas the older, slower 9.1 server
> speeds all accesses up:
>
> x noholes+lzjb (9.2)
> + noholes2+lzjb (9.1)
>
> [ministat distribution plot omitted]
>     N           Min           Max        Median           Avg        Stddev
> x   5     100.23982     103.88802     102.84441     102.38784     1.4890753
> +   5     43.747693     45.784714     44.914809     44.902244    0.82465096
> Difference at 95.0% confidence
>         -57.4856 +/- 1.75541
>         -56.1449% +/- 1.71447%
>         (Student's t, pooled s = 1.20362)
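>
> (One way to observe the hole punching described above is to compare the
> apparent and allocated sizes of the "no holes" file on the compressed
> dataset; the file name here is assumed:)
>
>   ls -l test   # apparent size: 1 TiB
>   du -h test   # allocated size: near zero, since all-zero blocks are
>                # stored as holes when compression is enabled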
>
> The 9.2 machine is unquestionably faster, at least when shoving data
> around without touching the filesystem ("dd if=/dev/zero of=/dev/null"
> with other parameters as for the filesystem-based tests):
>
> x overhead (9.2)
> + overhead2 (9.1)
>
> [ministat distribution plot omitted]
>     N           Min           Max        Median           Avg        Stddev
> x   5      5.971316      6.357959      5.980432     6.0935198    0.17238797
> +   5      8.015839      9.168025      8.948119     8.6249242    0.55062308
> Difference at 95.0% confidence
>         2.5314 +/- 0.595023
>         41.5426% +/- 9.76484%
>         (Student's t, pooled s = 0.407985)
>
> What's going on here?
>
> -GAWollman
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>


