From owner-freebsd-fs@FreeBSD.ORG Sat Jan 25 06:07:59 2014
From: Garrett Wollman
To: freebsd-fs@freebsd.org
Date: Sat, 25 Jan 2014 01:07:54 -0500 (EST)
Subject: ZFS read performance anomaly / performance regression
Message-ID: <21219.21690.165105.561362@hergotha.csail.mit.edu>

Consider three
1-TiB files, equal in dignity, all on the same ZFS dataset:

-rw-r--r--  1 root  wheel  1099511627776 Jan  7 19:03 test
-rw-r--r--  1 root  wheel  1099511627776 Jan 24 22:38 test-truncate
-rw-r--r--  1 root  wheel  1099511628800 Jan 24 21:32 test2

File "test" was created by writing zeros; "test2" was created by
seeking to 1 TiB and writing a block of zeros; and "test-truncate" was
created with "truncate -s 1T".  Compression is disabled on this
dataset, and "uname -a" reports:

FreeBSD nfs-prod-3.csail.mit.edu 9.2-RELEASE-p2 FreeBSD 9.2-RELEASE-p2
#2 r260414M: Tue Jan 7 17:32:43 EST 2014
wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL amd64

A naive observer would expect reading "test" to be significantly
slower than reading either of the other files, since it requires
reading data from the disk, and would expect no significant difference
between the other two, which consist entirely of cacheable metadata
and have little or no data stored on disk.  But no:

x noholes
+ oseek
* truncate
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     71.300931      82.89801     77.044237     76.452097     4.3961747
+   5     97.931676     111.36284     108.52787      106.5687     5.5595081
Difference at 95.0% confidence
        30.1166 +/- 7.30929
        39.3928% +/- 9.56062%
        (Student's t, pooled s = 5.01171)
*   5     89.187745     101.74893     98.825988     97.497923     4.9188155
Difference at 95.0% confidence
        21.0458 +/- 6.80337
        27.5281% +/- 8.89887%
        (Student's t, pooled s = 4.66482)

(Values are the time to read 128 GiB of each file, which is more than
the available memory for caching on this server.  Measurements were
interleaved abcabcabcabcabc, but the server was neither rebooted nor
completely idle.  Lower numbers are better.)
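For reference, the three creation methods can be sketched at a much
smaller scale (64-MiB stand-ins for the 1-TiB originals; note that GNU
dd spells FreeBSD dd's "oseek=" as "seek="):

```shell
# Hedged sketch of the three creation methods, at 64 MiB instead of
# 1 TiB.  Run in a scratch directory.
cd "$(mktemp -d)"

# "test": dense file -- every block actually written out
dd if=/dev/zero of=test bs=1048576 count=64 2>/dev/null

# "test2": seek past 64 MiB and write one 1-MiB block, leaving a hole
# (FreeBSD dd calls this operand oseek=; GNU dd calls it seek=)
dd if=/dev/zero of=test2 bs=1048576 seek=64 count=1 2>/dev/null

# "test-truncate": pure metadata; no data blocks at all
truncate -s 64M test-truncate

ls -l test test2 test-truncate
```

As in the listing above, "test" and "test-truncate" end up with
identical logical sizes, and "test2" is one block larger because the
written block lands past the seek offset.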
Now consider the same test on an older, slower (but also completely
idle) server, running an older FreeBSD release but with an identical
zpool configuration (88 spindles plus log and cache SSDs):

FreeBSD nfs-backup-1.csail.mit.edu 9.1-RELEASE-p7 FreeBSD 9.1-RELEASE-p7
#34 r255669M: Fri Oct 11 13:04:40 EDT 2013
wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL amd64

Files were created in the same manner in a fresh, unsnapshotted
dataset on this server, and the same shell script was used to time the
reads (even though the older machine has significantly less memory):

x noholes2 (9.1)
+ oseek2 (9.1)
* truncate2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     82.813107     83.970546     83.527358     83.427247    0.43568711
+   5     47.283031     51.450488     49.063725     49.493986     1.6446236
Difference at 95.0% confidence
        -33.9333 +/- 1.75456
        -40.6741% +/- 2.10311%
        (Student's t, pooled s = 1.20304)
*   5     44.460431     46.148431     44.636276     44.937538    0.70509763
Difference at 95.0% confidence
        -38.4897 +/- 0.854768
        -46.1357% +/- 1.02457%
        (Student's t, pooled s = 0.586083)

This is more like what I was expecting!  But there's still this
anomaly:

x oseek2 (9.1)
+ truncate2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     47.283031     51.450488     49.063725     49.493986     1.6446236
+   5     44.460431     46.148431     44.636276     44.937538    0.70509763
Difference at 95.0% confidence
        -4.55645 +/- 1.84536
        -9.20606% +/- 3.72846%
        (Student's t, pooled s = 1.2653)

Why is it 9% faster to read the beginning of a 1-TiB file that was
created with truncate(1) than to read the beginning of a 1-TiB file
that was created with "dd oseek="?
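(The "same shell script" used to time the reads is not included in the
post; the interleaved a-b-c measurement loop it describes might look
roughly like the following hedged reconstruction, with tiny stand-in
files instead of the 1-TiB originals and a GNU-date millisecond timer:)

```shell
# Assumed reconstruction of an interleaved abcabc... timing loop; NOT
# the author's actual script.  Sizes are tiny stand-ins.
cd "$(mktemp -d)"

# three stand-in files, created as in the original experiment
dd if=/dev/zero of=test bs=1048576 count=4 2>/dev/null
dd if=/dev/zero of=test2 bs=1048576 seek=4 count=1 2>/dev/null
truncate -s 4M test-truncate

# five passes, interleaved a-b-c as in the measurements above
for i in 1 2 3 4 5; do
    for f in test test2 test-truncate; do
        start=$(date +%s%N)                  # GNU date, nanoseconds
        dd if="$f" of=/dev/null bs=1048576 2>/dev/null
        end=$(date +%s%N)
        echo "$f $(( (end - start) / 1000000 )) ms" >> runs.log
    done
done

cat runs.log
```

Interleaving the three files within each pass, rather than running
five passes of one file back-to-back, spreads any drift in background
load evenly across the three samples.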
Even more disturbing: the newer, faster server is in fact faster when
reading from disk:

x noholes (9.2)
+ noholes2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     71.300931      82.89801     77.044237     76.452097     4.3961747
+   5     82.813107     83.970546     83.527358     83.427247    0.43568711
Difference at 95.0% confidence
        6.97515 +/- 4.55587
        9.12356% +/- 5.95912%
        (Student's t, pooled s = 3.12379)

But the faster (9.2) server is much, much slower when reading from
holes:

x truncate (9.2)
+ truncate2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     89.187745     101.74893     98.825988     97.497923     4.9188155
+   5     44.460431     46.148431     44.636276     44.937538    0.70509763
Difference at 95.0% confidence
        -52.5604 +/- 5.1245
        -53.9092% +/- 5.25601%
        (Student's t, pooled s = 3.51368)

Repeating the same experiments in a fresh dataset with compression
turned on gives the expected results: ZFS automatically punches a big
hole in the "no holes" test file, and the read performance is
indistinguishable among all three creation methods.
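(Whether a filesystem has actually allocated blocks for a file --
i.e., whether a hole got punched -- can be checked by comparing the
logical size from ls(1) against the allocated size from du(1).  A
minimal, generic illustration, not specific to ZFS or to the datasets
above:)

```shell
# Compare logical size vs. allocated blocks for a dense and a sparse
# file of the same logical size.  Assumes any filesystem that supports
# sparse files.
cd "$(mktemp -d)"

# dense: 16 MiB of zero blocks actually written
dd if=/dev/zero of=dense bs=1048576 count=16 2>/dev/null

# sparse: same logical size, but only metadata
truncate -s 16M sparse

ls -l dense sparse   # identical logical sizes
du -k dense sparse   # allocated KiB: sparse is (near) zero
```

On a compressed ZFS dataset the "dense" file of zeros would also show
near-zero allocation, which is exactly the hole-punching effect
described above.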
Unfortunately, this means that the newer, faster 9.2 server slows all
accesses down to the speed of reading holes, whereas the older, slower
9.1 server speeds all accesses up:

x noholes+lzjb (9.2)
+ noholes2+lzjb (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5     100.23982     103.88802     102.84441     102.38784     1.4890753
+   5     43.747693     45.784714     44.914809     44.902244    0.82465096
Difference at 95.0% confidence
        -57.4856 +/- 1.75541
        -56.1449% +/- 1.71447%
        (Student's t, pooled s = 1.20362)

The 9.2 machine is unquestionably faster, at least when shoving data
around without touching the filesystem ("dd if=/dev/zero of=/dev/null"
with other parameters as in the filesystem-based tests):

x overhead (9.2)
+ overhead2 (9.1)
[ministat boxplot omitted]
    N           Min           Max        Median           Avg        Stddev
x   5      5.971316      6.357959      5.980432     6.0935198    0.17238797
+   5      8.015839      9.168025      8.948119     8.6249242    0.55062308
Difference at 95.0% confidence
        2.5314 +/- 0.595023
        41.5426% +/- 9.76484%
        (Student's t, pooled s = 0.407985)

What's going on here?

-GAWollman