From: Matthew Ahrens
To: Garrett Wollman
Cc: freebsd-fs
Date: Sat, 25 Jan 2014 11:43:28 -0800
Subject: Re: ZFS read performance anomaly / performance regression

How are you reading the files?  Could you be reading in less than
block-size chunks (default 128K)?

This sounds similar to the issue discussed in the thread with the
subject "ZFS slow reads for unallocated blocks", which compared sparse
vs. cached reads.  Since your storage is so fast, prefetch will probably
have read and cached the blocks by the time you issue the read()
syscall, so you may be hitting a similar issue.  To quote my conclusion
from the previous thread:

    Yes, this is also a problem on illumos, though much less so on my
    system, only about 2x.  It looks like the difference is due to the
    fact that the zeroed dbufs are not cached, so we have to zero the
    entire dbuf (e.g. 128k) for every read syscall (e.g. 8k).
    Increasing the size of the reads to match the recordsize results in
    performance parity between reading cached data and sparse zeros.
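To check, you could time the same file with two different read sizes,
something along these lines (a sketch, not from the original tests; the
file name and counts are illustrative):

    # 8 KiB reads: each read() syscall has to zero out a whole
    # 128 KiB dbuf if zeroed dbufs are not cached
    dd if=test-truncate of=/dev/null bs=8k count=1048576

    # 128 KiB reads, matching the default recordsize: each dbuf is
    # zeroed only once, so sparse reads should match cached reads
    dd if=test-truncate of=/dev/null bs=128k count=65536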
--matt

On Fri, Jan 24, 2014 at 10:07 PM, Garrett Wollman wrote:

> Consider three 1-TiB files, equal in dignity, all on the same ZFS
> dataset:
>
> -rw-r--r--  1 root  wheel  1099511627776 Jan  7 19:03 test
> -rw-r--r--  1 root  wheel  1099511627776 Jan 24 22:38 test-truncate
> -rw-r--r--  1 root  wheel  1099511628800 Jan 24 21:32 test2
>
> File "test" was created by writing zeros; "test2" was created by
> seeking to 1 TiB and writing a block of zeros; and "test-truncate" was
> created with "truncate -s 1T" (a command sketch follows the second set
> of measurements below).  Compression is disabled on this dataset, and
> "uname -a" reports:
>
> FreeBSD nfs-prod-3.csail.mit.edu 9.2-RELEASE-p2 FreeBSD 9.2-RELEASE-p2
> #2 r260414M: Tue Jan  7 17:32:43 EST 2014
> wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL  amd64
>
> A naive observer would expect reading "test" to be significantly
> slower than reading either of the other files, since it requires
> reading data from the disk, and would expect no significant difference
> between the other two, which consist entirely of cacheable metadata
> and have little or no data stored on disk.
>
> But no:
>
> x noholes
> + oseek
> * truncate
>
> [ministat distribution plot elided]
>
>     N           Min           Max        Median           Avg        Stddev
> x   5     71.300931      82.89801     77.044237     76.452097     4.3961747
> +   5     97.931676     111.36284     108.52787      106.5687     5.5595081
> Difference at 95.0% confidence
>         30.1166 +/- 7.30929
>         39.3928% +/- 9.56062%
>         (Student's t, pooled s = 5.01171)
> *   5     89.187745     101.74893     98.825988     97.497923     4.9188155
> Difference at 95.0% confidence
>         21.0458 +/- 6.80337
>         27.5281% +/- 8.89887%
>         (Student's t, pooled s = 4.66482)
>
> (Values are the time in seconds to read 128 GiB of each file, which is
> more than the available memory for caching on this server.
> Measurements were interleaved abcabcabcabcabc, but the server was
> neither rebooted nor completely idle between runs.  Lower numbers are
> better.)
>
> Now consider the same test on an older, slower (but also completely
> idle) server, running an older FreeBSD release, but with an identical
> zpool configuration (88 spindles plus log and cache SSDs):
>
> FreeBSD nfs-backup-1.csail.mit.edu 9.1-RELEASE-p7 FreeBSD 9.1-RELEASE-p7
> #34 r255669M: Fri Oct 11 13:04:40 EDT 2013
> wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL  amd64
>
> Files were created in the same manner in a fresh, unsnapshotted
> dataset on this server, and the same shell script was used to time the
> reads (even though the older machine has significantly less memory):
>
> x noholes2 (9.1)
> + oseek2 (9.1)
> * truncate2 (9.1)
>
> [ministat distribution plot elided]
>
>     N           Min           Max        Median           Avg        Stddev
> x   5     82.813107     83.970546     83.527358     83.427247    0.43568711
> +   5     47.283031     51.450488     49.063725     49.493986     1.6446236
> Difference at 95.0% confidence
>         -33.9333 +/- 1.75456
>         -40.6741% +/- 2.10311%
>         (Student's t, pooled s = 1.20304)
> *   5     44.460431     46.148431     44.636276     44.937538    0.70509763
> Difference at 95.0% confidence
>         -38.4897 +/- 0.854768
>         -46.1357% +/- 1.02457%
>         (Student's t, pooled s = 0.586083)
>
> This is more like what I was expecting!
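> (For concreteness, the three files can be recreated with commands
> along these lines; the dd block sizes are one plausible choice
> consistent with the byte counts shown above:)
>
> # "test": 1 TiB of explicit zeros
> dd if=/dev/zero of=test bs=1m count=1048576
>
> # "test2": seek 1 TiB into the file, then write one 1 KiB block,
> # giving the 1099511628800-byte size listed above
> dd if=/dev/zero of=test2 bs=1k oseek=1073741824 count=1
>
> # "test-truncate": a pure hole, no data blocks at all
> truncate -s 1T test-truncate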
> But there's still this anomaly:
>
> x oseek2 (9.1)
> + truncate2 (9.1)
>
> [ministat distribution plot elided]
>
>     N           Min           Max        Median           Avg        Stddev
> x   5     47.283031     51.450488     49.063725     49.493986     1.6446236
> +   5     44.460431     46.148431     44.636276     44.937538    0.70509763
> Difference at 95.0% confidence
>         -4.55645 +/- 1.84536
>         -9.20606% +/- 3.72846%
>         (Student's t, pooled s = 1.2653)
>
> Why is it 9% faster to read the beginning of a 1 TiB file that was
> created with truncate(1) than to read the beginning of a 1 TiB file
> that was created with "dd oseek="?
>
> Even more disturbing: the newer, faster server is in fact faster when
> reading from disk:
>
> x noholes (9.2)
> + noholes2 (9.1)
>
> [ministat distribution plot elided]
>
>     N           Min           Max        Median           Avg        Stddev
> x   5     71.300931      82.89801     77.044237     76.452097     4.3961747
> +   5     82.813107     83.970546     83.527358     83.427247    0.43568711
> Difference at 95.0% confidence
>         6.97515 +/- 4.55587
>         9.12356% +/- 5.95912%
>         (Student's t, pooled s = 3.12379)
>
> But the faster (9.2) server is much, much slower when reading from
> holes:
>
> x truncate (9.2)
> + truncate2 (9.1)
>
> [ministat distribution plot elided]
>
>     N           Min           Max        Median           Avg        Stddev
> x   5     89.187745     101.74893     98.825988     97.497923     4.9188155
> +   5     44.460431     46.148431     44.636276     44.937538    0.70509763
> Difference at 95.0% confidence
>         -52.5604 +/- 5.1245
>         -53.9092% +/- 5.25601%
>         (Student's t, pooled s = 3.51368)
>
> Repeating the same experiments in a fresh dataset with compression
> turned on gives the expected results: ZFS automatically punches a big
> hole in the "no holes" test file, and read performance is
> indistinguishable among all three creation methods.
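> (One way to see the hole punching directly: with compression enabled,
> ZFS stores blocks of all zeros as holes rather than allocated data, so
> a file of written zeros occupies almost no space on disk.  A sketch,
> with "pool/fs" standing in for the real dataset name:)
>
> zfs set compression=lzjb pool/fs
> dd if=/dev/zero of=test bs=1m count=1048576
> du -h test    # on-disk usage is tiny: the all-zero blocks were
>               # stored as holes, not data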
> Unfortunately, this means that the newer, faster 9.2 server slows all
> accesses down to the speed of reading holes, whereas the older, slower
> 9.1 server speeds all accesses up:
>
> x noholes+lzjb (9.2)
> + noholes2+lzjb (9.1)
>
> [ministat distribution plot elided]
>
>     N           Min           Max        Median           Avg        Stddev
> x   5     100.23982     103.88802     102.84441     102.38784     1.4890753
> +   5     43.747693     45.784714     44.914809     44.902244    0.82465096
> Difference at 95.0% confidence
>         -57.4856 +/- 1.75541
>         -56.1449% +/- 1.71447%
>         (Student's t, pooled s = 1.20362)
>
> The 9.2 machine is unquestionably faster, at least when shoving data
> around without touching the filesystem ("dd if=/dev/zero of=/dev/null"
> with the other parameters as for the filesystem-based tests):
>
> x overhead (9.2)
> + overhead2 (9.1)
>
> [ministat distribution plot elided]
>
>     N           Min           Max        Median           Avg        Stddev
> x   5      5.971316      6.357959      5.980432     6.0935198    0.17238797
> +   5      8.015839      9.168025      8.948119     8.6249242    0.55062308
> Difference at 95.0% confidence
>         2.5314 +/- 0.595023
>         41.5426% +/- 9.76484%
>         (Student's t, pooled s = 0.407985)
>
> What's going on here?
>
> -GAWollman