Date:        Thu, 11 Apr 2013 11:23:21 -0700
From:        Jeremy Chadwick <jdc@koitsu.org>
To:          Adam Nowacki <nowakpl@platinum.linux.pl>
Cc:          freebsd-fs@freebsd.org
Subject:     Re: ZFS slow reads for unallocated blocks
Message-ID:  <20130411182321.GA57336@icarus.home.lan>
In-Reply-To: <20130411171428.GA56127@icarus.home.lan>
References:  <5166EA43.7050700@platinum.linux.pl> <20130411171428.GA56127@icarus.home.lan>
On Thu, Apr 11, 2013 at 10:14:28AM -0700, Jeremy Chadwick wrote:
> On Thu, Apr 11, 2013 at 06:52:19PM +0200, Adam Nowacki wrote:
> > This one is quite weird - reads from files that were created and
> > resized with ftruncate (so no actual data on disk) are considerably
> > slower and use more CPU time than files with data. If compression is
> > enabled this will also affect files with long runs of zeroes as ZFS
> > won't write any data to disk in this case. There is no I/O on the
> > pool during the read tests - all fits into 10GB ARC.
> >
> > FreeBSD storage 9.1-RELEASE FreeBSD 9.1-RELEASE #0: Sat Feb 23
> > 15:51:26 UTC 2013   root@storage:/usr/obj/usr/src/sys/GENERIC
> > amd64
> >
> > Mem: 264M Active, 82M Inact, 12G Wired, 100M Cache, 13M Buf, 3279M Free
> > Swap: 5120M Total, 5120M Free
> >
> > # zfs create -o atime=off -o recordsize=128k -o compression=off -o
> > mountpoint=/home/testfs home/testfs
> >
> > --- truncated file:
> >
> > # time truncate -s 1G /home/testfs/trunc1g
> > 0.000u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
> >
> > # time dd if=/home/testfs/trunc1g of=/dev/null bs=1024k
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 0.434817 secs (2469410435 bytes/sec)
> > 0.000u 0.435s 0:00.43 100.0% 26+2813k 0+0io 0pf+0w
> >
> > # time dd if=/home/testfs/trunc1g of=/dev/null bs=16k
> > 65536+0 records in
> > 65536+0 records out
> > 1073741824 bytes transferred in 3.809560 secs (281854564 bytes/sec)
> > 0.000u 3.779s 0:03.81 98.9% 25+2755k 0+0io 0pf+0w
> >
> > # time cat /home/testfs/trunc1g > /dev/null
> > 0.070u 14.031s 0:14.19 99.3% 15+2755k 0+0io 0pf+0w
> >        ^^^^^^^ 14 seconds compared to 1 second for random data
> >
> > --- file filled with zeroes:
> >
> > # time dd if=/dev/zero of=/home/testfs/zero1g bs=1024k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 2.375426 secs (452020732 bytes/sec)
> > 0.000u 0.525s 0:02.37 21.9% 23+2533k 1+1io 0pf+0w
> >
> > # time dd if=/home/testfs/zero1g of=/dev/null bs=1024k
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 0.199078 secs (5393571244 bytes/sec)
> > 0.000u 0.200s 0:00.20 100.0% 26+2808k 0+0io 0pf+0w
> >
> > # time dd if=/home/testfs/zero1g of=/dev/null bs=16k
> > 65536+0 records in
> > 65536+0 records out
> > 1073741824 bytes transferred in 0.436472 secs (2460046434 bytes/sec)
> > 0.015u 0.421s 0:00.43 100.0% 26+2813k 0+0io 0pf+0w
> >
> > # time cat /home/testfs/zero1g > /dev/null
> > 0.023u 1.156s 0:01.18 99.1% 15+2779k 0+0io 0pf+0w
> >
> > --- file filled with random bytes:
> >
> > # time dd if=/dev/random of=/home/testfs/random1g bs=1024k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 16.116569 secs (66623474 bytes/sec)
> > 0.000u 13.214s 0:16.11 81.9% 25+2750k 0+1io 0pf+0w
> >
> > # time dd if=/home/testfs/random1g of=/dev/null bs=1024k
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 0.207115 secs (5184280044 bytes/sec)
> > 0.000u 0.208s 0:00.20 100.0% 26+2808k 0+0io 0pf+0w
> >
> > # time dd if=/home/testfs/random1g of=/dev/null bs=16k
> > 65536+0 records in
> > 65536+0 records out
> > 1073741824 bytes transferred in 0.432518 secs (2482536705 bytes/sec)
> > 0.023u 0.409s 0:00.43 97.6% 26+2828k 0+0io 0pf+0w
> >
> > # time cat /home/testfs/random1g > /dev/null
> > 0.031u 1.053s 0:01.08 100.0% 15+2770k 0+0io 0pf+0w
> >
> > --- compression on:
> >
> > # zfs create -o atime=off -o recordsize=128k -o compression=lzjb -o
> > mountpoint=/home/testfs home/testfs
> >
> > --- file filled with zeroes:
> >
> > # time dd if=/dev/zero of=/home/testfs/zero1g bs=1024k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 1.007765 secs (1065468404 bytes/sec)
> > 0.000u 0.458s 0:01.01 44.5% 26+2880k 1+1io 0pf+0w
> >
> > # time dd if=/home/testfs/zero1g of=/dev/null bs=1024k
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 0.630737 secs (1702360431 bytes/sec)
> > 0.000u 0.630s 0:00.63 100.0% 25+2742k 0+0io 0pf+0w
> >
> > # time dd if=/home/testfs/zero1g of=/dev/null bs=16k
> > 65536+0 records in
> > 65536+0 records out
> > 1073741824 bytes transferred in 4.089175 secs (262581530 bytes/sec)
> > 0.015u 4.036s 0:04.09 98.7% 25+2758k 0+0io 0pf+0w
> >
> > # time cat /home/testfs/zero1g > /dev/null
> > 0.031u 15.863s 0:15.95 99.6% 15+2754k 0+0io 0pf+0w
> >        ^^^^^^^
> >
> > --- it appears recordsize has a huge effect on this (recordsize=32k):
> >
> > # zfs create -o atime=off -o recordsize=32k -o compression=off -o
> > mountpoint=/home/testfs home/testfs
> >
> > # time truncate -s 1G testfs/trunc1g
> > 0.000u 0.000s 0:00.01 0.0% 0+0k 1+0io 0pf+0w
> >
> > # time cat /home/testfs/trunc1g > /dev/null
> > 0.047u 5.842s 0:05.93 99.1% 15+2761k 0+0io 0pf+0w
> >        ^^^^^^
> >
> > --- recordsize=4k:
> >
> > # zfs create -o atime=off -o recordsize=4k -o compression=off -o
> > mountpoint=/home/testfs home/testfs
> >
> > # time truncate -s 1G testfs/trunc1g
> > 0.000u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
> >
> > # time cat /home/testfs/trunc1g > /dev/null
> > 0.047u 1.441s 0:01.52 97.3% 15+2768k 0+0io 0pf+0w
> >        ^^^^^^
>
> Take a look at src/bin/cat/cat.c, function raw_cat().
>
> Therein lies the answer.
>
> TL;DR -- cat != dd.

I wanted to follow up on this, because I received an off-list private
flame basically telling me to shut the fuck up.

Compression has nothing to do with this.  recordsize plays a role for
what should become obvious reasons -- keep reading.

Again: cat is not dd.  cat has its own set of logic for how it
calculates the "optimal block size" to use when calling read(2).  dd
operates differently and lets you set the block size explicitly with
bs.

Below is a run of a modified cat binary (work/src/bin/cat/cat) which
prints out the relevant "calculation" bits within raw_cat().  And yes,
$cwd is a ZFS filesystem with the default recordsize (128KB):

$ truncate -s 1g testfile

$ time work/src/bin/cat/cat testfile > /dev/null
raw_cat()
PHYSPAGES_THRESHOLD: 32768
BUFSIZE_MAX: 2097152
BUFSIZE_SMALL: 131072
_SC_PHYS_PAGES: 2088077
bsize = 4096

real    0m13.067s
user    0m0.000s
sys     0m13.070s

cat, in my case, ends up issuing read(2) calls with a 4096-byte buffer
-- a 4KByte block size.
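For anyone who doesn't want to open cat.c themselves, the buffer-size
selection looks roughly like the sketch below.  To be clear, this is my
own paraphrase built around the constants printed above, not a verbatim
paste of raw_cat(); go read the real thing if you care about exact
details.  As far as I can tell, the deciding factor is that the
fstat(2) is done on the descriptor cat *writes* to, so with stdout
redirected to /dev/null (not a regular file) the size falls back to
MAX(st_blksize, PAGESIZE) = 4096 even though _SC_PHYS_PAGES is far
above the threshold:

/*
 * Rough sketch of raw_cat()'s buffer-size selection.  Paraphrased for
 * illustration -- NOT copied verbatim from src/bin/cat/cat.c.  The
 * constants match what my instrumented cat prints above.
 */
#include <sys/param.h>          /* MAXPHYS, MIN, MAX */
#include <sys/stat.h>
#include <stdio.h>
#include <unistd.h>

#define PHYSPAGES_THRESHOLD     (32 * 1024)
#define BUFSIZE_MAX             (2 * 1024 * 1024)
#define BUFSIZE_SMALL           (MAXPHYS)

static size_t
pick_bsize(int wfd)
{
        struct stat sbuf;
        size_t bsize;

        if (fstat(wfd, &sbuf) == -1)
                return (0);
        if (S_ISREG(sbuf.st_mode)) {
                /* Output is a regular file: use a large copy buffer. */
                if (sysconf(_SC_PHYS_PAGES) > PHYSPAGES_THRESHOLD)
                        bsize = MIN(BUFSIZE_MAX, MAXPHYS * 8);
                else
                        bsize = BUFSIZE_SMALL;
        } else {
                /* /dev/null, pipes, ttys: st_blksize or the page size. */
                bsize = MAX(sbuf.st_blksize,
                    (blksize_t)sysconf(_SC_PAGESIZE));
        }
        return (bsize);
}

int
main(void)
{
        /* Report on stderr so it shows even when stdout is /dev/null. */
        fprintf(stderr, "cat would read(2) in %zu-byte chunks\n",
            pick_bsize(fileno(stdout)));
        return (0);
}

Run that with stdout pointed at /dev/null versus at a regular file and
the two branches, and the resulting buffer sizes, become obvious.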
Now let's use dd with the same block size (bs=4096):

$ time dd if=testfile of=/dev/null bs=4k
262144+0 records in
262144+0 records out
1073741824 bytes transferred in 13.031543 secs (82395601 bytes/sec)

real    0m13.033s
user    0m0.023s
sys     0m13.000s

Increase the block size given to dd (from 4k up to 8k) and the speed
roughly doubles:

$ time dd if=testfile of=/dev/null bs=8k
131072+0 records in
131072+0 records out
1073741824 bytes transferred in 6.545117 secs (164052352 bytes/sec)

real    0m6.546s
user    0m0.026s
sys     0m6.519s

Now let's move to a UFS2+SU filesystem, where the behaviour is
different:

$ truncate -s 1g /tmp/testfile

$ time work/src/bin/cat/cat /tmp/testfile > /dev/null
raw_cat()
PHYSPAGES_THRESHOLD: 32768
BUFSIZE_MAX: 2097152
BUFSIZE_SMALL: 131072
_SC_PHYS_PAGES: 2088077
bsize = 4096

real    0m1.191s
user    0m0.031s
sys     0m1.159s

$ time dd if=/tmp/testfile of=/dev/null bs=4k
262144+0 records in
262144+0 records out
1073741824 bytes transferred in 0.898440 secs (1195117846 bytes/sec)

real    0m0.900s
user    0m0.039s
sys     0m0.860s

$ time dd if=/tmp/testfile of=/dev/null bs=8k
131072+0 records in
131072+0 records out
1073741824 bytes transferred in 0.768045 secs (1398019512 bytes/sec)

real    0m0.769s
user    0m0.024s
sys     0m0.745s

My conclusion is that ZFS handles reads of sparse/truncated files very
differently than UFS does.  Those familiar with the internals of ZFS
can probably explain this behaviour.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
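P.S. If anyone wants to reproduce this with cat and dd taken out of the
picture entirely, something along the lines of the sketch below should
show the same effect.  This is just an illustration I'm adding for
completeness -- the file name and the two buffer sizes are arbitrary --
but it follows Adam's recipe: create the hole with ftruncate(2), read
it back with a small and a large buffer, and run it once on a ZFS
dataset and once on UFS:

/*
 * Minimal repro sketch: make a 1GB sparse file via ftruncate(2), then
 * time sequential read(2) passes with two different buffer sizes.
 * Illustration only -- path and buffer sizes are arbitrary choices.
 */
#include <sys/time.h>
#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FILESIZE        ((off_t)1 << 30)        /* 1GB, all hole */

static double
read_all(const char *path, size_t bsize)
{
        struct timeval t0, t1;
        ssize_t n;
        char *buf;
        int fd;

        if ((buf = malloc(bsize)) == NULL)
                err(1, "malloc");
        if ((fd = open(path, O_RDONLY)) == -1)
                err(1, "open");
        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, bsize)) > 0)
                ;                       /* discard, same as > /dev/null */
        if (n == -1)
                err(1, "read");
        gettimeofday(&t1, NULL);
        close(fd);
        free(buf);
        return ((t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
}

int
main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : "trunc1g";
        int fd;

        /* Create the hole: no data blocks are ever written. */
        if ((fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644)) == -1)
                err(1, "open");
        if (ftruncate(fd, FILESIZE) == -1)
                err(1, "ftruncate");
        close(fd);

        printf("4k reads:   %.3f secs\n", read_all(path, 4096));
        printf("128k reads: %.3f secs\n", read_all(path, 128 * 1024));
        return (0);
}

If the numbers above are anything to go by, the 4k pass should be the
slow one on ZFS while both passes stay fast on UFS.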